diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 00000000..e69de29b
diff --git a/cache.json b/cache.json
new file mode 100644
index 00000000..d77e0d7e
--- /dev/null
+++ b/cache.json
@@ -0,0 +1 @@
+{"2023-12-18T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2312.09494v2","updated":"2023-12-18T02:50:02Z","published":"2023-12-15T02:42:05Z","title":"No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based\n Language Models","summary":" To reduce the computation cost and the energy consumption in large language\nmodels (LLM), skimming-based acceleration dynamically drops unimportant tokens\nof the input sequence progressively along layers of the LLM while preserving\nthe tokens of semantic importance. However, our work for the first time reveals\nthe acceleration may be vulnerable to Denial-of-Service (DoS) attacks. In this\npaper, we propose No-Skim, a general framework to help the owners of\nskimming-based LLM to understand and measure the robustness of their\nacceleration scheme. Specifically, our framework searches minimal and\nunnoticeable perturbations at character-level and token-level to generate\nadversarial inputs that sufficiently increase the remaining token ratio, thus\nincreasing the computation cost and energy consumption. We systematically\nevaluate the vulnerability of the skimming acceleration in various LLM\narchitectures including BERT and RoBERTa on the GLUE benchmark. In the worst\ncase, the perturbation found by No-Skim substantially increases the running\ncost of LLM by over 145% on average. Moreover, No-Skim extends the evaluation\nframework to various scenarios, making the evaluation conductible with\ndifferent level of knowledge.\n","authors":["Shengyao Zhang","Mi Zhang","Xudong Pan","Min Yang"],"pdf_url":"https://arxiv.org/pdf/2312.09494v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03360v2","updated":"2023-12-18T01:43:56Z","published":"2023-12-06T08:55:55Z","title":"Teaching Specific Scientific Knowledge into Large Language Models\n through Additional Training","summary":" Through additional training, we explore embedding specialized scientific\nknowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that\neffective knowledge integration requires reading texts from multiple\nperspectives, especially in instructional formats. We utilize text augmentation\nto tackle the scarcity of specialized texts, including style conversions and\ntranslations. Hyperparameter optimization proves crucial, with different size\nmodels (7b, 13b, and 70b) reasonably undergoing additional training. Validating\nour methods, we construct a dataset of 65,000 scientific papers. Although we\nhave succeeded in partially embedding knowledge, the study highlights the\ncomplexities and limitations of incorporating specialized information into\nLLMs, suggesting areas for further improvement.\n","authors":["Kan Hatakeyama-Sato","Yasuhiko Igarashi","Shun Katakami","Yuta Nabae","Teruaki Hayakawa"],"pdf_url":"https://arxiv.org/pdf/2312.03360v2.pdf","comment":"added token information for some texts, and fixed typo"},{"id":"http://arxiv.org/abs/2312.09979v2","updated":"2023-12-18T16:46:13Z","published":"2023-12-15T17:45:06Z","title":"LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World\n Knowledge in Language Model Alignment","summary":" Supervised fine-tuning (SFT) is a crucial step for large language models\n(LLMs), enabling them to align with human instructions and enhance their\ncapabilities in downstream tasks. When the models are required to align with a\nbroader range of downstream tasks, or there is a desire to notably improve the\nperformance on a specific task, a substantial increase in fine-tuning data\noften emerges as the solution. However, we find that large-scale increases in\ninstruction data can disrupt the world knowledge previously stored in the LLMs,\ni.e., world knowledge forgetting. In this paper, we introduce LoRAMoE to\naddress the above challenge. The LoRAMoE is a plugin version of Mixture of\nExperts (MoE). The plugin form ensures the integrity of world knowledge by\nfreezing the backbone model during the training phase. We then propose the use\nof localized balancing constraints to coordinate parts of experts for task\nutilization, meanwhile enabling other experts to fully leverage the world\nknowledge stored in the models. Experimental results demonstrate that LoRAMoE\ncan reasonably coordinate experts based on data type during inference, and even\ndramatically increasing instruction data does not result in knowledge\nforgetting. Moreover, LoRAMoE provides additional benefits for the performance\nof downstream tasks, indicating the potential of our approach for multi-task\nlearning.\n","authors":["Shihan Dou","Enyu Zhou","Yan Liu","Songyang Gao","Jun Zhao","Wei Shen","Yuhao Zhou","Zhiheng Xi","Xiao Wang","Xiaoran Fan","Shiliang Pu","Jiang Zhu","Rui Zheng","Tao Gui","Qi Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2312.09979v2.pdf","comment":"17 pages, 7 figures"},{"id":"http://arxiv.org/abs/2311.15786v4","updated":"2023-12-18T08:08:32Z","published":"2023-11-27T13:01:59Z","title":"YUAN 2.0: A Large Language Model with Localized Filtering-based\n Attention","summary":" In this work, we develop and release Yuan 2.0, a series of large language\nmodels with parameters ranging from 2.1 billion to 102.6 billion. The Localized\nFiltering-based Attention (LFA) is introduced to incorporate prior knowledge of\nlocal dependencies of natural language into Attention. A data filtering and\ngenerating system is presented to build pre-training and fine-tuning dataset in\nhigh quality. A distributed training method with non-uniform pipeline parallel,\ndata parallel, and optimizer parallel is proposed, which greatly reduces the\nbandwidth requirements of intra-node communication, and achieves good\nperformance in large-scale distributed training. Yuan 2.0 models display\nimpressive ability in code generation, math problem-solving, and chatting\ncompared with existing models. The latest version of YUAN 2.0, including model\nweights and source code, is accessible at Github.\n","authors":["Shaohua Wu","Xudong Zhao","Shenling Wang","Jiangang Luo","Lingjun Li","Xi Chen","Bing Zhao","Wei Wang","Tong Yu","Rongguo Zhang","Jiahua Zhang","Chao Wang"],"pdf_url":"https://arxiv.org/pdf/2311.15786v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.02775v3","updated":"2023-12-18T23:23:06Z","published":"2023-11-05T21:43:02Z","title":"AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using\n Open-Source LLMs","summary":" Responding to the thousands of student questions on online QA platforms each\nsemester has a considerable human cost, particularly in computing courses with\nrapidly growing enrollments. To address the challenges of scalable and\nintelligent question-answering (QA), we introduce an innovative solution that\nleverages open-source Large Language Models (LLMs) from the LLaMA-2 family to\nensure data privacy. Our approach combines augmentation techniques such as\nretrieval augmented generation (RAG), supervised fine-tuning (SFT), and\nlearning from human preferences data using Direct Preference Optimization\n(DPO). Through extensive experimentation on a Piazza dataset from an\nintroductory CS course, comprising 10,000 QA pairs and 1,500 pairs of\npreference data, we demonstrate a significant 30% improvement in the quality of\nanswers, with RAG being a particularly impactful addition. Our contributions\ninclude the development of a novel architecture for educational QA, extensive\nevaluations of LLM performance utilizing both human assessments and LLM-based\nmetrics, and insights into the challenges and future directions of educational\ndata processing. This work paves the way for the development of AI-TA, an\nintelligent QA assistant customizable for courses with an online QA platform\n","authors":["Yann Hicke","Anmol Agarwal","Qianou Ma","Paul Denny"],"pdf_url":"https://arxiv.org/pdf/2311.02775v3.pdf","comment":"Updates for camera-ready submission"},{"id":"http://arxiv.org/abs/2309.15016v2","updated":"2023-12-18T21:43:01Z","published":"2023-09-26T15:36:29Z","title":"Question-Answering Approach to Evaluating Legal Summaries","summary":" Traditional evaluation metrics like ROUGE compare lexical overlap between the\nreference and generated summaries without taking argumentative structure into\naccount, which is important for legal summaries. In this paper, we propose a\nnovel legal summarization evaluation framework that utilizes GPT-4 to generate\na set of question-answer pairs that cover main points and information in the\nreference summary. GPT-4 is then used to generate answers based on the\ngenerated summary for the questions from the reference summary. Finally, GPT-4\ngrades the answers from the reference summary and the generated summary. We\nexamined the correlation between GPT-4 grading with human grading. The results\nsuggest that this question-answering approach with GPT-4 can be a useful tool\nfor gauging the quality of the summary.\n","authors":["Huihui Xu","Kevin Ashley"],"pdf_url":"https://arxiv.org/pdf/2309.15016v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11720v1","updated":"2023-12-18T21:42:34Z","published":"2023-12-18T21:42:34Z","title":"Assessing Logical Reasoning Capabilities of Encoder-Only Transformer\n Models","summary":" Logical reasoning is central to complex human activities, such as thinking,\ndebating, and planning; it is also a central component of many AI systems as\nwell. In this paper, we investigate the extent to which encoder-only\ntransformer language models (LMs) can reason according to logical rules. We ask\nwhether those LMs can deduce theorems in propositional calculus and first-order\nlogic; if their relative success in these problems reflects general logical\ncapabilities; and which layers contribute the most to the task. First, we show\nfor several encoder-only LMs that they can be trained, to a reasonable degree,\nto determine logical validity on various datasets. Next, by cross-probing\nfine-tuned models on these datasets, we show that LMs have difficulty in\ntransferring their putative logical reasoning ability, which suggests that they\nmay have learned dataset-specific features, instead of a general capability.\nFinally, we conduct a layerwise probing experiment, which shows that the\nhypothesis classification task is mostly solved through higher layers.\n","authors":["Paulo Pirozelli","Marcos M. José","Paulo de Tarso P. Filho","Anarosa A. F. Brandão","Fabio G. Cozman"],"pdf_url":"https://arxiv.org/pdf/2312.11720v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03498v2","updated":"2023-12-18T21:09:51Z","published":"2023-11-06T20:13:29Z","title":"In-Context Exemplars as Clues to Retrieving from Large Associative\n Memory","summary":" Recently, large language models (LLMs) have made remarkable progress in\nnatural language processing. The most representative ability of LLMs is\nin-context learning (ICL), which enables LLMs to learn patterns from in-context\nexemplars without training. The performance of ICL greatly depends on the\nexemplars used. However, how to choose exemplars remains unclear due to the\nlack of understanding of how in-context learning works. In this paper, we\npresent a novel perspective on ICL by conceptualizing it as contextual\nretrieval from a model of associative memory. We establish a theoretical\nframework of ICL based on Hopfield Networks. Based on our framework, we look\ninto how in-context exemplars influence the performance of ICL and propose more\nefficient active exemplar selection. Our study sheds new light on the mechanism\nof ICL by connecting it to memory retrieval, with potential implications for\nadvancing the understanding of LLMs.\n","authors":["Jiachen Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.03498v2.pdf","comment":"Presented at Neural Conversational AI @ ICML 2023 and Associative\n Memory & Hopfield Networks @ NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.11703v1","updated":"2023-12-18T21:03:46Z","published":"2023-12-18T21:03:46Z","title":"Shaping Political Discourse using multi-source News Summarization","summary":" Multi-document summarization is the process of automatically generating a\nconcise summary of multiple documents related to the same topic. This summary\ncan help users quickly understand the key information from a large collection\nof documents. Multi-document summarization systems are more complex than\nsingle-document summarization systems due to the need to identify and combine\ninformation from multiple sources. In this paper, we have developed a machine\nlearning model that generates a concise summary of a topic from multiple news\ndocuments. The model is designed to be unbiased by sampling its input equally\nfrom all the different aspects of the topic, even if the majority of the news\nsources lean one way.\n","authors":["Charles Rajan","Nishit Asnani","Shreya Singh"],"pdf_url":"https://arxiv.org/pdf/2312.11703v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11701v1","updated":"2023-12-18T20:58:58Z","published":"2023-12-18T20:58:58Z","title":"Opportunities and Challenges of Applying Large Language Models in\n Building Energy Efficiency and Decarbonization Studies: An Exploratory\n Overview","summary":" In recent years, the rapid advancement and impressive capabilities of Large\nLanguage Models (LLMs) have been evident across various domains. This paper\nexplores the application, implications, and potential of LLMs in building\nenergy efficiency and decarbonization studies. The wide-ranging capabilities of\nLLMs are examined in the context of the building energy field, including\nintelligent control systems, code generation, data infrastructure, knowledge\nextraction, and education. Despite the promising potential of LLMs, challenges\nincluding complex and expensive computation, data privacy, security and\ncopyright, complexity in fine-tuned LLMs, and self-consistency are discussed.\nThe paper concludes with a call for future research focused on the enhancement\nof LLMs for domain-specific tasks, multi-modal LLMs, and collaborative research\nbetween AI and energy experts.\n","authors":["Liang Zhang","Zhelun Chen"],"pdf_url":"https://arxiv.org/pdf/2312.11701v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11681v1","updated":"2023-12-18T20:01:58Z","published":"2023-12-18T20:01:58Z","title":"Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows","summary":" LLM chains enable complex tasks by decomposing work into a sequence of\nsub-tasks. Crowdsourcing workflows similarly decompose complex tasks into\nsmaller tasks for human crowdworkers. Chains address LLM errors analogously to\nthe way crowdsourcing workflows address human error. To characterize\nopportunities for LLM chaining, we survey 107 papers across the crowdsourcing\nand chaining literature to construct a design space for chain development. The\ndesign space connects an LLM designer's objectives to strategies they can use\nto achieve those objectives, and tactics to implement each strategy. To explore\nhow techniques from crowdsourcing may apply to chaining, we adapt crowdsourcing\nworkflows to implement LLM chains across three case studies: creating a\ntaxonomy, shortening text, and writing a short story. From the design space and\nour case studies, we identify which techniques transfer from crowdsourcing to\nLLM chaining and raise implications for future research and development.\n","authors":["Madeleine Grunde-McLaughlin","Michelle S. Lam","Ranjay Krishna","Daniel S. Weld","Jeffrey Heer"],"pdf_url":"https://arxiv.org/pdf/2312.11681v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11671v1","updated":"2023-12-18T19:27:09Z","published":"2023-12-18T19:27:09Z","title":"Evaluating Language-Model Agents on Realistic Autonomous Tasks","summary":" In this report, we explore the ability of language model agents to acquire\nresources, create copies of themselves, and adapt to novel challenges they\nencounter in the wild. We refer to this cluster of capabilities as \"autonomous\nreplication and adaptation\" or ARA. We believe that systems capable of ARA\ncould have wide-reaching and hard-to-anticipate consequences, and that\nmeasuring and forecasting ARA may be useful for informing measures around\nsecurity, monitoring, and alignment. Additionally, once a system is capable of\nARA, placing bounds on a system's capabilities may become significantly more\ndifficult.\n We construct four simple example agents that combine language models with\ntools that allow them to take actions in the world. We then evaluate these\nagents on 12 tasks relevant to ARA. We find that these language model agents\ncan only complete the easiest tasks from this list, although they make some\nprogress on the more challenging tasks. Unfortunately, these evaluations are\nnot adequate to rule out the possibility that near-future agents will be\ncapable of ARA. In particular, we do not think that these evaluations provide\ngood assurance that the ``next generation'' of language models (e.g. 100x\neffective compute scaleup on existing models) will not yield agents capable of\nARA, unless intermediate evaluations are performed during pretraining.\nRelatedly, we expect that fine-tuning of the existing models could produce\nsubstantially more competent agents, even if the fine-tuning is not directly\ntargeted at ARA.\n","authors":["Megan Kinniment","Lucas Jun Koba Sato","Haoxing Du","Brian Goodrich","Max Hasin","Lawrence Chan","Luke Harold Miles","Tao R. Lin","Hjalmar Wijk","Joel Burget","Aaron Ho","Elizabeth Barnes","Paul Christiano"],"pdf_url":"https://arxiv.org/pdf/2312.11671v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2312.11462v1","updated":"2023-12-18T18:59:46Z","published":"2023-12-18T18:59:46Z","title":"Cascade Speculative Drafting for Even Faster LLM Inference","summary":" Speculative decoding enhances the efficiency of large language models (LLMs)\nby leveraging a draft model to draft for a larger target model to review.\nHowever, drafting in speculative decoding involves slow autoregressive\ngeneration and generating tokens of different importance with the same time\nallocation. These two inefficiencies lead to its suboptimal performance. To\naddress this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a\nnovel approach that employs two types of cascades. The Vertical Cascade\neliminates autoregressive generation from neural models. The Horizontal Cascade\nconstitutes efficient time allocation in drafting with its optimality supported\nby our theoretical analysis. Combining both cascades, our CS. Drafting\nalgorithm has achieved up to 72 percent additional speedup over speculative\ndecoding in our experiments while keeping the same output distribution.\n","authors":["Ziyi Chen","Xiaocong Yang","Jiacheng Lin","Chenkai Sun","Jie Huang","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2312.11462v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17333v2","updated":"2023-12-18T18:55:49Z","published":"2023-10-26T11:59:45Z","title":"Arabic Fine-Grained Entity Recognition","summary":" Traditional NER systems are typically trained to recognize coarse-grained\nentities, and less attention is given to classifying entities into a hierarchy\nof fine-grained lower-level subtypes. This article aims to advance Arabic NER\nwith fine-grained entities. We chose to extend Wojood (an open-source Nested\nArabic Named Entity Corpus) with subtypes. In particular, four main entity\ntypes in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),\nand facility (FAC), are extended with 31 subtypes. To do this, we first revised\nWojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's\nACE guidelines, which yielded 5, 614 changes. Second, all mentions of GPE, LOC,\nORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE\nsub-types. We refer to this extended version of Wojood as WojoodF ine. To\nevaluate our annotations, we measured the inter-annotator agreement (IAA) using\nboth Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.\nTo compute the baselines of WojoodF ine, we fine-tune three pre-trained Arabic\nBERT encoders in three settings: flat NER, nested NER and nested NER with\nsubtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our\ncorpus and models are open-source and available at\nhttps://sina.birzeit.edu/wojood/.\n","authors":["Haneen Liqreina","Mustafa Jarrar","Mohammed Khalilia","Ahmed Oumar El-Shangiti","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2310.17333v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11444v1","updated":"2023-12-18T18:47:42Z","published":"2023-12-18T18:47:42Z","title":"An In-depth Look at Gemini's Language Abilities","summary":" The recently released Google Gemini class of models are the first to\ncomprehensively report results that rival the OpenAI GPT series across a wide\nvariety of tasks. In this paper, we do an in-depth exploration of Gemini's\nlanguage abilities, making two contributions. First, we provide a third-party,\nobjective comparison of the abilities of the OpenAI GPT and Google Gemini\nmodels with reproducible code and fully transparent results. Second, we take a\ncloser look at the results, identifying areas where one of the two model\nclasses excels. We perform this analysis over 10 datasets testing a variety of\nlanguage abilities, including reasoning, answering knowledge-based questions,\nsolving math problems, translating between languages, generating code, and\nacting as instruction-following agents. From this analysis, we find that Gemini\nPro achieves accuracy that is close but slightly inferior to the corresponding\nGPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations\nfor some of this under-performance, including failures in mathematical\nreasoning with many digits, sensitivity to multiple-choice answer ordering,\naggressive content filtering, and others. We also identify areas where Gemini\ndemonstrates comparably high performance, including generation into non-English\nlanguages, and handling longer and more complex reasoning chains. Code and data\nfor reproduction can be found at https://github.com/neulab/gemini-benchmark\n","authors":["Syeda Nahida Akter","Zichun Yu","Aashiq Muhamed","Tianyue Ou","Alex Bäuerle","Ángel Alexander Cabrera","Krish Dholakia","Chenyan Xiong","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2312.11444v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11441v1","updated":"2023-12-18T18:44:10Z","published":"2023-12-18T18:44:10Z","title":"Social Learning: Towards Collaborative Learning with Large Language\n Models","summary":" We introduce the framework of \"social learning\" in the context of large\nlanguage models (LLMs), whereby models share knowledge with each other in a\nprivacy-aware manner using natural language. We present and evaluate two\napproaches for knowledge transfer between LLMs. In the first scenario, we allow\nthe model to generate abstract prompts aiming to teach the task. In our second\napproach, models transfer knowledge by generating synthetic examples. We\nevaluate these methods across diverse datasets and quantify memorization as a\nproxy for privacy loss. These techniques inspired by social learning yield\npromising results with low memorization of the original data. In particular, we\nshow that performance using these methods is comparable to results with the use\nof original labels and prompts. Our work demonstrates the viability of social\nlearning for LLMs, establishes baseline approaches and highlights several\nunexplored areas for future work.\n","authors":["Amirkeivan Mohtashami","Florian Hartmann","Sian Gooding","Lukas Zilka","Matt Sharifi","Blaise Aguera y Arcas"],"pdf_url":"https://arxiv.org/pdf/2312.11441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11420v1","updated":"2023-12-18T18:21:43Z","published":"2023-12-18T18:21:43Z","title":"Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM\n Finetuning","summary":" This paper introduces an efficient strategy to transform Large Language\nModels (LLMs) into Multi-Modal Large Language Models (MLLMs). By\nconceptualizing this transformation as a domain adaptation process, i.e.,\ntransitioning from text understanding to embracing multiple modalities, we\nintriguingly note that, within each attention block, tuning LayerNorm suffices\nto yield strong performance. Moreover, when benchmarked against other tuning\napproaches like full parameter finetuning or LoRA, its benefits on efficiency\nare substantial. For example, when compared to LoRA on a 13B model scale,\nperformance can be enhanced by an average of over 20% across five multi-modal\ntasks, and meanwhile, results in a significant reduction of trainable\nparameters by 41.9% and a decrease in GPU memory usage by 17.6%. On top of this\nLayerNorm strategy, we showcase that selectively tuning only with\nconversational data can improve efficiency further. Beyond these empirical\noutcomes, we provide a comprehensive analysis to explore the role of LayerNorm\nin adapting LLMs to the multi-modal domain and improving the expressive power\nof the model.\n","authors":["Bingchen Zhao","Haoqin Tu","Chen Wei","Jieru Mei","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2312.11420v1.pdf","comment":"The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2312.11399v1","updated":"2023-12-18T18:02:41Z","published":"2023-12-18T18:02:41Z","title":"News Signals: An NLP Library for Text and Time Series","summary":" We present an open-source Python library for building and using datasets\nwhere inputs are clusters of textual data, and outputs are sequences of real\nvalues representing one or more time series signals. The news-signals library\nsupports diverse data science and NLP problem settings related to the\nprediction of time series behaviour using textual data feeds. For example, in\nthe news domain, inputs are document clusters corresponding to daily news\narticles about a particular entity, and targets are explicitly associated\nreal-valued time series: the volume of news about a particular person or\ncompany, or the number of pageviews of specific Wikimedia pages. Despite many\nindustry and research use cases for this class of problem settings, to the best\nof our knowledge, News Signals is the only open-source library designed\nspecifically to facilitate data science and research settings with natural\nlanguage inputs and time series targets. In addition to the core codebase for\nbuilding and interacting with datasets, we also conduct a suite of experiments\nusing several popular Machine Learning libraries, which are used to establish\nbaselines for time series anomaly prediction using textual inputs.\n","authors":["Chris Hokamp","Demian Gholipour Ghalandari","Parsa Ghaffari"],"pdf_url":"https://arxiv.org/pdf/2312.11399v1.pdf","comment":"EMNLP NLP-OSS Workshop, December 2023"},{"id":"http://arxiv.org/abs/2312.11395v1","updated":"2023-12-18T17:55:05Z","published":"2023-12-18T17:55:05Z","title":"Verb Categorisation for Hindi Word Problem Solving","summary":" Word problem Solving is a challenging NLP task that deals with solving\nmathematical problems described in natural language. Recently, there has been\nrenewed interest in developing word problem solvers for Indian languages. As\npart of this paper, we have built a Hindi arithmetic word problem solver which\nmakes use of verbs. Additionally, we have created verb categorization data for\nHindi. Verbs are very important for solving word problems with\naddition/subtraction operations as they help us identify the set of operations\nrequired to solve the word problems. We propose a rule-based solver that uses\nverb categorisation to identify operations in a word problem and generate\nanswers for it. To perform verb categorisation, we explore several approaches\nand present a comparative study.\n","authors":["Harshita Sharma","Pruthwik Mishra","Dipti Misra Sharma"],"pdf_url":"https://arxiv.org/pdf/2312.11395v1.pdf","comment":"16 pages, 17 figures, ICON 2023 Conference"},{"id":"http://arxiv.org/abs/2312.11370v1","updated":"2023-12-18T17:36:20Z","published":"2023-12-18T17:36:20Z","title":"G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model","summary":" Large language models (LLMs) have shown remarkable proficiency in human-level\nreasoning and generation capabilities, which encourages extensive research on\ntheir application in mathematical problem solving. However, current work has\nbeen largely focused on text-based mathematical problems, with limited\ninvestigation in problems involving geometric information. Addressing this gap,\nwe aim to enable LLMs to solve geometric problems by understanding image input.\nWe first analyze the limitations of current Multimodal Large Language Models\n(MLLMs) in this area: they struggle to accurately comprehending basic geometric\nelements and their relationships. To overcome these challenges, we take\nadvantage of the unique characteristics of geometric problems (such as unique\ngeometric logical form, and geometric scalability) and the capacity of the\ntextual LLMs to build an enriched multimodal geometry dataset based on existing\ndata. The augmented dataset, Geo170K, contains more than 170K geometric\nimage-caption and question-answer pairs. Utilizing our constructed Geo170K\ndataset, we develop G-LLaVA, which demonstrates exceptional performance in\nsolving geometric problems, significantly outperforming GPT-4-V on the\nMathVista benchmark with only 7B parameters.\n","authors":["Jiahui Gao","Renjie Pi","Jipeng Zhang","Jiacheng Ye","Wanjun Zhong","Yufei Wang","Lanqing Hong","Jianhua Han","Hang Xu","Zhenguo Li","Lingpeng Kong"],"pdf_url":"https://arxiv.org/pdf/2312.11370v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2312.11361v1","updated":"2023-12-18T17:18:04Z","published":"2023-12-18T17:18:04Z","title":"NoMIRACL: Knowing When You Don't Know for Robust Multilingual\n Retrieval-Augmented Generation","summary":" Retrieval-augmented generation (RAG) grounds large language model (LLM)\noutput by leveraging external knowledge sources to reduce factual\nhallucinations. However, prior works lack a comprehensive evaluation of\ndifferent language families, making it challenging to evaluate LLM robustness\nagainst errors in external retrieved knowledge. To overcome this, we establish\nNoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across\n18 typologically diverse languages. NoMIRACL includes both a non-relevant and a\nrelevant subset. Queries in the non-relevant subset contain passages manually\njudged as non-relevant or noisy, whereas queries in the relevant subset include\nat least a single judged relevant passage. We measure LLM robustness using two\nmetrics: (i) hallucination rate, measuring model tendency to hallucinate an\nanswer, when the answer is not present in passages in the non-relevant subset,\nand (ii) error rate, measuring model inaccuracy to recognize relevant passages\nin the relevant subset. We build a GPT-4 baseline which achieves a 33.2%\nhallucination rate on the non-relevant and a 14.9% error rate on the relevant\nsubset on average. Our evaluation reveals that GPT-4 hallucinates frequently in\nhigh-resource languages, such as French or English. This work highlights an\nimportant avenue for future research to improve LLM robustness to learn how to\nbetter reject non-relevant information in RAG.\n","authors":["Nandan Thakur","Luiz Bonifacio","Xinyu Zhang","Odunayo Ogundepo","Ehsan Kamalloo","David Alfonso-Hermelo","Xiaoguang Li","Qun Liu","Boxing Chen","Mehdi Rezagholizadeh","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11361v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11356v1","updated":"2023-12-18T17:12:35Z","published":"2023-12-18T17:12:35Z","title":"The Problem of Coherence in Natural Language Explanations of\n Recommendations","summary":" Providing natural language explanations for recommendations is particularly\nuseful from the perspective of a non-expert user. Although several methods for\nproviding such explanations have recently been proposed, we argue that an\nimportant aspect of explanation quality has been overlooked in their\nexperimental evaluation. Specifically, the coherence between generated text and\npredicted rating, which is a necessary condition for an explanation to be\nuseful, is not properly captured by currently used evaluation measures. In this\npaper, we highlight the issue of explanation and prediction coherence by 1)\npresenting results from a manual verification of explanations generated by one\nof the state-of-the-art approaches 2) proposing a method of automatic coherence\nevaluation 3) introducing a new transformer-based method that aims to produce\nmore coherent explanations than the state-of-the-art approaches 4) performing\nan experimental evaluation which demonstrates that this method significantly\nimproves the explanation coherence without affecting the other aspects of\nrecommendation performance.\n","authors":["Jakub Raczyński","Mateusz Lango","Jerzy Stefanowski"],"pdf_url":"https://arxiv.org/pdf/2312.11356v1.pdf","comment":"ECAI 2023"},{"id":"http://arxiv.org/abs/2312.11345v1","updated":"2023-12-18T16:51:26Z","published":"2023-12-18T16:51:26Z","title":"Implicit Affordance Acquisition via Causal Action-Effect Modeling in the\n Video Domain","summary":" Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent\nfindings indicate that world knowledge emerges through large-scale\nself-supervised pretraining, motivating our exploration of acquiring affordance\nknowledge from the visual domain. To this end, we augment an existing\ninstructional video resource to create the new Causal Action-Effect (CAE)\ndataset and design two novel pretraining tasks -- Masked Action Modeling (MAM)\nand Masked Effect Modeling (MEM) -- promoting the acquisition of two affordance\nproperties in models: behavior and entity equivalence, respectively. We\nempirically demonstrate the effectiveness of our proposed methods in learning\naffordance properties. Furthermore, we show that a model pretrained on both\ntasks outperforms a strong image-based visual-linguistic foundation model\n(FLAVA) as well as pure linguistic models on a zero-shot physical reasoning\nprobing task.\n","authors":["Hsiu-Yu Yang","Carina Silberer"],"pdf_url":"https://arxiv.org/pdf/2312.11345v1.pdf","comment":"Accepted at IJCNLP-AACL 2023"},{"id":"http://arxiv.org/abs/2312.11344v1","updated":"2023-12-18T16:50:27Z","published":"2023-12-18T16:50:27Z","title":"Muted: Multilingual Targeted Offensive Speech Identification and\n Visualization","summary":" Offensive language such as hate, abuse, and profanity (HAP) occurs in various\ncontent on the web. While previous work has mostly dealt with sentence level\nannotations, there have been a few recent attempts to identify offensive spans\nas well. We build upon this work and introduce Muted, a system to identify\nmultilingual HAP content by displaying offensive arguments and their targets\nusing heat maps to indicate their intensity. Muted can leverage any\ntransformer-based HAP-classification model and its attention mechanism\nout-of-the-box to identify toxic spans, without further fine-tuning. In\naddition, we use the spaCy library to identify the specific targets and\narguments for the words predicted by the attention heatmaps. We present the\nmodel's performance on identifying offensive spans and their targets in\nexisting datasets and present new annotations on German text. Finally, we\ndemonstrate our proposed visualization tool on multilingual inputs.\n","authors":["Christoph Tillmann","Aashka Trivedi","Sara Rosenthal","Santosh Borse","Rong Zhang","Avirup Sil","Bishwaranjan Bhattacharjee"],"pdf_url":"https://arxiv.org/pdf/2312.11344v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.16456v2","updated":"2023-12-18T16:27:03Z","published":"2023-07-31T07:31:48Z","title":"Camoscio: an Italian Instruction-tuned LLaMA","summary":" In recent years Large Language Models (LLMs) have increased the state of the\nart on several natural language processing tasks. However, their accessibility\nis often limited to paid API services, posing challenges for researchers in\nconducting extensive investigations. On the other hand, while some open-source\nmodels have been proposed by the community, they are typically English-centric\nor multilingual without a specific adaptation for the Italian language. In an\neffort to democratize the available and open resources for the Italian\nlanguage, in this paper we introduce Camoscio: a language model specifically\ntuned to follow users' prompts in Italian. Specifically, we finetuned the\nsmallest variant of LLaMA (7b) with LoRA on a corpus of instruction prompts\ntranslated to Italian via ChatGPT. Results indicate that the model's zero-shot\nperformance on various downstream tasks in Italian competes favorably with\nexisting models specifically finetuned for those tasks. All the artifacts\n(code, dataset, model) are released to the community at the following url:\nhttps://github.com/teelinsan/camoscio\n","authors":["Andrea Santilli","Emanuele Rodolà"],"pdf_url":"https://arxiv.org/pdf/2307.16456v2.pdf","comment":"Published at CLiC-it 2023"},{"id":"http://arxiv.org/abs/2308.09720v2","updated":"2023-12-18T16:17:36Z","published":"2023-08-09T09:15:07Z","title":"On the Unexpected Abilities of Large Language Models","summary":" Large Language Models (LLMs) are capable of displaying a wide range of\nabilities that are not directly connected with the task for which they are\ntrained: predicting the next words of human-written texts. In this article, I\nreview recent research investigating the cognitive abilities developed by LLMs\nand their relation to human cognition. I discuss the nature of the indirect\nprocess that leads to the acquisition of these cognitive abilities, their\nrelation to other indirect processes, and the implications for the acquisition\nof integrated abilities. Moreover, I propose the factors that enable the\ndevelopment of abilities that are related only very indirectly to the proximal\nobjective of the training task. Finally, I discuss whether the full set of\ncapabilities that LLMs could possibly develop is predictable.\n","authors":["Stefano Nolfi"],"pdf_url":"https://arxiv.org/pdf/2308.09720v2.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2308.15452v6","updated":"2023-12-18T16:15:33Z","published":"2023-08-29T17:22:39Z","title":"When Do Program-of-Thoughts Work for Reasoning?","summary":" In the realm of embodied artificial intelligence, the reasoning capabilities\nof Large Language Models (LLMs) play a pivotal role. Although there are\neffective methods like program-of-thought prompting for LLMs which uses\nprogramming language to tackle complex reasoning tasks, the specific impact of\ncode data on the improvement of reasoning capabilities remains under-explored.\nTo address this gap, we propose complexity-impacted reasoning score (CIRS),\nwhich combines structural and logical attributes, to measure the correlation\nbetween code and reasoning abilities. Specifically, we use the abstract syntax\ntree to encode the structural information and calculate logical complexity by\nconsidering the difficulty and the cyclomatic complexity. Through an empirical\nanalysis, we find not all code data of complexity can be learned or understood\nby LLMs. Optimal level of complexity is critical to the improvement of\nreasoning abilities by program-aided prompting. Then we design an\nauto-synthesizing and stratifying algorithm, and apply it to instruction\ngeneration for mathematical reasoning and code data filtering for code\ngeneration tasks. Extensive results demonstrates the effectiveness of our\nproposed approach. Code will be integrated into the EasyInstruct framework at\nhttps://github.com/zjunlp/EasyInstruct.\n","authors":["Zhen Bi","Ningyu Zhang","Yinuo Jiang","Shumin Deng","Guozhou Zheng","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2308.15452v6.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2301.10405v7","updated":"2023-12-18T16:09:49Z","published":"2023-01-25T04:45:06Z","title":"Editing Language Model-based Knowledge Graph Embeddings","summary":" Recently decades have witnessed the empirical success of framing Knowledge\nGraph (KG) embeddings via language models. However, language model-based KG\nembeddings are usually deployed as static artifacts, making them difficult to\nmodify post-deployment without re-training after deployment. To address this\nissue, we propose a new task of editing language model-based KG embeddings in\nthis paper. This task is designed to facilitate rapid, data-efficient updates\nto KG embeddings without compromising the performance of other aspects. We\nbuild four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and\nevaluate several knowledge editing baselines demonstrating the limited ability\nof previous models to handle the proposed challenging task. We further propose\na simple yet strong baseline dubbed KGEditor, which utilizes additional\nparametric layers of the hypernetwork to edit/add facts. Our comprehensive\nexperimental results reveal that KGEditor excels in updating specific facts\nwithout impacting the overall performance, even when faced with limited\ntraining resources. Code and datasets are available in\nhttps://github.com/zjunlp/PromptKG/tree/main/deltaKG.\n","authors":["Siyuan Cheng","Bozhong Tian","Xi Chen","Ningyu Zhang","Qingbing Liu","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2301.10405v7.pdf","comment":"AAAI 2024. The project website is\n https://zjunlp.github.io/project/KGE_Editing/"},{"id":"http://arxiv.org/abs/2312.11312v1","updated":"2023-12-18T16:06:18Z","published":"2023-12-18T16:06:18Z","title":"APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT\n Training Data Creation","summary":" Automatic Post-Editing (APE) is the task of automatically identifying and\ncorrecting errors in the Machine Translation (MT) outputs. We propose a\nrepair-filter-use methodology that uses an APE system to correct errors on the\ntarget side of the MT training data. We select the sentence pairs from the\noriginal and corrected sentence pairs based on the quality scores computed\nusing a Quality Estimation (QE) model. To the best of our knowledge, this is a\nnovel adaptation of APE and QE to extract quality parallel corpus from the\npseudo-parallel corpus. By training with this filtered corpus, we observe an\nimprovement in the Machine Translation system's performance by 5.64 and 9.91\nBLEU points, for English-Marathi and Marathi-English, over the baseline model.\nThe baseline model is the one that is trained on the whole pseudo-parallel\ncorpus. Our work is not limited by the characteristics of English or Marathi\nlanguages; and is language pair-agnostic, given the necessary QE and APE data.\n","authors":["Akshay Batheja","Sourabh Deoghare","Diptesh Kanojia","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2312.11312v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2306.03507"},{"id":"http://arxiv.org/abs/2308.08796v2","updated":"2023-12-18T15:51:43Z","published":"2023-08-17T06:04:28Z","title":"Chinese Spelling Correction as Rephrasing Language Model","summary":" This paper studies Chinese Spelling Correction (CSC), which aims to detect\nand correct the potential spelling errors in a given sentence. Current\nstate-of-the-art methods regard CSC as a sequence tagging task and fine-tune\nBERT-based models on sentence pairs. However, we note a critical flaw in the\nprocess of tagging one character to another, that the correction is excessively\nconditioned on the error. This is opposite from human mindset, where\nindividuals rephrase the complete sentence based on its semantics, rather than\nsolely on the error patterns memorized before. Such a counter-intuitive\nlearning process results in the bottleneck of generalizability and\ntransferability of machine spelling correction. To address this, we propose\nRephrasing Language Model (ReLM), where the model is trained to rephrase the\nentire sentence by infilling additional slots, instead of\ncharacter-to-character tagging. This novel training paradigm achieves the new\nstate-of-the-art results across fine-tuned and zero-shot CSC benchmarks,\noutperforming previous counterparts by a large margin. Our method also learns\ntransferable language representation when CSC is jointly trained with other\ntasks.\n","authors":["Linfeng Liu","Hongqiu Wu","Hai Zhao"],"pdf_url":"https://arxiv.org/pdf/2308.08796v2.pdf","comment":"Accepted by AAAI'2024"},{"id":"http://arxiv.org/abs/2312.11296v1","updated":"2023-12-18T15:45:39Z","published":"2023-12-18T15:45:39Z","title":"From Generalized Laughter to Personalized Chuckles: Unleashing the Power\n of Data Fusion in Subjective Humor Detection","summary":" The vast area of subjectivity in Natural Language Processing (NLP) poses a\nchallenge to the solutions typically used in generalized tasks. As exploration\nin the scope of generalized NLP is much more advanced, it implies the\ntremendous gap that is still to be addressed amongst all feasible tasks where\nan opinion, taste, or feelings are inherent, thus creating a need for a\nsolution, where a data fusion could take place. We have chosen the task of\nfunniness, as it heavily relies on the sense of humor, which is fundamentally\nsubjective. Our experiments across five personalized and four generalized\ndatasets involving several personalized deep neural architectures have shown\nthat the task of humor detection greatly benefits from the inclusion of\npersonalized data in the training process. We tested five scenarios of training\ndata fusion that focused on either generalized (majority voting) or\npersonalized approaches to humor detection. The best results were obtained for\nthe setup, in which all available personalized datasets were joined to train\nthe personalized reasoning model. It boosted the prediction performance by up\nto approximately 35% of the macro F1 score. Such a significant gain was\nobserved for all five personalized test sets. At the same time, the impact of\nthe model's architecture was much less than the personalization itself. It\nseems that concatenating personalized datasets, even with the cost of\nnormalizing the range of annotations across all datasets, if combined with the\npersonalized models, results in an enormous increase in the performance of\nhumor detection.\n","authors":["Julita Bielaniewicz","Przemysław Kazienko"],"pdf_url":"https://arxiv.org/pdf/2312.11296v1.pdf","comment":"10 pages, 13 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.11282v1","updated":"2023-12-18T15:23:06Z","published":"2023-12-18T15:23:06Z","title":"LLM-ARK: Knowledge Graph Reasoning Using Large Language Models via Deep\n Reinforcement Learning","summary":" With the evolution of pre-training methods, large language models (LLMs) have\nexhibited exemplary reasoning capabilities via prompt engineering. However, the\nabsence of Knowledge Graph (KG) environment awareness and the challenge of\nengineering viable optimization mechanisms for intermediary reasoning\nprocesses, constrict the performance of LLMs on KG reasoning tasks compared to\nsmaller models. We introduce LLM-ARK, a LLM grounded KG reasoning agent\ndesigned to deliver precise and adaptable predictions on KG paths. LLM-ARK\nutilizes Full Textual Environment (FTE) prompts to assimilate state information\nfor each step-sized intelligence. Leveraging LLMs to richly encode and\nrepresent various types of inputs and integrate the knowledge graph further\nwith path environment data, before making the final decision. Reframing the\nKnowledge Graph (KG) multi-hop inference problem as a sequential\ndecision-making issue, we optimize our model using the Proximal Policy\nOptimization (PPO) online policy gradient reinforcement learning algorithm\nwhich allows the model to learn from a vast array of reward signals across\ndiverse tasks and environments. We evaluate state-of-the-art LLM(GPT-4) and our\nmethod which using open-source models of varying sizes on OpenDialKG dataset.\nOur experiment shows that LLaMA7B-ARK provides excellent results with a\nperformance rate of 48.75% for the target@1 evaluation metric, far exceeding\nthe current state-of-the-art model by 17.64 percentage points. Meanwhile, GPT-4\naccomplished a score of only 14.91%, further highlighting the efficacy and\ncomplexity of our methodology. Our code is available on GitHub for further\naccess.\n","authors":["Yuxuan Huang"],"pdf_url":"https://arxiv.org/pdf/2312.11282v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11272v1","updated":"2023-12-18T15:16:54Z","published":"2023-12-18T15:16:54Z","title":"Disentangling continuous and discrete linguistic signals in\n transformer-based sentence embeddings","summary":" Sentence and word embeddings encode structural and semantic information in a\ndistributed manner. Part of the information encoded -- particularly lexical\ninformation -- can be seen as continuous, whereas other -- like structural\ninformation -- is most often discrete. We explore whether we can compress\ntransformer-based sentence embeddings into a representation that separates\ndifferent linguistic signals -- in particular, information relevant to\nsubject-verb agreement and verb alternations. We show that by compressing an\ninput sequence that shares a targeted phenomenon into the latent layer of a\nvariational autoencoder-like system, the targeted linguistic information\nbecomes more explicit. A latent layer with both discrete and continuous\ncomponents captures better the targeted phenomena than a latent layer with only\ndiscrete or only continuous components. These experiments are a step towards\nseparating linguistic signals from distributed text embeddings and linking them\nto more symbolic representations.\n","authors":["Vivi Nastase","Paola Merlo"],"pdf_url":"https://arxiv.org/pdf/2312.11272v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11242v1","updated":"2023-12-18T14:40:20Z","published":"2023-12-18T14:40:20Z","title":"MAC-SQL: Multi-Agent Collaboration for Text-to-SQL","summary":" Recent advancements in Text-to-SQL methods employing Large Language Models\n(LLMs) have demonstrated remarkable performance. Nonetheless, these approaches\ncontinue to encounter difficulties when handling extensive databases, intricate\nuser queries, and erroneous SQL results. To tackle these challenges, we present\n\\textbf{MAC-SQL}, a LLM-based multi-agent collaborative Text- to-SQL framework\nbased on LLMs. This framework comprises three agents: the \\textit{Selector},\naccountable for condensing voluminous databases and preserving relevant table\nschemas for user questions; the \\textit{Decomposer}, which disassembles complex\nuser questions into more straightforward sub-problems and resolves them\nprogressively; and the \\textit{Refiner}, tasked with validating and refining\ndefective SQL queries. We perform thorough experiments on two Text-to-SQL\ndatasets, BIRD and Spider, attaining a state-of-the-art execution accuracy of\n59.59\\% on the BIRD test set. Moreover, we have open-sourced an instruction\nfine-tuning model, \\textbf{SQL-Llama}, based on Code Llama 7B, in addition to\nan agent instruction dataset derived from training data based on BIRD and\nSpider. The SQL-Llama model has demonstrated encouraging outcomes on the\ndevelopment sets of both BIRD and Spider. However, when compared to the GPT-4\nmodel, there remains a notable potential for enhancement. Our code and data can\nbe accessed publicly at\n\\href{https://github.com/wbbeyourself/MAC-SQL}{https://github.com/wbbeyourself/MAC-SQL}.\n","authors":["Bing Wang","Changyu Ren","Jian Yang","Xinnian Liang","Jiaqi Bai","Qian-Wen Zhang","Zhao Yan","Zhoujun Li"],"pdf_url":"https://arxiv.org/pdf/2312.11242v1.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2310.12439v2","updated":"2023-12-18T13:20:46Z","published":"2023-10-19T03:25:28Z","title":"PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models","summary":" Prompts have significantly improved the performance of pretrained Large\nLanguage Models (LLMs) on various downstream tasks recently, making them\nincreasingly indispensable for a diverse range of LLM application scenarios.\nHowever, the backdoor vulnerability, a serious security threat that can\nmaliciously alter the victim model's normal predictions, has not been\nsufficiently explored for prompt-based LLMs. In this paper, we present\nPOISONPROMPT, a novel backdoor attack capable of successfully compromising both\nhard and soft prompt-based LLMs. We evaluate the effectiveness, fidelity, and\nrobustness of POISONPROMPT through extensive experiments on three popular\nprompt methods, using six datasets and three widely used LLMs. Our findings\nhighlight the potential security threats posed by backdoor attacks on\nprompt-based LLMs and emphasize the need for further research in this area.\n","authors":["Hongwei Yao","Jian Lou","Zhan Qin"],"pdf_url":"https://arxiv.org/pdf/2310.12439v2.pdf","comment":"To Appear in IEEE ICASSP 2024, code is available at:\n https://github.com/grasses/PoisonPrompt"},{"id":"http://arxiv.org/abs/2309.15512v2","updated":"2023-12-18T12:52:08Z","published":"2023-09-27T09:27:03Z","title":"High-Fidelity Speech Synthesis with Minimal Supervision: All Using\n Diffusion Models","summary":" Text-to-speech (TTS) methods have shown promising results in voice cloning,\nbut they require a large number of labeled text-speech pairs.\nMinimally-supervised speech synthesis decouples TTS by combining two types of\ndiscrete speech representations(semantic \\& acoustic) and using two\nsequence-to-sequence tasks to enable training with minimal supervision.\nHowever, existing methods suffer from information redundancy and dimension\nexplosion in semantic representation, and high-frequency waveform distortion in\ndiscrete acoustic representation. Autoregressive frameworks exhibit typical\ninstability and uncontrollability issues. And non-autoregressive frameworks\nsuffer from prosodic averaging caused by duration prediction models. To address\nthese issues, we propose a minimally-supervised high-fidelity speech synthesis\nmethod, where all modules are constructed based on the diffusion models. The\nnon-autoregressive framework enhances controllability, and the duration\ndiffusion model enables diversified prosodic expression. Contrastive\nToken-Acoustic Pretraining (CTAP) is used as an intermediate semantic\nrepresentation to solve the problems of information redundancy and dimension\nexplosion in existing semantic coding methods. Mel-spectrogram is used as the\nacoustic representation. Both semantic and acoustic representations are\npredicted by continuous variable regression tasks to solve the problem of\nhigh-frequency fine-grained waveform distortion. Experimental results show that\nour proposed method outperforms the baseline method. We provide audio samples\non our website.\n","authors":["Chunyu Qiang","Hao Li","Yixin Tian","Yi Zhao","Ying Zhang","Longbiao Wang","Jianwu Dang"],"pdf_url":"https://arxiv.org/pdf/2309.15512v2.pdf","comment":"Accepted by ICASSP 2024. arXiv admin note: substantial text overlap\n with arXiv:2307.15484; text overlap with arXiv:2309.00424"},{"id":"http://arxiv.org/abs/2309.00424v5","updated":"2023-12-18T12:49:49Z","published":"2023-09-01T12:35:43Z","title":"Learning Speech Representation From Contrastive Token-Acoustic\n Pretraining","summary":" For fine-grained generation and recognition tasks such as\nminimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic\nspeech recognition (ASR), the intermediate representations extracted from\nspeech should serve as a \"bridge\" between text and acoustic information,\ncontaining information from both modalities. The semantic content is\nemphasized, while the paralinguistic information such as speaker identity and\nacoustic details should be de-emphasized. However, existing methods for\nextracting fine-grained intermediate representations from speech suffer from\nissues of excessive redundancy and dimension explosion. Contrastive learning is\na good method for modeling intermediate representations from two modalities.\nHowever, existing contrastive learning methods in the audio field focus on\nextracting global descriptive information for downstream audio classification\ntasks, making them unsuitable for TTS, VC, and ASR tasks. To address these\nissues, we propose a method named \"Contrastive Token-Acoustic Pretraining\n(CTAP)\", which uses two encoders to bring phoneme and speech into a joint\nmultimodal space, learning how to connect phoneme and speech at the frame\nlevel. The CTAP model is trained on 210k speech and phoneme pairs, achieving\nminimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a\npromising solution for fine-grained generation and recognition downstream tasks\nin speech processing. We provide a website with audio samples.\n","authors":["Chunyu Qiang","Hao Li","Yixin Tian","Ruibo Fu","Tao Wang","Longbiao Wang","Jianwu Dang"],"pdf_url":"https://arxiv.org/pdf/2309.00424v5.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2307.15484v3","updated":"2023-12-18T12:48:01Z","published":"2023-07-28T11:20:23Z","title":"Minimally-Supervised Speech Synthesis with Conditional Diffusion Model\n and Language Model: A Comparative Study of Semantic Coding","summary":" Recently, there has been a growing interest in text-to-speech (TTS) methods\nthat can be trained with minimal supervision by combining two types of discrete\nspeech representations and using two sequence-to-sequence tasks to decouple\nTTS. However, existing methods suffer from three problems: the high\ndimensionality and waveform distortion of discrete speech representations, the\nprosodic averaging problem caused by the duration prediction model in\nnon-autoregressive frameworks, and the information redundancy and dimension\nexplosion problems of existing semantic encoding methods. To address these\nproblems, three progressive methods are proposed. First, we propose\nDiff-LM-Speech, an autoregressive structure consisting of a language model and\ndiffusion models, which models the semantic embedding into the mel-spectrogram\nbased on a diffusion model to achieve higher audio quality. We also introduce a\nprompt encoder structure based on a variational autoencoder and a prosody\nbottleneck to improve prompt representation ability. Second, we propose\nTetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion\nmodel-based modules that design a duration diffusion model to achieve diverse\nprosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive\nstructure consisting of three diffusion model-based modules that verify the\nnon-necessity of existing semantic encoding models and achieve the best\nresults. Experimental results show that our proposed methods outperform\nbaseline methods. We provide a website with audio samples.\n","authors":["Chunyu Qiang","Hao Li","Hao Ni","He Qu","Ruibo Fu","Tao Wang","Longbiao Wang","Jianwu Dang"],"pdf_url":"https://arxiv.org/pdf/2307.15484v3.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.11152v1","updated":"2023-12-18T12:46:09Z","published":"2023-12-18T12:46:09Z","title":"Prompt Based Tri-Channel Graph Convolution Neural Network for Aspect\n Sentiment Triplet Extraction","summary":" Aspect Sentiment Triplet Extraction (ASTE) is an emerging task to extract a\ngiven sentence's triplets, which consist of aspects, opinions, and sentiments.\nRecent studies tend to address this task with a table-filling paradigm, wherein\nword relations are encoded in a two-dimensional table, and the process involves\nclarifying all the individual cells to extract triples. However, these studies\nignore the deep interaction between neighbor cells, which we find quite helpful\nfor accurate extraction. To this end, we propose a novel model for the ASTE\ntask, called Prompt-based Tri-Channel Graph Convolution Neural Network\n(PT-GCN), which converts the relation table into a graph to explore more\ncomprehensive relational information. Specifically, we treat the original table\ncells as nodes and utilize a prompt attention score computation module to\ndetermine the edges' weights. This enables us to construct a target-aware\ngrid-like graph to enhance the overall extraction process. After that, a\ntriple-channel convolution module is conducted to extract precise sentiment\nknowledge. Extensive experiments on the benchmark datasets show that our model\nachieves state-of-the-art performance. The code is available at\nhttps://github.com/KunPunCN/PT-GCN.\n","authors":["Kun Peng","Lei Jiang","Hao Peng","Rui Liu","Zhengtao Yu","Jiaqian Ren","Zhifeng Hao","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2312.11152v1.pdf","comment":"Accepted in SIAM International Conference on Data Mining (SDM24)"},{"id":"http://arxiv.org/abs/2312.11142v1","updated":"2023-12-18T12:32:42Z","published":"2023-12-18T12:32:42Z","title":"Efficiency-oriented approaches for self-supervised speech representation\n learning","summary":" Self-supervised learning enables the training of large neural models without\nthe need for large, labeled datasets. It has been generating breakthroughs in\nseveral fields, including computer vision, natural language processing,\nbiology, and speech. In particular, the state-of-the-art in several speech\nprocessing applications, such as automatic speech recognition or speaker\nidentification, are models where the latent representation is learned using\nself-supervised approaches. Several configurations exist in self-supervised\nlearning for speech, including contrastive, predictive, and multilingual\napproaches. There is, however, a crucial limitation in most existing\napproaches: their high computational costs. These costs limit the deployment of\nmodels, the size of the training dataset, and the number of research groups\nthat can afford research with large self-supervised models. Likewise, we should\nconsider the environmental costs that high energy consumption implies. Efforts\nin this direction comprise optimization of existing models, neural architecture\nefficiency, improvements in finetuning for speech processing tasks, and data\nefficiency. But despite current efforts, more work could be done to address\nhigh computational costs in self-supervised representation learning.\n","authors":["Luis Lugo","Valentin Vielzeuf"],"pdf_url":"https://arxiv.org/pdf/2312.11142v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2305.19972v2","updated":"2023-12-18T12:29:00Z","published":"2023-05-31T16:01:20Z","title":"VILAS: Exploring the Effects of Vision and Language Context in Automatic\n Speech Recognition","summary":" Enhancing automatic speech recognition (ASR) performance by leveraging\nadditional multimodal information has shown promising results in previous\nstudies. However, most of these works have primarily focused on utilizing\nvisual cues derived from human lip motions. In fact, context-dependent visual\nand linguistic cues can also benefit in many scenarios. In this paper, we first\npropose ViLaS (Vision and Language into Automatic Speech Recognition), a novel\nmultimodal ASR model based on the continuous integrate-and-fire (CIF)\nmechanism, which can integrate visual and textual context simultaneously or\nseparately, to facilitate speech recognition. Next, we introduce an effective\ntraining strategy that improves performance in modal-incomplete test scenarios.\nThen, to explore the effects of integrating vision and language, we create\nVSDial, a multimodal ASR dataset with multimodal context cues in both Chinese\nand English versions. Finally, empirical results are reported on the public\nFlickr8K and self-constructed VSDial datasets. We explore various cross-modal\nfusion schemes, analyze fine-grained crossmodal alignment on VSDial, and\nprovide insights into the effects of integrating multimodal information on\nspeech recognition.\n","authors":["Ziyi Ni","Minglun Han","Feilong Chen","Linghui Meng","Jing Shi","Pin Lv","Bo Xu"],"pdf_url":"https://arxiv.org/pdf/2305.19972v2.pdf","comment":"Accepted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.11135v1","updated":"2023-12-18T12:26:27Z","published":"2023-12-18T12:26:27Z","title":"Linear Attention via Orthogonal Memory","summary":" Efficient attentions have greatly improved the computational efficiency of\nTransformers. However, most existing linear attention mechanisms suffer from an\n\\emph{efficiency degradation} problem, leading to inefficiencies in causal\nlanguage modeling and hindering their application in long-range language\nmodels. This problem is more pronounced under language modeling with unbounded\ncontexts. In this paper, we propose \\textbf{L}inear \\textbf{A}ttention\n\\textbf{V}ia \\textbf{O}rthogonal memory~(\\shortname) to address these\nlimitations, achieving strong performance while maintaining linear complexity.\n\\shortname employs orthogonal decomposition to compress a context into a\nfixed-size orthogonal memory while effectively minimizing redundancy within the\ncontext. Given that orthogonal memory compresses global information, we further\ndissect the context to amplify fine-grained local information. Additionally, we\nembed the relative position encoding into \\shortname to improve the\nextrapolation ability. Experimental results show that \\shortname greatly\nimproves the efficiency of the causal language model with the best\nextrapolation performance and outperforms other efficient baselines. Further,\nwe endeavor to employ \\shortname for unbounded language modeling and\nsuccessfully scale the context length to 128K.\n","authors":["Jun Zhang","Shuyang Jiang","Jiangtao Feng","Lin Zheng","Lingpeng Kong"],"pdf_url":"https://arxiv.org/pdf/2312.11135v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2309.05173v3","updated":"2023-12-18T12:17:54Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. This allows DePT to achieve better\nperformance while saving over 20% memory and time costs compared to vanilla PT\nand its variants, without changing trainable parameter sizes. Through extensive\nexperiments on 23 natural language processing (NLP) and vision-language (VL)\ntasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches,\nincluding the full fine-tuning baseline in some scenarios. Additionally, we\nempirically show that DEPT grows more efficient as the model size increases.\nOur further study reveals that DePT integrates seamlessly with\nparameter-efficient transfer learning in the few-shot learning setting and\nhighlights its adaptability to various model architectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v3.pdf","comment":"Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/2311.04498v4","updated":"2023-12-18T12:15:26Z","published":"2023-11-08T07:15:05Z","title":"NExT-Chat: An LMM for Chat, Detection and Segmentation","summary":" The development of large language models (LLMs) has greatly advanced the\nfield of multimodal understanding, leading to the emergence of large multimodal\nmodels (LMMs). In order to enhance the level of visual comprehension, recent\nstudies have equipped LMMs with region-level understanding capabilities by\nrepresenting object bounding box coordinates as a series of text sequences\n(pix2seq). In this paper, we introduce a novel paradigm for object location\nmodeling called pix2emb method, where we ask the LMM to output the location\nembeddings and then decode them with different decoders. This paradigm allows\nus to use different location formats (such as bounding boxes and masks) in\nmultimodal conversations. Leveraging the proposed pix2emb method, we train an\nLMM named NExT-Chat and demonstrate its capability of handling multiple tasks\nlike visual grounding, region captioning, and grounded reasoning. Comprehensive\nexperiments show the effectiveness of our NExT-Chat on various tasks, e.g.,\nNExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (68.9) vs. LISA\n(67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs.\nKosmos-2 (62.3) on region caption task. The code and model are released at\nhttps://github.com/NExT-ChatV/NExT-Chat.\n","authors":["Ao Zhang","Yuan Yao","Wei Ji","Zhiyuan Liu","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2311.04498v4.pdf","comment":"Technical Report (https://next-chatv.github.io/)"},{"id":"http://arxiv.org/abs/2310.17589v2","updated":"2023-12-18T11:34:31Z","published":"2023-10-26T17:11:42Z","title":"An Open Source Data Contamination Report for Large Language Models","summary":" Data contamination in language model evaluation is increasingly prevalent as\nthe popularity of large language models. It allows models to \"cheat\" via\nmemorisation instead of displaying true capabilities. Therefore, contamination\nanalysis has became an crucial part of reliable model evaluation to validate\nresults. However, existing contamination analysis is usually conducted\ninternally by LLM developers and often lacks transparency and completeness.\nThis paper present an open source data contamination reports for the Llama\nseries models. We analyse six popular multi-choice QA benchmarks and quantify\ntheir overlapping with the training set of Llama. Various levels of\ncontamination ranging from 1\\% to 8.7\\% are found across benchmarks. Our\ncomparison also reveals that Llama models can gain over 5\\% higher accuracy on\ncontaminated subsets versus clean subsets. Data and code are available at:\nhttps://github.com/liyucheng09/Contamination_Detector.\n","authors":["Yucheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.17589v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11069v1","updated":"2023-12-18T10:06:50Z","published":"2023-12-18T10:06:50Z","title":"Patterns of Closeness and Abstractness in Colexifications: The Case of\n Indigenous Languages in the Americas","summary":" Colexification refers to linguistic phenomena where multiple concepts\n(meanings) are expressed by the same lexical form, such as polysemy or\nhomophony. Colexifications have been found to be pervasive across languages and\ncultures. The problem of concreteness/abstractness of concepts is\ninterdisciplinary, studied from a cognitive standpoint in linguistics,\npsychology, psycholinguistics, neurophysiology, etc. In this paper, we\nhypothesize that concepts that are closer in concreteness/abstractness are more\nlikey to colexify, and we test the hypothesis across indigenous languages in\nAmericas.\n","authors":["Yiyi Chen","Johannes Bjerva"],"pdf_url":"https://arxiv.org/pdf/2312.11069v1.pdf","comment":"3 pages, 2 figures, 1 table, AmericasNLP 2023"},{"id":"http://arxiv.org/abs/2312.11062v1","updated":"2023-12-18T09:58:19Z","published":"2023-12-18T09:58:19Z","title":"Entity or Relation Embeddings? An Analysis of Encoding Strategies for\n Relation Extraction","summary":" Relation extraction is essentially a text classification problem, which can\nbe tackled by fine-tuning a pre-trained language model (LM). However, a key\nchallenge arises from the fact that relation extraction cannot\nstraightforwardly be reduced to sequence or token classification. Existing\napproaches therefore solve the problem in an indirect way: they fine-tune an LM\nto learn embeddings of the head and tail entities, and then predict the\nrelationship from these entity embeddings. Our hypothesis in this paper is that\nrelation extraction models can be improved by capturing relationships in a more\ndirect way. In particular, we experiment with appending a prompt with a [MASK]\ntoken, whose contextualised representation is treated as a relation embedding.\nWhile, on its own, this strategy significantly underperforms the aforementioned\napproach, we find that the resulting relation embeddings are highly\ncomplementary to what is captured by embeddings of the head and tail entity. By\njointly considering both types of representations, we end up with a simple\nmodel that outperforms the state-of-the-art across several relation extraction\nbenchmarks.\n","authors":["Frank Mtumbuka","Steven Schockaert"],"pdf_url":"https://arxiv.org/pdf/2312.11062v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11043v1","updated":"2023-12-18T09:18:43Z","published":"2023-12-18T09:18:43Z","title":"TDeLTA: A Light-weight and Robust Table Detection Method based on\n Learning Text Arrangement","summary":" The diversity of tables makes table detection a great challenge, leading to\nexisting models becoming more tedious and complex. Despite achieving high\nperformance, they often overfit to the table style in training set, and suffer\nfrom significant performance degradation when encountering out-of-distribution\ntables in other domains. To tackle this problem, we start from the essence of\nthe table, which is a set of text arranged in rows and columns. Based on this,\nwe propose a novel, light-weighted and robust Table Detection method based on\nLearning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as\ninput, and then models the arrangement of them with a sequential encoder and an\nattention module. To locate the tables precisely, we design a\ntext-classification task, classifying the text blocks into 4 categories\naccording to their semantic roles in the tables. Experiments are conducted on\nboth the text blocks parsed from PDF and extracted by open-source OCR tools,\nrespectively. Compared to several state-of-the-art methods, TDeLTA achieves\ncompetitive results with only 3.1M model parameters on the large-scale public\ndatasets. Moreover, when faced with the cross-domain data under the 0-shot\nsetting, TDeLTA outperforms baselines by a large margin of nearly 7%, which\nshows the strong robustness and transferability of the proposed model.\n","authors":["Yang Fan","Xiangping Wu","Qingcai Chen","Heng Li","Yan Huang","Zhixiang Cai","Qitian Wu"],"pdf_url":"https://arxiv.org/pdf/2312.11043v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11036v1","updated":"2023-12-18T09:13:41Z","published":"2023-12-18T09:13:41Z","title":"UniGen: A Unified Generative Framework for Retrieval and Question\n Answering with Large Language Models","summary":" Generative information retrieval, encompassing two major tasks of Generative\nDocument Retrieval (GDR) and Grounded Answer Generation (GAR), has gained\nsignificant attention in the area of information retrieval and natural language\nprocessing. Existing methods for GDR and GAR rely on separate retrieval and\nreader modules, which hinder simultaneous optimization. To overcome this, we\npresent \\textbf{UniGen}, a \\textbf{Uni}fied \\textbf{Gen}erative framework for\nretrieval and question answering that integrates both tasks into a single\ngenerative model leveraging the capabilities of large language models. UniGen\nemploys a shared encoder and two distinct decoders for generative retrieval and\nquestion answering. To facilitate the learning of both tasks, we introduce\nconnectors, generated by large language models, to bridge the gaps between\nquery inputs and generation targets, as well as between document identifiers\nand answers. Furthermore, we propose an iterative enhancement strategy that\nleverages generated answers and retrieved documents to iteratively improve both\ntasks. Through extensive experiments on the MS MARCO and NQ datasets, we\ndemonstrate the effectiveness of UniGen, showcasing its superior performance in\nboth the retrieval and the question answering tasks.\n","authors":["Xiaoxi Li","Yujia Zhou","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2312.11036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11374v3","updated":"2023-12-18T08:58:51Z","published":"2023-10-17T16:15:34Z","title":"DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models\n for Emotion Recognition in Conversations","summary":" Large language models (LLMs) and their variants have shown extraordinary\nefficacy across numerous downstream natural language processing (NLP) tasks,\nwhich has presented a new vision for the development of NLP. Despite their\nremarkable performance in natural language generating (NLG), LLMs lack a\ndistinct focus on the emotion understanding domain. As a result, using LLMs for\nemotion recognition may lead to suboptimal and inadequate precision. Another\nlimitation of LLMs is that they are typical trained without leveraging\nmulti-modal information. To overcome these limitations, we propose DialogueLLM,\na context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA\nmodels with 13,638 multi-modal (i.e., texts and videos) emotional dialogues.\nThe visual information is considered as the supplementary knowledge to\nconstruct high-quality instructions. We offer a comprehensive evaluation of our\nproposed model on three benchmarking emotion recognition in conversations (ERC)\ndatasets and compare the results against the SOTA baselines and other SOTA\nLLMs. Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB\nA100 GPU in 5 hours, facilitating reproducibility for other researchers.\n","authors":["Yazhou Zhang","Mengyao Wang","Youxi Wu","Prayag Tiwari","Qiuchi Li","Benyou Wang","Jing Qin"],"pdf_url":"https://arxiv.org/pdf/2310.11374v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11020v1","updated":"2023-12-18T08:45:39Z","published":"2023-12-18T08:45:39Z","title":"Information Type Classification with Contrastive Task-Specialized\n Sentence Encoders","summary":" User-generated information content has become an important information source\nin crisis situations. However, classification models suffer from noise and\nevent-related biases which still poses a challenging task and requires\nsophisticated task-adaptation. To address these challenges, we propose the use\nof contrastive task-specialized sentence encoders for downstream\nclassification. We apply the task-specialization on the CrisisLex, HumAID, and\nTrecIS information type classification tasks and show performance gains w.r.t.\nF1-score. Furthermore, we analyse the cross-corpus and cross-lingual\ncapabilities for two German event relevancy classification datasets.\n","authors":["Philipp Seeberger","Tobias Bocklet","Korbinian Riedhammer"],"pdf_url":"https://arxiv.org/pdf/2312.11020v1.pdf","comment":"Accepted at KONVENS 2023"},{"id":"http://arxiv.org/abs/2312.11011v1","updated":"2023-12-18T08:27:33Z","published":"2023-12-18T08:27:33Z","title":"VinaLLaMA: LLaMA-based Vietnamese Foundation Model","summary":" In this technical report, we present VinaLLaMA, an open-weight,\nstate-of-the-art (SOTA) Large Language Model for the Vietnamese language, built\nupon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA not only\ndemonstrates fluency in Vietnamese but also exhibits a profound understanding\nof Vietnamese culture, making it a truly indigenous model. VinaLLaMA-7B-chat,\ntrained on 1 million high-quality synthetic samples, achieves SOTA results on\nkey benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking\na significant advancement in the Vietnamese AI landscape and offering a\nversatile resource for various applications.\n","authors":["Quan Nguyen","Huy Pham","Dung Dao"],"pdf_url":"https://arxiv.org/pdf/2312.11011v1.pdf","comment":"VinaLLaMA Technical Report - 13 pages"},{"id":"http://arxiv.org/abs/2308.06077v3","updated":"2023-12-18T08:26:48Z","published":"2023-08-11T11:29:51Z","title":"Fly-Swat or Cannon? Cost-Effective Language Model Choice via\n Meta-Modeling","summary":" Generative language models (LMs) have become omnipresent across data science.\nFor a wide variety of tasks, inputs can be phrased as natural language prompts\nfor an LM, from whose output the solution can then be extracted. LM performance\nhas consistently been increasing with model size - but so has the monetary cost\nof querying the ever larger models. Importantly, however, not all inputs are\nequally hard: some require larger LMs for obtaining a satisfactory solution,\nwhereas for others smaller LMs suffice. Based on this fact, we design a\nframework for cost-effective language model choice, called \"Fly-swat or cannon\"\n(FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously\nassigns each input to an LM predicted to do well on the input according to a\nso-called meta-model, aiming to achieve high overall performance at low cost.\nThe cost-performance tradeoff can be flexibly tuned by the user. Options\ninclude, among others, maximizing total expected performance (or the number of\nprocessed inputs) while staying within a given cost budget, or minimizing total\ncost while processing all inputs. We evaluate FORC on 14 datasets covering five\nnatural language tasks, using four candidate LMs of vastly different size and\ncost. With FORC, we match the performance of the largest available LM while\nachieving a cost reduction of 63%. Via our publicly available library,\nresearchers as well as practitioners can thus save large amounts of money\nwithout sacrificing performance.\n","authors":["Marija Šakota","Maxime Peyrard","Robert West"],"pdf_url":"https://arxiv.org/pdf/2308.06077v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10997v1","updated":"2023-12-18T07:47:33Z","published":"2023-12-18T07:47:33Z","title":"Retrieval-Augmented Generation for Large Language Models: A Survey","summary":" Large language models (LLMs) demonstrate powerful capabilities, but they\nstill face challenges in practical applications, such as hallucinations, slow\nknowledge updates, and lack of transparency in answers. Retrieval-Augmented\nGeneration (RAG) refers to the retrieval of relevant information from external\nknowledge bases before answering questions with LLMs. RAG has been demonstrated\nto significantly enhance answer accuracy, reduce model hallucination,\nparticularly for knowledge-intensive tasks. By citing sources, users can verify\nthe accuracy of answers and increase trust in model outputs. It also\nfacilitates knowledge updates and the introduction of domain-specific\nknowledge. RAG effectively combines the parameterized knowledge of LLMs with\nnon-parameterized external knowledge bases, making it one of the most important\nmethods for implementing large language models. This paper outlines the\ndevelopment paradigms of RAG in the era of LLMs, summarizing three paradigms:\nNaive RAG, Advanced RAG, and Modular RAG. It then provides a summary and\norganization of the three main components of RAG: retriever, generator, and\naugmentation methods, along with key technologies in each component.\nFurthermore, it discusses how to evaluate the effectiveness of RAG models,\nintroducing two evaluation methods for RAG, emphasizing key metrics and\nabilities for evaluation, and presenting the latest automatic evaluation\nframework. Finally, potential future research directions are introduced from\nthree aspects: vertical optimization, horizontal scalability, and the technical\nstack and ecosystem of RAG.\n","authors":["Yunfan Gao","Yun Xiong","Xinyu Gao","Kangxiang Jia","Jinliu Pan","Yuxi Bi","Yi Dai","Jiawei Sun","Haofen Wang"],"pdf_url":"https://arxiv.org/pdf/2312.10997v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01090v2","updated":"2023-12-18T07:30:48Z","published":"2023-12-02T09:45:45Z","title":"Self Generated Wargame AI: Double Layer Agent Task Planning Based on\n Large Language Model","summary":" The large language models represented by ChatGPT have a disruptive impact on\nthe field of artificial intelligence. But it mainly focuses on natural language\nprocessing, speech recognition, machine learning and natural language\nunderstanding. This paper innovatively applies the large language model to the\nfield of intelligent decision-making, places the large language model in the\ndecision-making center, and constructs an agent architecture with the large\nlanguage model as the core. Based on this, it further proposes a two-layer\nagent task planning, issues and executes decision commands through the\ninteraction of natural language, and carries out simulation verification\nthrough the wargame simulation environment. Through the game confrontation\nsimulation experiment, it is found that the intelligent decision-making ability\nof the large language model is significantly stronger than the commonly used\nreinforcement learning AI and rule AI, and the intelligence, understandability\nand generalization are all better. And through experiments, it was found that\nthe intelligence of the large language model is closely related to prompt. This\nwork also extends the large language model from previous human-computer\ninteraction to the field of intelligent decision-making, which has important\nreference value and significance for the development of intelligent\ndecision-making.\n","authors":["Y. Sun","J. Zhao","C. Yu","W. Wang","X. Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.01090v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13177v2","updated":"2023-12-18T07:29:55Z","published":"2023-08-25T04:54:32Z","title":"How to Evaluate the Generalization of Detection? A Benchmark for\n Comprehensive Open-Vocabulary Detection","summary":" Object detection (OD) in computer vision has made significant progress in\nrecent years, transitioning from closed-set labels to open-vocabulary detection\n(OVD) based on large-scale vision-language pre-training (VLP). However, current\nevaluation methods and datasets are limited to testing generalization over\nobject types and referral expressions, which do not provide a systematic,\nfine-grained, and accurate benchmark of OVD models' abilities. In this paper,\nwe propose a new benchmark named OVDEval, which includes 9 sub-tasks and\nintroduces evaluations on commonsense knowledge, attribute understanding,\nposition understanding, object relation comprehension, and more. The dataset is\nmeticulously created to provide hard negatives that challenge models' true\nunderstanding of visual and linguistic input. Additionally, we identify a\nproblem with the popular Average Precision (AP) metric when benchmarking models\non these fine-grained label datasets and propose a new metric called\nNon-Maximum Suppression Average Precision (NMS-AP) to address this issue.\nExtensive experimental results show that existing top OVD models all fail on\nthe new tasks except for simple object types, demonstrating the value of the\nproposed dataset in pinpointing the weakness of current OVD models and guiding\nfuture research. Furthermore, the proposed NMS-AP metric is verified by\nexperiments to provide a much more truthful evaluation of OVD models, whereas\ntraditional AP metrics yield deceptive results. Data is available at\n\\url{https://github.com/om-ai-lab/OVDEval}\n","authors":["Yiyang Yao","Peng Liu","Tiancheng Zhao","Qianqian Zhang","Jiajia Liao","Chunxin Fang","Kyusong Lee","Qing Wang"],"pdf_url":"https://arxiv.org/pdf/2308.13177v2.pdf","comment":"Long paper accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.10987v1","updated":"2023-12-18T07:22:39Z","published":"2023-12-18T07:22:39Z","title":"Data Contamination Issues in Brain-to-Text Decoding","summary":" Decoding non-invasive cognitive signals to natural language has long been the\ngoal of building practical brain-computer interfaces (BCIs). Recent major\nmilestones have successfully decoded cognitive signals like functional Magnetic\nResonance Imaging (fMRI) and electroencephalogram (EEG) into text under open\nvocabulary setting. However, how to split the datasets for training,\nvalidating, and testing in cognitive signal decoding task still remains\ncontroversial. In this paper, we conduct systematic analysis on current dataset\nsplitting methods and find the existence of data contamination largely\nexaggerates model performance. Specifically, first we find the leakage of test\nsubjects' cognitive signals corrupts the training of a robust encoder. Second,\nwe prove the leakage of text stimuli causes the auto-regressive decoder to\nmemorize information in test set. The decoder generates highly accurate text\nnot because it truly understands cognitive signals. To eliminate the influence\nof data contamination and fairly evaluate different models' generalization\nability, we propose a new splitting method for different types of cognitive\ndatasets (e.g. fMRI, EEG). We also test the performance of SOTA Brain-to-Text\ndecoding models under the proposed dataset splitting paradigm as baselines for\nfurther research.\n","authors":["Congchi Yin","Qian Yu","Zhiwei Fang","Jie He","Changping Peng","Zhangang Lin","Jingping Shao","Piji Li"],"pdf_url":"https://arxiv.org/pdf/2312.10987v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2309.11082v2","updated":"2023-12-18T06:47:29Z","published":"2023-09-20T06:08:11Z","title":"Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial\n Margin Contrastive Learning","summary":" In recent years, the explosion of web videos makes text-video retrieval\nincreasingly essential and popular for video filtering, recommendation, and\nsearch. Text-video retrieval aims to rank relevant text/video higher than\nirrelevant ones. The core of this task is to precisely measure the cross-modal\nsimilarity between texts and videos. Recently, contrastive learning methods\nhave shown promising results for text-video retrieval, most of which focus on\nthe construction of positive and negative pairs to learn text and video\nrepresentations. Nevertheless, they do not pay enough attention to hard\nnegative pairs and lack the ability to model different levels of semantic\nsimilarity. To address these two issues, this paper improves contrastive\nlearning using two novel techniques. First, to exploit hard examples for robust\ndiscriminative power, we propose a novel Dual-Modal Attention-Enhanced Module\n(DMAE) to mine hard negative pairs from textual and visual clues. By further\nintroducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively\nidentify all these hard negatives and explicitly highlight their impacts in the\ntraining loss. Second, our work argues that triplet samples can better model\nfine-grained semantic similarity compared to pairwise samples. We thereby\npresent a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to\nconstruct partial order triplet samples by automatically generating\nfine-grained hard negatives for matched text-video pairs. The proposed TPM-CL\ndesigns an adaptive token masking strategy with cross-modal interaction to\nmodel subtle semantic differences. Extensive experiments demonstrate that the\nproposed approach outperforms existing methods on four widely-used text-video\nretrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.\n","authors":["Chen Jiang","Hong Liu","Xuzheng Yu","Qing Wang","Yuan Cheng","Jia Xu","Zhongyi Liu","Qingpei Guo","Wei Chu","Ming Yang","Yuan Qi"],"pdf_url":"https://arxiv.org/pdf/2309.11082v2.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2312.10967v1","updated":"2023-12-18T06:41:23Z","published":"2023-12-18T06:41:23Z","title":"Knowledge Graphs and Pre-trained Language Models enhanced Representation\n Learning for Conversational Recommender Systems","summary":" Conversational recommender systems (CRS) utilize natural language\ninteractions and dialogue history to infer user preferences and provide\naccurate recommendations. Due to the limited conversation context and\nbackground knowledge, existing CRSs rely on external sources such as knowledge\ngraphs to enrich the context and model entities based on their inter-relations.\nHowever, these methods ignore the rich intrinsic information within entities.\nTo address this, we introduce the Knowledge-Enhanced Entity Representation\nLearning (KERL) framework, which leverages both the knowledge graph and a\npre-trained language model to improve the semantic understanding of entities\nfor CRS. In our KERL framework, entity textual descriptions are encoded via a\npre-trained language model, while a knowledge graph helps reinforce the\nrepresentation of these entities. We also employ positional encoding to\neffectively capture the temporal information of entities in a conversation. The\nenhanced entity representation is then used to develop a recommender component\nthat fuses both entity and contextual representations for more informed\nrecommendations, as well as a dialogue component that generates informative\nentity-related information in the response text. A high-quality knowledge graph\nwith aligned entity descriptions is constructed to facilitate our study, namely\nthe Wiki Movie Knowledge Graph (WikiMKG). The experimental results show that\nKERL achieves state-of-the-art results in both recommendation and response\ngeneration tasks.\n","authors":["Zhangchi Qiu","Ye Tao","Shirui Pan","Alan Wee-Chung Liew"],"pdf_url":"https://arxiv.org/pdf/2312.10967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10964v1","updated":"2023-12-18T06:40:24Z","published":"2023-12-18T06:40:24Z","title":"Generative linguistic representation for spoken language identification","summary":" Effective extraction and application of linguistic features are central to\nthe enhancement of spoken Language IDentification (LID) performance. With the\nsuccess of recent large models, such as GPT and Whisper, the potential to\nleverage such pre-trained models for extracting linguistic features for LID\ntasks has become a promising area of research. In this paper, we explore the\nutilization of the decoder-based network from the Whisper model to extract\nlinguistic features through its generative mechanism for improving the\nclassification accuracy in LID tasks. We devised two strategies - one based on\nthe language embedding method and the other focusing on direct optimization of\nLID outputs while simultaneously enhancing the speech recognition tasks. We\nconducted experiments on the large-scale multilingual datasets MLS,\nVoxLingua107, and CommonVoice to test our approach. The experimental results\ndemonstrated the effectiveness of the proposed method on both in-domain and\nout-of-domain datasets for LID tasks.\n","authors":["Peng Shen","Xuguang Lu","Hisashi Kawai"],"pdf_url":"https://arxiv.org/pdf/2312.10964v1.pdf","comment":"Accepted by IEEE ASRU2023"},{"id":"http://arxiv.org/abs/2312.10961v1","updated":"2023-12-18T06:31:13Z","published":"2023-12-18T06:31:13Z","title":"Aspect-Based Sentiment Analysis with Explicit Sentiment Augmentations","summary":" Aspect-based sentiment analysis (ABSA), a fine-grained sentiment\nclassification task, has received much attention recently. Many works\ninvestigate sentiment information through opinion words, such as ''good'' and\n''bad''. However, implicit sentiment widely exists in the ABSA dataset, which\nrefers to the sentence containing no distinct opinion words but still expresses\nsentiment to the aspect term. To deal with implicit sentiment, this paper\nproposes an ABSA method that integrates explicit sentiment augmentations. And\nwe propose an ABSA-specific augmentation method to create such augmentations.\nSpecifically, we post-trains T5 by rule-based data. We employ Syntax Distance\nWeighting and Unlikelihood Contrastive Regularization in the training procedure\nto guide the model to generate an explicit sentiment. Meanwhile, we utilize the\nConstrained Beam Search to ensure the augmentation sentence contains the aspect\nterms. We test ABSA-ESA on two of the most popular benchmarks of ABSA. The\nresults show that ABSA-ESA outperforms the SOTA baselines on implicit and\nexplicit sentiment accuracy.\n","authors":["Jihong Ouyang","Zhiyao Yang","Silong Liang","Bing Wang","Yimeng Wang","Ximing Li"],"pdf_url":"https://arxiv.org/pdf/2312.10961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10959v1","updated":"2023-12-18T06:29:53Z","published":"2023-12-18T06:29:53Z","title":"Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition","summary":" Multi-talker overlapped speech recognition remains a significant challenge,\nrequiring not only speech recognition but also speaker diarization tasks to be\naddressed. In this paper, to better address these tasks, we first introduce\nspeaker labels into an autoregressive transformer-based speech recognition\nmodel to support multi-speaker overlapped speech recognition. Then, to improve\nspeaker diarization, we propose a novel speaker mask branch to detection the\nspeech segments of individual speakers. With the proposed model, we can perform\nboth speech recognition and speaker diarization tasks simultaneously using a\nsingle model. Experimental results on the LibriSpeech-based overlapped dataset\ndemonstrate the effectiveness of the proposed method in both speech recognition\nand speaker diarization tasks, particularly enhancing the accuracy of speaker\ndiarization in relatively complex multi-talker scenarios.\n","authors":["Peng Shen","Xugang Lu","Hisashi Kawai"],"pdf_url":"https://arxiv.org/pdf/2312.10959v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10952v1","updated":"2023-12-18T06:08:51Z","published":"2023-12-18T06:08:51Z","title":"Soft Alignment of Modality Space for End-to-end Speech Translation","summary":" End-to-end Speech Translation (ST) aims to convert speech into target text\nwithin a unified model. The inherent differences between speech and text\nmodalities often impede effective cross-modal and cross-lingual transfer.\nExisting methods typically employ hard alignment (H-Align) of individual speech\nand text segments, which can degrade textual representations. To address this,\nwe introduce Soft Alignment (S-Align), using adversarial training to align the\nrepresentation spaces of both modalities. S-Align creates a modality-invariant\nspace while preserving individual modality quality. Experiments on three\nlanguages from the MuST-C dataset show S-Align outperforms H-Align across\nmultiple tasks and offers translation capabilities on par with specialized\ntranslation models.\n","authors":["Yuhao Zhang","Kaiqi Kou","Bei Li","Chen Xu","Chunliang Zhang","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2312.10952v1.pdf","comment":"Accepted to ICASSP2024"},{"id":"http://arxiv.org/abs/2312.11572v1","updated":"2023-12-18T05:52:05Z","published":"2023-12-18T05:52:05Z","title":"Regularized Conditional Alignment for Multi-Domain Text Classification","summary":" The most successful multi-domain text classification (MDTC) approaches employ\nthe shared-private paradigm to facilitate the enhancement of domain-invariant\nfeatures through domain-specific attributes. Additionally, they employ\nadversarial training to align marginal feature distributions. Nevertheless,\nthese methodologies encounter two primary challenges: (1) Neglecting\nclass-aware information during adversarial alignment poses a risk of\nmisalignment; (2) The limited availability of labeled data across multiple\ndomains fails to ensure adequate discriminative capacity for the model. To\ntackle these issues, we propose a method called Regularized Conditional\nAlignment (RCA) to align the joint distributions of domains and classes, thus\nmatching features within the same category and amplifying the discriminative\nqualities of acquired features. Moreover, we employ entropy minimization and\nvirtual adversarial training to constrain the uncertainty of predictions\npertaining to unlabeled data and enhance the model's robustness. Empirical\nresults on two benchmark datasets demonstrate that our RCA approach outperforms\nstate-of-the-art MDTC techniques.\n","authors":["Juntao Hu","Yuan Wu"],"pdf_url":"https://arxiv.org/pdf/2312.11572v1.pdf","comment":"This paper has been accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.10945v1","updated":"2023-12-18T05:50:10Z","published":"2023-12-18T05:50:10Z","title":"LaViP:Language-Grounded Visual Prompts","summary":" We introduce a language-grounded visual prompting method to adapt the visual\nencoder of vision-language models for downstream tasks. By capitalizing on\nlanguage integration, we devise a parameter-efficient strategy to adjust the\ninput of the visual encoder, eliminating the need to modify or add to the\nmodel's parameters. Due to this design choice, our algorithm can operate even\nin black-box scenarios, showcasing adaptability in situations where access to\nthe model's parameters is constrained. We will empirically demonstrate that,\ncompared to prior art, grounding visual prompts with language enhances both the\naccuracy and speed of adaptation. Moreover, our algorithm excels in\nbase-to-novel class generalization, overcoming limitations of visual prompting\nand exhibiting the capacity to generalize beyond seen classes. We thoroughly\nassess and evaluate our method across a variety of image recognition datasets,\nsuch as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning\nsituations, including few-shot learning, base-to-novel class generalization,\nand transfer learning.\n","authors":["Nilakshan Kunananthaseelan","Jing Zhang","Mehrtash Harandi"],"pdf_url":"https://arxiv.org/pdf/2312.10945v1.pdf","comment":"The 38th Annual AAAI Conference on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2312.08793v2","updated":"2023-12-18T05:23:30Z","published":"2023-12-14T10:27:15Z","title":"Forbidden Facts: An Investigation of Competing Objectives in Llama-2","summary":" LLMs often face competing pressures (for example helpfulness vs.\nharmlessness). To understand how models resolve such conflicts, we study\nLlama-2-chat models on the forbidden fact task. Specifically, we instruct\nLlama-2 to truthfully complete a factual recall statement while forbidding it\nfrom saying the correct answer. This often makes the model give incorrect\nanswers. We decompose Llama-2 into 1000+ components, and rank each one with\nrespect to how useful it is for forbidding the correct answer. We find that in\naggregate, around 35 components are enough to reliably implement the full\nsuppression behavior. However, these components are fairly heterogeneous and\nmany operate using faulty heuristics. We discover that one of these heuristics\ncan be exploited via a manually designed adversarial attack which we call The\nCalifornia Attack. Our results highlight some roadblocks standing in the way of\nbeing able to successfully interpret advanced ML systems. Project website\navailable at https://forbiddenfacts.github.io .\n","authors":["Tony T. Wang","Miles Wang","Kaivalya Hariharan","Nir Shavit"],"pdf_url":"https://arxiv.org/pdf/2312.08793v2.pdf","comment":"Accepted to the ATTRIB and SoLaR workshops at NeurIPS 2023; (v2:\n fixed typos)"},{"id":"http://arxiv.org/abs/2312.08583v2","updated":"2023-12-18T05:08:23Z","published":"2023-12-14T01:06:37Z","title":"ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric\n Strategy for Diverse Generative Tasks","summary":" This study examines 4-bit quantization methods like GPTQ in large language\nmodels (LLMs), highlighting GPTQ's overfitting and limited enhancement in\nZero-Shot tasks. While prior works merely focusing on zero-shot measurement, we\nextend task scope to more generative categories such as code generation and\nabstractive summarization, in which we found that INT4 quantization can\nsignificantly underperform. However, simply shifting to higher precision\nformats like FP6 has been particularly challenging, thus overlooked, due to\npoor performance caused by the lack of sophisticated integration and system\nacceleration strategies on current AI hardware. Our results show that FP6, even\nwith a coarse-grain quantization scheme, performs robustly across various\nalgorithms and tasks, demonstrating its superiority in accuracy and\nversatility. Notably, with the FP6 quantization, \\codestar-15B model performs\ncomparably to its FP16 counterpart in code generation, and for smaller models\nlike the 406M it closely matches their baselines in summarization. Neither can\nbe achieved by INT4. To better accommodate various AI hardware and achieve the\nbest system performance, we propose a novel 4+2 design for FP6 to achieve\nsimilar latency to the state-of-the-art INT4 fine-grain quantization. With our\ndesign, FP6 can become a promising solution to the current 4-bit quantization\nmethods used in LLMs.\n","authors":["Xiaoxia Wu","Haojun Xia","Stephen Youn","Zhen Zheng","Shiyang Chen","Arash Bakhtiari","Michael Wyatt","Reza Yazdani Aminabadi","Yuxiong He","Olatunji Ruwase","Leon Song","Zhewei Yao"],"pdf_url":"https://arxiv.org/pdf/2312.08583v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00347v2","updated":"2023-12-18T04:59:01Z","published":"2023-12-01T04:51:01Z","title":"RTQ: Rethinking Video-language Understanding Based on Image-text Model","summary":" Recent advancements in video-language understanding have been established on\nthe foundation of image-text models, resulting in promising outcomes due to the\nshared knowledge between images and videos. However, video-language\nunderstanding presents unique challenges due to the inclusion of highly complex\nsemantic details, which result in information redundancy, temporal dependency,\nand scene complexity. Current techniques have only partially tackled these\nissues, and our quantitative analysis indicates that some of these methods are\ncomplementary. In light of this, we propose a novel framework called RTQ\n(Refine, Temporal model, and Query), which addresses these challenges\nsimultaneously. The approach involves refining redundant information within\nframes, modeling temporal relations among frames, and querying task-specific\ninformation from the videos. Remarkably, our model demonstrates outstanding\nperformance even in the absence of video-language pre-training, and the results\nare comparable with or superior to those achieved by state-of-the-art\npre-training methods. Code is available at\nhttps://github.com/SCZwangxiao/RTQ-MM2023.\n","authors":["Xiao Wang","Yaoyu Li","Tian Gan","Zheng Zhang","Jingjing Lv","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2312.00347v2.pdf","comment":"Accepted by ACM MM 2023 as Oral representation"},{"id":"http://arxiv.org/abs/2312.01040v2","updated":"2023-12-18T04:57:05Z","published":"2023-12-02T05:54:06Z","title":"From Beginner to Expert: Modeling Medical Knowledge into General LLMs","summary":" Recently, large language model (LLM) based artificial intelligence (AI)\nsystems have demonstrated remarkable capabilities in natural language\nunderstanding and generation. However, these models face a significant\nchallenge when it comes to sensitive applications, such as reasoning over\nmedical knowledge and answering medical questions in a physician-like manner.\nPrior studies attempted to overcome this challenge by increasing the model size\n(>100B) to learn more general medical knowledge, while there is still room for\nimprovement in LLMs with smaller-scale model sizes (<100B). In this work, we\nstart from a pre-trained general LLM model (AntGLM-10B) and fine-tune it from a\nmedical beginner towards a medical expert (called AntGLM-Med-10B), which\nleverages a 3-stage optimization procedure, i.e., general medical knowledge\ninjection, medical domain instruction tuning, and specific medical task\nadaptation. Our contributions are threefold: (1) We specifically investigate\nhow to adapt a pre-trained general LLM in medical domain, especially for a\nspecific medical task. (2) We collect and construct large-scale medical\ndatasets for each stage of the optimization process. These datasets encompass\nvarious data types and tasks, such as question-answering, medical reasoning,\nmulti-choice questions, and medical conversations. (3) Specifically for\nmulti-choice questions in the medical domain, we propose a novel\nVerification-of-Choice approach for prompting engineering, which significantly\nenhances the reasoning ability of LLMs. Remarkably, by combining the above\napproaches, our AntGLM-Med-10B model can outperform the most of LLMs on\nPubMedQA, including both general and medical LLMs, even when these LLMs have\nlarger model size.\n","authors":["Qiang Li","Xiaoyan Yang","Haowen Wang","Qin Wang","Lei Liu","Junjie Wang","Yang Zhang","Mingyuan Chu","Sen Hu","Yicheng Chen","Yue Shen","Cong Fan","Wangshu Zhang","Teng Xu","Jinjie Gu","Jing Zheng","Guannan Zhang Ant Group"],"pdf_url":"https://arxiv.org/pdf/2312.01040v2.pdf","comment":"Developed by Ant Group for PubMedQA leaderboard"},{"id":"http://arxiv.org/abs/2305.03453v4","updated":"2023-12-18T04:49:01Z","published":"2023-05-05T11:56:30Z","title":"T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large\n Language Model Signals for Science Question Answering","summary":" Large Language Models (LLMs) have recently demonstrated exceptional\nperformance in various Natural Language Processing (NLP) tasks. They have also\nshown the ability to perform chain-of-thought (CoT) reasoning to solve complex\nproblems. Recent studies have explored CoT reasoning in complex multimodal\nscenarios, such as the science question answering task, by fine-tuning\nmultimodal models with high-quality human-annotated CoT rationales. However,\ncollecting high-quality COT rationales is usually time-consuming and costly.\nBesides, the annotated rationales are hardly accurate due to the external\nessential information missed. To address these issues, we propose a novel\nmethod termed T-SciQ that aims at teaching science question answering with LLM\nsignals. The T-SciQ approach generates high-quality CoT rationales as teaching\nsignals and is advanced to train much smaller models to perform CoT reasoning\nin complex modalities. Additionally, we introduce a novel data mixing strategy\nto produce more effective teaching data samples for simple and complex science\nquestion answer problems. Extensive experimental results show that our T-SciQ\nmethod achieves a new state-of-the-art performance on the ScienceQA benchmark,\nwith an accuracy of 96.18%. Moreover, our approach outperforms the most\npowerful fine-tuned baseline by 4.5%. The code is publicly available at\nhttps://github.com/T-SciQ/T-SciQ.\n","authors":["Lei Wang","Yi Hu","Jiabang He","Xing Xu","Ning Liu","Hui Liu","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2305.03453v4.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2308.09936v3","updated":"2023-12-18T04:33:17Z","published":"2023-08-19T07:53:43Z","title":"BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual\n Questions","summary":" Vision Language Models (VLMs), which extend Large Language Models (LLM) by\nincorporating visual understanding capability, have demonstrated significant\nadvancements in addressing open-ended visual question-answering (VQA) tasks.\nHowever, these models cannot accurately interpret images infused with text, a\ncommon occurrence in real-world scenarios. Standard procedures for extracting\ninformation from images often involve learning a fixed set of query embeddings.\nThese embeddings are designed to encapsulate image contexts and are later used\nas soft prompt inputs in LLMs. Yet, this process is limited to the token count,\npotentially curtailing the recognition of scenes with text-rich context. To\nimprove upon them, the present study introduces BLIVA: an augmented version of\nInstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings\nfrom InstructBLIP and also directly projects encoded patch embeddings into the\nLLM, a technique inspired by LLaVA. This approach assists the model to capture\nintricate details potentially missed during the query decoding process.\nEmpirical evidence demonstrates that our model, BLIVA, significantly enhances\nperformance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA\nbenchmark) and in undertaking general (not particularly text-rich) VQA\nbenchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved\n17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME),\ncomparing to our baseline InstructBLIP. BLIVA demonstrates significant\ncapability in decoding real-world images, irrespective of text presence. To\ndemonstrate the broad industry applications enabled by BLIVA, we evaluate the\nmodel using a new dataset comprising YouTube thumbnails paired with\nquestion-answer sets across 11 diverse categories. Our code and models are\nfreely accessible at https://github.com/mlpc-ucsd/BLIVA.\n","authors":["Wenbo Hu","Yifan Xu","Yi Li","Weiyue Li","Zeyuan Chen","Zhuowen Tu"],"pdf_url":"https://arxiv.org/pdf/2308.09936v3.pdf","comment":"Accepted at AAAI Conference on Artificial Intelligence (AAAI-24)"},{"id":"http://arxiv.org/abs/2311.08189v3","updated":"2023-12-18T04:01:35Z","published":"2023-11-14T14:22:47Z","title":"All Data on the Table: Novel Dataset and Benchmark for Cross-Modality\n Scientific Information Extraction","summary":" Extracting key information from scientific papers has the potential to help\nresearchers work more efficiently and accelerate the pace of scientific\nprogress. Over the last few years, research on Scientific Information\nExtraction (SciIE) witnessed the release of several new systems and benchmarks.\nHowever, existing paper-focused datasets mostly focus only on specific parts of\na manuscript (e.g., abstracts) and are single-modality (i.e., text- or\ntable-only), due to complex processing and expensive annotations. Moreover,\ncore information can be present in either text or tables or across both. To\nclose this gap in data availability and enable cross-modality IE, while\nalleviating labeling costs, we propose a semi-supervised pipeline for\nannotating entities in text, as well as entities and relations in tables, in an\niterative procedure. Based on this pipeline, we release novel resources for the\nscientific community, including a high-quality benchmark, a large-scale corpus,\nand a semi-supervised annotation pipeline. We further report the performance of\nstate-of-the-art IE models on the proposed benchmark dataset, as a baseline.\nLastly, we explore the potential capability of large language models such as\nChatGPT for the current task. Our new dataset, results, and analysis validate\nthe effectiveness and efficiency of our semi-supervised pipeline, and we\ndiscuss its remaining limitations.\n","authors":["Yuhan Li","Jian Wu","Zhiwei Yu","Börje F. Karlsson","Wei Shen","Manabu Okumura","Chin-Yew Lin"],"pdf_url":"https://arxiv.org/pdf/2311.08189v3.pdf","comment":"Work in progress; 17 pages, 6 figures, 11 tables"},{"id":"http://arxiv.org/abs/2311.16502v2","updated":"2023-12-18T03:47:39Z","published":"2023-11-27T17:33:21Z","title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning\n Benchmark for Expert AGI","summary":" We introduce MMMU: a new benchmark designed to evaluate multimodal models on\nmassive multi-discipline tasks demanding college-level subject knowledge and\ndeliberate reasoning. MMMU includes 11.5K meticulously collected multimodal\nquestions from college exams, quizzes, and textbooks, covering six core\ndisciplines: Art & Design, Business, Science, Health & Medicine, Humanities &\nSocial Science, and Tech & Engineering. These questions span 30 subjects and\n183 subfields, comprising 30 highly heterogeneous image types, such as charts,\ndiagrams, maps, tables, music sheets, and chemical structures. Unlike existing\nbenchmarks, MMMU focuses on advanced perception and reasoning with\ndomain-specific knowledge, challenging models to perform tasks akin to those\nfaced by experts. The evaluation of 14 open-source LMMs as well as the\nproprietary GPT-4V(ision) and Gemini highlights the substantial challenges\nposed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve\naccuracies of 56% and 59% respectively, indicating significant room for\nimprovement. We believe MMMU will stimulate the community to build\nnext-generation multimodal foundation models towards expert artificial general\nintelligence.\n","authors":["Xiang Yue","Yuansheng Ni","Kai Zhang","Tianyu Zheng","Ruoqi Liu","Ge Zhang","Samuel Stevens","Dongfu Jiang","Weiming Ren","Yuxuan Sun","Cong Wei","Botao Yu","Ruibin Yuan","Renliang Sun","Ming Yin","Boyuan Zheng","Zhenzhu Yang","Yibo Liu","Wenhao Huang","Huan Sun","Yu Su","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2311.16502v2.pdf","comment":"117 pages, 99 figures"},{"id":"http://arxiv.org/abs/2312.10905v1","updated":"2023-12-18T03:21:58Z","published":"2023-12-18T03:21:58Z","title":"Satellite Captioning: Large Language Models to Augment Labeling","summary":" With the growing capabilities of modern object detection networks and\ndatasets to train them, it has gotten more straightforward and, importantly,\nless laborious to get up and running with a model that is quite adept at\ndetecting any number of various objects. However, while image datasets for\nobject detection have grown and continue to proliferate (the current most\nextensive public set, ImageNet, contains over 14m images with over 14m\ninstances), the same cannot be said for textual caption datasets. While they\nhave certainly been growing in recent years, caption datasets present a much\nmore difficult challenge due to language differences, grammar, and the time it\ntakes for humans to generate them. Current datasets have certainly provided\nmany instances to work with, but it becomes problematic when a captioner may\nhave a more limited vocabulary, one may not be adequately fluent in the\nlanguage, or there are simple grammatical mistakes. These difficulties are\nincreased when the images get more specific, such as remote sensing images.\nThis paper aims to address this issue of potential information and\ncommunication shortcomings in caption datasets. To provide a more precise\nanalysis, we specify our domain of images to be remote sensing images in the\nRSICD dataset and experiment with the captions provided here. Our findings\nindicate that ChatGPT grammar correction is a simple and effective way to\nincrease the performance accuracy of caption models by making data captions\nmore diverse and grammatically correct.\n","authors":["Grant Rosario","David Noever"],"pdf_url":"https://arxiv.org/pdf/2312.10905v1.pdf","comment":"9 pages, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2308.10144v2","updated":"2023-12-18T03:11:52Z","published":"2023-08-20T03:03:34Z","title":"ExpeL: LLM Agents Are Experiential Learners","summary":" The recent surge in research interest in applying large language models\n(LLMs) to decision-making tasks has flourished by leveraging the extensive\nworld knowledge embedded in LLMs. While there is a growing demand to tailor\nLLMs for custom decision-making tasks, finetuning them for specific tasks is\nresource-intensive and may diminish the model's generalization capabilities.\nMoreover, state-of-the-art language models like GPT-4 and Claude are primarily\naccessible through API calls, with their parametric weights remaining\nproprietary and unavailable to the public. This scenario emphasizes the growing\nneed for new methodologies that allow learning from agent experiences without\nrequiring parametric updates. To address these problems, we introduce the\nExperiential Learning (ExpeL) agent. Our agent autonomously gathers experiences\nand extracts knowledge using natural language from a collection of training\ntasks. At inference, the agent recalls its extracted insights and past\nexperiences to make informed decisions. Our empirical results highlight the\nrobust learning efficacy of the ExpeL agent, indicating a consistent\nenhancement in its performance as it accumulates experiences. We further\nexplore the emerging capabilities and transfer learning potential of the ExpeL\nagent through qualitative observations and additional experiments.\n","authors":["Andrew Zhao","Daniel Huang","Quentin Xu","Matthieu Lin","Yong-Jin Liu","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2308.10144v2.pdf","comment":"Accepted by the 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI-24)"},{"id":"http://arxiv.org/abs/2312.10897v1","updated":"2023-12-18T02:55:14Z","published":"2023-12-18T02:55:14Z","title":"Generalized Category Discovery with Large Language Models in the Loop","summary":" Generalized Category Discovery (GCD) is a crucial task that aims to recognize\nboth known and novel categories from a set of unlabeled data by utilizing a few\nlabeled data with only known categories. Due to the lack of supervision and\ncategory information, current methods usually perform poorly on novel\ncategories and struggle to reveal semantic meanings of the discovered clusters,\nwhich limits their applications in the real world. To mitigate above issues, we\npropose Loop, an end-to-end active-learning framework that introduces Large\nLanguage Models (LLMs) into the training loop, which can boost model\nperformance and generate category names without relying on any human efforts.\nSpecifically, we first propose Local Inconsistent Sampling (LIS) to select\nsamples that have a higher probability of falling to wrong clusters, based on\nneighborhood prediction consistency and entropy of cluster assignment\nprobabilities. Then we propose a Scalable Query strategy to allow LLMs to\nchoose true neighbors of the selected samples from multiple candidate samples.\nBased on the feedback from LLMs, we perform Refined Neighborhood Contrastive\nLearning (RNCL) to pull samples and their neighbors closer to learn\nclustering-friendly representations. Finally, we select representative samples\nfrom clusters corresponding to novel categories to allow LLMs to generate\ncategory names for them. Extensive experiments on three benchmark datasets show\nthat Loop outperforms SOTA models by a large margin and generates accurate\ncategory names for the discovered clusters. We will release our code and data\nafter publication.\n","authors":["Wenbin An","Wenkai Shi","Feng Tian","Haonan Lin","QianYing Wang","Yaqiang Wu","Mingxiang Cai","Luyan Wang","Yan Chen","Haiping Zhu","Ping Chen"],"pdf_url":"https://arxiv.org/pdf/2312.10897v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2312.12464v1","updated":"2023-12-18T21:11:17Z","published":"2023-12-18T21:11:17Z","title":"Towards Better Serialization of Tabular Data for Few-shot Classification","summary":" We present a study on the integration of Large Language Models (LLMs) in\ntabular data classification, emphasizing an efficient framework. Building upon\nexisting work done in TabLLM (arXiv:2210.10723), we introduce three novel\nserialization techniques, including the standout LaTeX serialization method.\nThis method significantly boosts the performance of LLMs in processing\ndomain-specific datasets, Our method stands out for its memory efficiency and\nability to fully utilize complex data structures. Through extensive\nexperimentation, including various serialization approaches like feature\ncombination and importance, we demonstrate our work's superiority in accuracy\nand efficiency over traditional models.\n","authors":["Sukriti Jaitly","Tanay Shah","Ashish Shugani","Razik Singh Grewal"],"pdf_url":"https://arxiv.org/pdf/2312.12464v1.pdf","comment":"4 pages, 2 figures"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.10890v1","updated":"2023-12-18T02:37:30Z","published":"2023-12-18T02:37:30Z","title":"Low-latency Space-time Supersampling for Real-time Rendering","summary":" With the rise of real-time rendering and the evolution of display devices,\nthere is a growing demand for post-processing methods that offer\nhigh-resolution content in a high frame rate. Existing techniques often suffer\nfrom quality and latency issues due to the disjointed treatment of frame\nsupersampling and extrapolation. In this paper, we recognize the shared context\nand mechanisms between frame supersampling and extrapolation, and present a\nnovel framework, Space-time Supersampling (STSS). By integrating them into a\nunified framework, STSS can improve the overall quality with lower latency. To\nimplement an efficient architecture, we treat the aliasing and warping holes\nunified as reshading regions and put forth two key components to compensate the\nregions, namely Random Reshading Masking (RRM) and Efficient Reshading Module\n(ERM). Extensive experiments demonstrate that our approach achieves superior\nvisual fidelity compared to state-of-the-art (SOTA) methods. Notably, the\nperformance is achieved within only 4ms, saving up to 75\\% of time against the\nconventional two-stage pipeline that necessitates 17ms.\n","authors":["Ruian He","Shili Zhou","Yuqi Sun","Ri Cheng","Weimin Tan","Bo Yan"],"pdf_url":"https://arxiv.org/pdf/2312.10890v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.09909v2","updated":"2023-12-18T02:36:23Z","published":"2023-12-15T16:17:34Z","title":"TMP: Temporal Motion Propagation for Online Video Super-Resolution","summary":" Online video super-resolution (online-VSR) highly relies on an effective\nalignment module to aggregate temporal information, while the strict latency\nrequirement makes accurate and efficient alignment very challenging. Though\nmuch progress has been achieved, most of the existing online-VSR methods\nestimate the motion fields of each frame separately to perform alignment, which\nis computationally redundant and ignores the fact that the motion fields of\nadjacent frames are correlated. In this work, we propose an efficient Temporal\nMotion Propagation (TMP) method, which leverages the continuity of motion field\nto achieve fast pixel-level alignment among consecutive frames. Specifically,\nwe first propagate the offsets from previous frames to the current frame, and\nthen refine them in the neighborhood, which significantly reduces the matching\nspace and speeds up the offset estimation process. Furthermore, to enhance the\nrobustness of alignment, we perform spatial-wise weighting on the warped\nfeatures, where the positions with more precise offsets are assigned higher\nimportance. Experiments on benchmark datasets demonstrate that the proposed TMP\nmethod achieves leading online-VSR accuracy as well as inference speed. The\nsource code of TMP can be found at https://github.com/xtudbxk/TMP.\n","authors":["Zhengqiang Zhang","Ruihuang Li","Shi Guo","Yang Cao","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.09909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08916v2","updated":"2023-12-18T02:06:58Z","published":"2023-12-14T13:21:52Z","title":"Progressive Feature Self-reinforcement for Weakly Supervised Semantic\n Segmentation","summary":" Compared to conventional semantic segmentation with pixel-level supervision,\nWeakly Supervised Semantic Segmentation (WSSS) with image-level labels poses\nthe challenge that it always focuses on the most discriminative regions,\nresulting in a disparity between fully supervised conditions. A typical\nmanifestation is the diminished precision on the object boundaries, leading to\na deteriorated accuracy of WSSS. To alleviate this issue, we propose to\nadaptively partition the image content into deterministic regions (e.g.,\nconfident foreground and background) and uncertain regions (e.g., object\nboundaries and misclassified categories) for separate processing. For uncertain\ncues, we employ an activation-based masking strategy and seek to recover the\nlocal information with self-distilled knowledge. We further assume that the\nunmasked confident regions should be robust enough to preserve the global\nsemantics. Building upon this, we introduce a complementary self-enhancement\nmethod that constrains the semantic consistency between these confident regions\nand an augmented image with the same class labels. Extensive experiments\nconducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed\nsingle-stage approach for WSSS not only outperforms state-of-the-art benchmarks\nremarkably but also surpasses multi-stage methodologies that trade complexity\nfor accuracy. The code can be found at\n\\url{https://github.com/Jessie459/feature-self-reinforcement}.\n","authors":["Jingxuan He","Lechao Cheng","Chaowei Fang","Zunlei Feng","Tingting Mu","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2312.08916v2.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.10877v1","updated":"2023-12-18T01:49:42Z","published":"2023-12-18T01:49:42Z","title":"Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial\n Animation","summary":" Speech-driven 3D facial animation aims to synthesize vivid facial animations\nthat accurately synchronize with speech and match the unique speaking style.\nHowever, existing works primarily focus on achieving precise lip\nsynchronization while neglecting to model the subject-specific speaking style,\noften resulting in unrealistic facial animations. To the best of our knowledge,\nthis work makes the first attempt to explore the coupled information between\nthe speaking style and the semantic content in facial motions. Specifically, we\nintroduce an innovative speaking style disentanglement method, which enables\narbitrary-subject speaking style encoding and leads to a more realistic\nsynthesis of speech-driven facial animations. Subsequently, we propose a novel\nframework called \\textbf{Mimic} to learn disentangled representations of the\nspeaking style and content from facial motions by building two latent spaces\nfor style and content, respectively. Moreover, to facilitate disentangled\nrepresentation learning, we introduce four well-designed constraints: an\nauxiliary style classifier, an auxiliary inverse classifier, a content\ncontrastive loss, and a pair of latent cycle losses, which can effectively\ncontribute to the construction of the identity-related style space and\nsemantic-related content space. Extensive qualitative and quantitative\nexperiments conducted on three publicly available datasets demonstrate that our\napproach outperforms state-of-the-art methods and is capable of capturing\ndiverse speaking styles for speech-driven 3D facial animation. The source code\nand supplementary video are publicly available at:\nhttps://zeqing-wang.github.io/Mimic/\n","authors":["Hui Fu","Zeqing Wang","Ke Gong","Keze Wang","Tianshui Chen","Haojie Li","Haifeng Zeng","Wenxiong Kang"],"pdf_url":"https://arxiv.org/pdf/2312.10877v1.pdf","comment":"7 pages, 6 figures, accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2312.10872v1","updated":"2023-12-18T01:23:22Z","published":"2023-12-18T01:23:22Z","title":"Country-Scale Cropland Mapping in Data-Scarce Settings Using Deep\n Learning: A Case Study of Nigeria","summary":" Cropland maps are a core and critical component of remote-sensing-based\nagricultural monitoring, providing dense and up-to-date information about\nagricultural development. Machine learning is an effective tool for large-scale\nagricultural mapping, but relies on geo-referenced ground-truth data for model\ntraining and testing, which can be scarce or time-consuming to obtain. In this\nstudy, we explore the usefulness of combining a global cropland dataset and a\nhand-labeled dataset to train machine learning models for generating a new\ncropland map for Nigeria in 2020 at 10 m resolution. We provide the models with\npixel-wise time series input data from remote sensing sources such as\nSentinel-1 and 2, ERA5 climate data, and DEM data, in addition to binary labels\nindicating cropland presence. We manually labeled 1827 evenly distributed\npixels across Nigeria, splitting them into 50\\% training, 25\\% validation, and\n25\\% test sets used to fit the models and test our output map. We evaluate and\ncompare the performance of single- and multi-headed Long Short-Term Memory\n(LSTM) neural network classifiers, a Random Forest classifier, and three\nexisting 10 m resolution global land cover maps (Google's Dynamic World, ESRI's\nLand Cover, and ESA's WorldCover) on our proposed test set. Given the regional\nvariations in cropland appearance, we additionally experimented with excluding\nor sub-setting the global crowd-sourced Geowiki cropland dataset, to\nempirically assess the trade-off between data quantity and data quality in\nterms of the similarity to the target data distribution of Nigeria. We find\nthat the existing WorldCover map performs the best with an F1-score of 0.825\nand accuracy of 0.870 on the test set, followed by a single-headed LSTM model\ntrained with our hand-labeled training samples and the Geowiki data points in\nNigeria, with a F1-score of 0.814 and accuracy of 0.842.\n","authors":["Joaquin Gajardo","Michele Volpi","Daniel Onwude","Thijs Defraeye"],"pdf_url":"https://arxiv.org/pdf/2312.10872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10912v2","updated":"2023-12-18T00:37:52Z","published":"2023-10-17T01:12:08Z","title":"Towards Training-free Open-world Segmentation via Image Prompt\n Foundation Models","summary":" The realm of computer vision has witnessed a paradigm shift with the advent\nof foundational models, mirroring the transformative influence of large\nlanguage models in the domain of natural language processing. This paper delves\ninto the exploration of open-world segmentation, presenting a novel approach\ncalled Image Prompt Segmentation (IPSeg) that harnesses the power of vision\nfoundational models. IPSeg lies the principle of a training-free paradigm,\nwhich capitalizes on image prompt techniques. Specifically, IPSeg utilizes a\nsingle image containing a subjective visual concept as a flexible prompt to\nquery vision foundation models like DINOv2 and Stable Diffusion. Our approach\nextracts robust features for the prompt image and input image, then matches the\ninput representations to the prompt representations via a novel feature\ninteraction module to generate point prompts highlighting target objects in the\ninput image. The generated point prompts are further utilized to guide the\nSegment Anything Model to segment the target object in the input image. The\nproposed method stands out by eliminating the need for exhaustive training\nsessions, thereby offering a more efficient and scalable solution. Experiments\non COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for\nflexible open-world segmentation using intuitive image prompts. This work\npioneers tapping foundation models for open-world understanding through visual\nconcepts conveyed in images.\n","authors":["Lv Tang","Peng-Tao Jiang","Hao-Ke Xiao","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2310.10912v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10854v1","updated":"2023-12-18T00:05:28Z","published":"2023-12-18T00:05:28Z","title":"The Right Losses for the Right Gains: Improving the Semantic Consistency\n of Deep Text-to-Image Generation with Distribution-Sensitive Losses","summary":" One of the major challenges in training deep neural networks for\ntext-to-image generation is the significant linguistic discrepancy between\nground-truth captions of each image in most popular datasets. The large\ndifference in the choice of words in such captions results in synthesizing\nimages that are semantically dissimilar to each other and to their ground-truth\ncounterparts. Moreover, existing models either fail to generate the\nfine-grained details of the image or require a huge number of parameters that\nrenders them inefficient for text-to-image synthesis. To fill this gap in the\nliterature, we propose using the contrastive learning approach with a novel\ncombination of two loss functions: fake-to-fake loss to increase the semantic\nconsistency between generated images of the same caption, and fake-to-real loss\nto reduce the gap between the distributions of real images and fake ones. We\ntest this approach on two baseline models: SSAGAN and AttnGAN (with style\nblocks to enhance the fine-grained details of the images.) Results show that\nour approach improves the qualitative results on AttnGAN with style blocks on\nthe CUB dataset. Additionally, on the challenging COCO dataset, our approach\nachieves competitive results against the state-of-the-art Lafite model,\noutperforms the FID score of SSAGAN model by 44.\n","authors":["Mahmoud Ahmed","Omer Moussa","Ismail Shaheen","Mohamed Abdelfattah","Amr Abdalla","Marwan Eid","Hesham Eraqi","Mohamed Moustafa"],"pdf_url":"https://arxiv.org/pdf/2312.10854v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11748v1","updated":"2023-12-18T23:21:00Z","published":"2023-12-18T23:21:00Z","title":"Ultrasound Image Enhancement using CycleGAN and Perceptual Loss","summary":" Purpose: The objective of this work is to introduce an advanced framework\ndesigned to enhance ultrasound images, especially those captured by portable\nhand-held devices, which often produce lower quality images due to hardware\nconstraints. Additionally, this framework is uniquely capable of effectively\nhandling non-registered input ultrasound image pairs, addressing a common\nchallenge in medical imaging. Materials and Methods: In this retrospective\nstudy, we utilized an enhanced generative adversarial network (CycleGAN) model\nfor ultrasound image enhancement across five organ systems. Perceptual loss,\nderived from deep features of pretrained neural networks, is applied to ensure\nthe human-perceptual quality of the enhanced images. These images are compared\nwith paired images acquired from high resolution devices to demonstrate the\nmodel's ability to generate realistic high-quality images across organ systems.\nResults: Preliminary validation of the framework reveals promising performance\nmetrics. The model generates images that result in a Structural Similarity\nIndex (SSI) score of 0.722, Locally Normalized Cross-Correlation (LNCC) score\nof 0.902 and 28.802 for the Peak Signal-to-Noise Ratio (PSNR) metric.\nConclusion: This work presents a significant advancement in medical imaging\nthrough the development of a CycleGAN model enhanced with Perceptual Loss (PL),\neffectively bridging the quality gap between ultrasound images from varied\ndevices. By training on paired images, the model not only improves image\nquality but also ensures the preservation of vital anatomic structural content.\nThis approach may improve equity in access to healthcare by enhancing portable\ndevice capabilities, although further validation and optimizations are\nnecessary for broader clinical application.\n","authors":["Shreeram Athreya","Ashwath Radhachandran","Vedrana Ivezić","Vivek Sant","Corey W. Arnold","William Speier"],"pdf_url":"https://arxiv.org/pdf/2312.11748v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2205.03553v3","updated":"2023-12-18T23:04:51Z","published":"2022-05-07T04:55:05Z","title":"From heavy rain removal to detail restoration: A faster and better\n network","summary":" The profound accumulation of precipitation during intense rainfall events can\nmarkedly degrade the quality of images, leading to the erosion of textural\ndetails. Despite the improvements observed in existing learning-based methods\nspecialized for heavy rain removal, it is discerned that a significant\nproportion of these methods tend to overlook the precise reconstruction of the\nintricate details. In this work, we introduce a simple dual-stage progressive\nenhancement network, denoted as DPENet, aiming to achieve effective deraining\nwhile preserving the structural accuracy of rain-free images. This approach\ncomprises two key modules, a rain streaks removal network (R$^2$Net) focusing\non accurate rain removal, and a details reconstruction network (DRNet) designed\nto recover the textural details of rain-free images. Firstly, we introduce a\ndilated dense residual block (DDRB) within R$^2$Net, enabling the aggregation\nof high-level and low-level features. Secondly, an enhanced residual pixel-wise\nattention block (ERPAB) is integrated into DRNet to facilitate the\nincorporation of contextual information. To further enhance the fidelity of our\napproach, we employ a comprehensive loss function that accentuates both the\nmarginal and regional accuracy of rain-free images. Extensive experiments\nconducted on publicly available benchmarks demonstrates the noteworthy\nefficiency and effectiveness of our proposed DPENet. The source code and\npre-trained models are currently available at\n\\url{https://github.com/chdwyb/DPENet}.\n","authors":["Yuanbo Wen","Tao Gao","Jing Zhang","Kaihao Zhang","Ting Chen"],"pdf_url":"https://arxiv.org/pdf/2205.03553v3.pdf","comment":"Accepted by Pattern Recognition"},{"id":"http://arxiv.org/abs/2309.10625v2","updated":"2023-12-18T21:50:51Z","published":"2023-09-19T14:04:04Z","title":"NoisyNN: Exploring the Influence of Information Entropy Change in\n Learning Systems","summary":" We explore the impact of entropy change in deep learning systems via noise\ninjection at different levels, i.e., the latent space and input image. The\nseries of models that employ our methodology are collectively known as Noisy\nNeural Networks (NoisyNN), with examples such as NoisyViT and NoisyCNN. Noise\nis conventionally viewed as a harmful perturbation in various deep learning\narchitectures, such as convolutional neural networks (CNNs) and vision\ntransformers (ViTs), as well as different learning tasks like image\nclassification and transfer learning. However, this work shows noise can be an\neffective way to change the entropy of the learning system. We demonstrate that\nspecific noise can boost the performance of various deep architectures under\ncertain conditions. We theoretically prove the enhancement gained from positive\nnoise by reducing the task complexity defined by information entropy and\nexperimentally show the significant performance gain in large image datasets,\nsuch as the ImageNet. Herein, we use the information entropy to define the\ncomplexity of the task. We categorize the noise into two types, positive noise\n(PN) and harmful noise (HN), based on whether the noise can help reduce the\ncomplexity of the task. Extensive experiments of CNNs and ViTs have shown\nperformance improvements by proactively injecting positive noise, where we\nachieved an unprecedented top 1 accuracy of over 95$\\%$ on ImageNet. Both\ntheoretical analysis and empirical evidence have confirmed that the presence of\npositive noise, can benefit the learning process, while the traditionally\nperceived harmful noise indeed impairs deep learning models. The different\nroles of noise offer new explanations for deep models on specific tasks and\nprovide a new paradigm for improving model performance. Moreover, it reminds us\nthat we can influence the performance of learning systems via information\nentropy change.\n","authors":["Xiaowei Yu","Yao Xue","Lu Zhang","Li Wang","Tianming Liu","Dajiang Zhu"],"pdf_url":"https://arxiv.org/pdf/2309.10625v2.pdf","comment":"Information Entropy, NoisyNN, ViT, CNN"},{"id":"http://arxiv.org/abs/2312.11716v1","updated":"2023-12-18T21:27:34Z","published":"2023-12-18T21:27:34Z","title":"Squeezed Edge YOLO: Onboard Object Detection on Edge Devices","summary":" Demand for efficient onboard object detection is increasing due to its key\nrole in autonomous navigation. However, deploying object detection models such\nas YOLO on resource constrained edge devices is challenging due to the high\ncomputational requirements of such models. In this paper, an compressed object\ndetection model named Squeezed Edge YOLO is examined. This model is compressed\nand optimized to kilobytes of parameters in order to fit onboard such edge\ndevices. To evaluate Squeezed Edge YOLO, two use cases - human and shape\ndetection - are used to show the model accuracy and performance. Moreover, the\nmodel is deployed onboard a GAP8 processor with 8 RISC-V cores and an NVIDIA\nJetson Nano with 4GB of memory. Experimental results show Squeezed Edge YOLO\nmodel size is optimized by a factor of 8x which leads to 76% improvements in\nenergy efficiency and 3.3x faster throughout.\n","authors":["Edward Humes","Mozhgan Navardi","Tinoosh Mohsenin"],"pdf_url":"https://arxiv.org/pdf/2312.11716v1.pdf","comment":"ML with New Compute Paradigms (MLNCP) Workshop at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.11707v1","updated":"2023-12-18T21:07:03Z","published":"2023-12-18T21:07:03Z","title":"Unified framework for diffusion generative models in SO(3): applications\n in computer vision and astrophysics","summary":" Diffusion-based generative models represent the current state-of-the-art for\nimage generation. However, standard diffusion models are based on Euclidean\ngeometry and do not translate directly to manifold-valued data. In this work,\nwe develop extensions of both score-based generative models (SGMs) and\nDenoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D\nrotations, SO(3). SO(3) is of particular interest in many disciplines such as\nrobotics, biochemistry and astronomy/cosmology science. Contrary to more\ngeneral Riemannian manifolds, SO(3) admits a tractable solution to heat\ndiffusion, and allows us to implement efficient training of diffusion models.\nWe apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and\ndemonstrate state-of-the-art results. Additionally, we demonstrate the\npracticality of our model on pose estimation tasks and in predicting correlated\ngalaxy orientations for astrophysics/cosmology.\n","authors":["Yesukhei Jagvaral","Francois Lanusse","Rachel Mandelbaum"],"pdf_url":"https://arxiv.org/pdf/2312.11707v1.pdf","comment":"Accepted at AAAI-2024 Main Track"},{"id":"http://arxiv.org/abs/2312.07374v3","updated":"2023-12-18T20:17:55Z","published":"2023-12-12T15:43:36Z","title":"Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt\n for Segmenting Camouflaged Objects","summary":" Camouflaged object detection (COD) approaches heavily rely on pixel-level\nannotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse\nannotations like scribbles or points to reduce annotation effort, but this can\nlead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable\nsegmentation ability with sparse prompts like points. However, manual prompt is\nnot always feasible, as it may not be accessible in real-world application.\nAdditionally, it only provides localization information instead of semantic\none, which can intrinsically cause ambiguity in interpreting the targets. In\nthis work, we aim to eliminate the need for manual prompt. The key idea is to\nemploy Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts\nusing the semantic information given by a generic text prompt. To that end, we\nintroduce a test-time adaptation per-instance mechanism called Generalizable\nSAM (GenSAM) to automatically enerate and optimize visual prompts the generic\ntask prompt for WSCOD. In particular, CCTP maps a single generic text prompt\nonto image-specific consensus foreground and background heatmaps using\nvision-language models, acquiring reliable visual prompts. Moreover, to\ntest-time adapt the visual prompts, we further propose Progressive Mask\nGeneration (PMG) to iteratively reweight the input image, guiding the model to\nfocus on the targets in a coarse-to-fine manner. Crucially, all network\nparameters are fixed, avoiding the need for additional training. Experiments\ndemonstrate the superiority of GenSAM. Experiments on three benchmarks\ndemonstrate that GenSAM outperforms point supervision approaches and achieves\ncomparable results to scribble supervision ones, solely relying on general task\ndescriptions as prompts. our codes is in: https://lwpyh.github.io/GenSAM/.\n","authors":["Jian Hu","Jiayi Lin","Weitong Cai","Shaogang Gong"],"pdf_url":"https://arxiv.org/pdf/2312.07374v3.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.11666v1","updated":"2023-12-18T19:19:32Z","published":"2023-12-18T19:19:32Z","title":"HAAR: Text-Conditioned Generative Model of 3D Strand-based Human\n Hairstyles","summary":" We present HAAR, a new strand-based generative model for 3D human hairstyles.\nSpecifically, based on textual inputs, HAAR produces 3D hairstyles that could\nbe used as production-level assets in modern computer graphics engines. Current\nAI-based generative models take advantage of powerful 2D priors to reconstruct\n3D content in the form of point clouds, meshes, or volumetric functions.\nHowever, by using the 2D priors, they are intrinsically limited to only\nrecovering the visual parts. Highly occluded hair structures can not be\nreconstructed with those methods, and they only model the ''outer shell'',\nwhich is not ready to be used in physics-based rendering or simulation\npipelines. In contrast, we propose a first text-guided generative method that\nuses 3D hair strands as an underlying representation. Leveraging 2D visual\nquestion-answering (VQA) systems, we automatically annotate synthetic hair\nmodels that are generated from a small set of artist-created hairstyles. This\nallows us to train a latent diffusion model that operates in a common hairstyle\nUV space. In qualitative and quantitative studies, we demonstrate the\ncapabilities of the proposed model and compare it to existing hairstyle\ngeneration approaches.\n","authors":["Vanessa Sklyarova","Egor Zakharov","Otmar Hilliges","Michael J. Black","Justus Thies"],"pdf_url":"https://arxiv.org/pdf/2312.11666v1.pdf","comment":"For more results please refer to the project page\n https://haar.is.tue.mpg.de/"},{"id":"http://arxiv.org/abs/2312.11463v1","updated":"2023-12-18T18:59:51Z","published":"2023-12-18T18:59:51Z","title":"Appearance-based Refinement for Object-Centric Motion Segmentation","summary":" The goal of this paper is to discover, segment, and track independently\nmoving objects in complex visual scenes. Previous approaches have explored the\nuse of optical flow for motion segmentation, leading to imperfect predictions\ndue to partial motion, background distraction, and object articulations and\ninteractions. To address this issue, we introduce an appearance-based\nrefinement method that leverages temporal consistency in video streams to\ncorrect inaccurate flow-based proposals. Our approach involves a simple\nselection mechanism that identifies accurate flow-predicted masks as exemplars,\nand an object-centric architecture that refines problematic masks based on\nexemplar information. The model is pre-trained on synthetic data and then\nadapted to real-world videos in a self-supervised manner, eliminating the need\nfor human annotations. Its performance is evaluated on multiple video\nsegmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59.\nWe achieve competitive performance on single-object segmentation, while\nsignificantly outperforming existing models on the more challenging problem of\nmulti-object segmentation. Finally, we investigate the benefits of using our\nmodel as a prompt for a per-frame Segment Anything Model.\n","authors":["Junyu Xie","Weidi Xie","Andrew Zisserman"],"pdf_url":"https://arxiv.org/pdf/2312.11463v1.pdf","comment":"Total 26 pages, 13 figures (including main text: 9 pages, 5 figures)"},{"id":"http://arxiv.org/abs/2312.11461v1","updated":"2023-12-18T18:59:12Z","published":"2023-12-18T18:59:12Z","title":"GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning","summary":" Gaussian splatting has emerged as a powerful 3D representation that harnesses\nthe advantages of both explicit (mesh) and implicit (NeRF) 3D representations.\nIn this paper, we seek to leverage Gaussian splatting to generate realistic\nanimatable avatars from textual descriptions, addressing the limitations (e.g.,\nflexibility and efficiency) imposed by mesh or NeRF-based representations.\nHowever, a naive application of Gaussian splatting cannot generate high-quality\nanimatable avatars and suffers from learning instability; it also cannot\ncapture fine avatar geometries and often leads to degenerate body parts. To\ntackle these problems, we first propose a primitive-based 3D Gaussian\nrepresentation where Gaussians are defined inside pose-driven primitives to\nfacilitate animation. Second, to stabilize and amortize the learning of\nmillions of Gaussians, we propose to use neural implicit fields to predict the\nGaussian attributes (e.g., colors). Finally, to capture fine avatar geometries\nand extract detailed meshes, we propose a novel SDF-based implicit mesh\nlearning approach for 3D Gaussians that regularizes the underlying geometries\nand extracts highly detailed textured meshes. Our proposed method, GAvatar,\nenables the large-scale generation of diverse animatable avatars using only\ntext prompts. GAvatar significantly surpasses existing methods in terms of both\nappearance and geometry quality, and achieves extremely fast rendering (100\nfps) at 1K resolution.\n","authors":["Ye Yuan","Xueting Li","Yangyi Huang","Shalini De Mello","Koki Nagano","Jan Kautz","Umar Iqbal"],"pdf_url":"https://arxiv.org/pdf/2312.11461v1.pdf","comment":"Project website: https://nvlabs.github.io/GAvatar"},{"id":"http://arxiv.org/abs/2312.11460v1","updated":"2023-12-18T18:59:06Z","published":"2023-12-18T18:59:06Z","title":"Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged\n Locomotion","summary":" Robust locomotion control depends on accurate state estimations. However, the\nsensors of most legged robots can only provide partial and noisy observations,\nmaking the estimation particularly challenging, especially for external states\nlike terrain frictions and elevation maps. Inspired by the classical Internal\nModel Control principle, we consider these external states as disturbances and\nintroduce Hybrid Internal Model (HIM) to estimate them according to the\nresponse of the robot. The response, which we refer to as the hybrid internal\nembedding, contains the robot's explicit velocity and implicit stability\nrepresentation, corresponding to two primary goals for locomotion tasks:\nexplicitly tracking velocity and implicitly maintaining stability. We use\ncontrastive learning to optimize the embedding to be close to the robot's\nsuccessor state, in which the response is naturally embedded. HIM has several\nappealing benefits: It only needs the robot's proprioceptions, i.e., those from\njoint encoders and IMU as observations. It innovatively maintains consistent\nobservations between simulation reference and reality that avoids information\nloss in mimicking learning. It exploits batch-level information that is more\nrobust to noises and keeps better sample efficiency. It only requires 1 hour of\ntraining on an RTX 4090 to enable a quadruped robot to traverse any terrain\nunder any disturbances. A wealth of real-world experiments demonstrates its\nagility, even in high-difficulty tasks and cases never occurred during the\ntraining process, revealing remarkable open-world generalizability.\n","authors":["Junfeng Long","Zirui Wang","Quanyi Li","Jiawei Gao","Liu Cao","Jiangmiao Pang"],"pdf_url":"https://arxiv.org/pdf/2312.11460v1.pdf","comment":"Use 1 hour to train a quadruped robot capable of traversing any\n terrain under any disturbances in the open world, Project Page:\n https://github.com/OpenRobotLab/HIMLoco"},{"id":"http://arxiv.org/abs/2312.11459v1","updated":"2023-12-18T18:59:05Z","published":"2023-12-18T18:59:05Z","title":"VolumeDiffusion: Flexible Text-to-3D Generation with Efficient\n Volumetric Encoder","summary":" This paper introduces a pioneering 3D volumetric encoder designed for\ntext-to-3D generation. To scale up the training data for the diffusion model, a\nlightweight network is developed to efficiently acquire feature volumes from\nmulti-view images. The 3D volumes are then trained on a diffusion model for\ntext-to-3D generation using a 3D U-Net. This research further addresses the\nchallenges of inaccurate object captions and high-dimensional feature volumes.\nThe proposed model, trained on the public Objaverse dataset, demonstrates\npromising outcomes in producing diverse and recognizable samples from text\nprompts. Notably, it empowers finer control over object part characteristics\nthrough textual cues, fostering model creativity by seamlessly combining\nmultiple concepts within a single object. This research significantly\ncontributes to the progress of 3D generation by introducing an efficient,\nflexible, and scalable representation methodology. Code is available at\nhttps://github.com/tzco/VolumeDiffusion.\n","authors":["Zhicong Tang","Shuyang Gu","Chunyu Wang","Ting Zhang","Jianmin Bao","Dong Chen","Baining Guo"],"pdf_url":"https://arxiv.org/pdf/2312.11459v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11458v1","updated":"2023-12-18T18:59:03Z","published":"2023-12-18T18:59:03Z","title":"GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View\n Synthesis","summary":" We propose a method for dynamic scene reconstruction using deformable 3D\nGaussians that is tailored for monocular video. Building upon the efficiency of\nGaussian splatting, our approach extends the representation to accommodate\ndynamic elements via a deformable set of Gaussians residing in a canonical\nspace, and a time-dependent deformation field defined by a multi-layer\nperceptron (MLP). Moreover, under the assumption that most natural scenes have\nlarge regions that remain static, we allow the MLP to focus its\nrepresentational power by additionally including a static Gaussian point cloud.\nThe concatenated dynamic and static point clouds form the input for the\nGaussian Splatting rasterizer, enabling real-time rendering. The differentiable\npipeline is optimized end-to-end with a self-supervised rendering loss. Our\nmethod achieves results that are comparable to state-of-the-art dynamic neural\nradiance field methods while allowing much faster optimization and rendering.\nProject website: https://lynl7130.github.io/gaufre/index.html\n","authors":["Yiqing Liang","Numair Khan","Zhengqin Li","Thu Nguyen-Phuoc","Douglas Lanman","James Tompkin","Lei Xiao"],"pdf_url":"https://arxiv.org/pdf/2312.11458v1.pdf","comment":"10 pages, 8 figures, 4 tables"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2312.10885v1","updated":"2023-12-18T02:18:33Z","published":"2023-12-18T02:18:33Z","title":"A novel diffusion recommendation algorithm based on multi-scale cnn and\n residual lstm","summary":" Sequential recommendation aims to infer user preferences from historical\ninteraction sequences and predict the next item that users may be interested in\nthe future. The current mainstream design approach is to represent items as\nfixed vectors, capturing the underlying relationships between items and user\npreferences based on the order of interactions. However, relying on a single\nfixed-item embedding may weaken the modeling capability of the system, and the\nglobal dynamics and local saliency exhibited by user preferences need to be\ndistinguished. To address these issues, this paper proposes a novel diffusion\nrecommendation algorithm based on multi-scale cnn and residual lstm (AREAL). We\nintroduce diffusion models into the recommend system, representing items as\nprobability distributions instead of fixed vectors. This approach enables\nadaptive reflection of multiple aspects of the items and generates item\ndistributions in a denoising manner. We use multi-scale cnn and residual lstm\nmethods to extract the local and global dependency features of user history\ninteractions, and use attention mechanism to distinguish weights as the guide\nfeatures of reverse diffusion recovery. The effectiveness of the proposed\nmethod is validated through experiments conducted on two real-world datasets.\nSpecifically, AREAL obtains improvements over the best baselines by 2.63% and\n4.25% in terms of HR@20 and 5.05% and 3.94% in terms of NDCG@20 on all\ndatasets.\n","authors":["Yong Niu","Xing Xing","Zhichun Jia","Ruidi Liu","Mindong Xin"],"pdf_url":"https://arxiv.org/pdf/2312.10885v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10864v1","updated":"2023-12-18T00:57:03Z","published":"2023-12-18T00:57:03Z","title":"On-Device Recommender Systems: A Tutorial on The New-Generation\n Recommendation Paradigm","summary":" Given the sheer volume of contemporary e-commerce applications, recommender\nsystems (RSs) have gained significant attention in both academia and industry.\nHowever, traditional cloud-based RSs face inevitable challenges, such as\nresource-intensive computation, reliance on network access, and privacy\nbreaches. In response, a new paradigm called on-device recommender systems\n(ODRSs) has emerged recently in various industries like Taobao, Google, and\nKuaishou. ODRSs unleash the computational capacity of user devices with\nlightweight recommendation models tailored for resource-constrained\nenvironments, enabling real-time inference with users' local data. This\ntutorial aims to systematically introduce methodologies of ODRSs, including (1)\nan overview of existing research on ODRSs; (2) a comprehensive taxonomy of\nODRSs, where the core technical content to be covered span across three major\nODRS research directions, including on-device deployment and inference,\non-device training, and privacy/security of ODRSs; (3) limitations and future\ndirections of ODRSs. This tutorial expects to lay the foundation and spark new\ninsights for follow-up research and applications concerning this new\nrecommendation paradigm.\n","authors":["Hongzhi Yin","Tong Chen","Liang Qu","Bin Cui"],"pdf_url":"https://arxiv.org/pdf/2312.10864v1.pdf","comment":"Technical tutorial; to appear at The Web Conference 2024"},{"id":"http://arxiv.org/abs/2310.19251v2","updated":"2023-12-18T00:26:07Z","published":"2023-10-30T03:37:32Z","title":"Pre-trained Recommender Systems: A Causal Debiasing Perspective","summary":" Recent studies on pre-trained vision/language models have demonstrated the\npractical benefit of a new, promising solution-building paradigm in AI where\nmodels can be pre-trained on broad data describing a generic task space and\nthen adapted successfully to solve a wide range of downstream tasks, even when\ntraining data is severely limited (e.g., in zero- or few-shot learning\nscenarios). Inspired by such progress, we investigate in this paper the\npossibilities and challenges of adapting such a paradigm to the context of\nrecommender systems, which is less investigated from the perspective of\npre-trained model. In particular, we propose to develop a generic recommender\nthat captures universal interaction patterns by training on generic user-item\ninteraction data extracted from different domains, which can then be fast\nadapted to improve few-shot learning performance in unseen new domains (with\nlimited data).\n However, unlike vision/language data which share strong conformity in the\nsemantic space, universal patterns underlying recommendation data collected\nacross different domains (e.g., different countries or different E-commerce\nplatforms) are often occluded by both in-domain and cross-domain biases\nimplicitly imposed by the cultural differences in their user and item bases, as\nwell as their uses of different e-commerce platforms. As shown in our\nexperiments, such heterogeneous biases in the data tend to hinder the\neffectiveness of the pre-trained model. To address this challenge, we further\nintroduce and formalize a causal debiasing perspective, which is substantiated\nvia a hierarchical Bayesian deep learning model, named PreRec. Our empirical\nstudies on real-world data show that the proposed model could significantly\nimprove the recommendation performance in zero- and few-shot learning settings\nunder both cross-market and cross-platform scenarios.\n","authors":["Ziqian Lin","Hao Ding","Nghia Hoang","Branislav Kveton","Anoop Deoras","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19251v2.pdf","comment":"8 pages, WSDM 24"},{"id":"http://arxiv.org/abs/2312.09602v2","updated":"2023-12-18T05:18:58Z","published":"2023-12-15T08:33:06Z","title":"Multi-Modality is All You Need for Transferable Recommender Systems","summary":" ID-based Recommender Systems (RecSys), where each item is assigned a unique\nidentifier and subsequently converted into an embedding vector, have dominated\nthe designing of RecSys. Though prevalent, such ID-based paradigm is not\nsuitable for developing transferable RecSys and is also susceptible to the\ncold-start issue. In this paper, we unleash the boundaries of the ID-based\nparadigm and propose a Pure Multi-Modality based Recommender system (PMMRec),\nwhich relies solely on the multi-modal contents of the items (e.g., texts and\nimages) and learns transition patterns general enough to transfer across\ndomains and platforms. Specifically, we design a plug-and-play framework\narchitecture consisting of multi-modal item encoders, a fusion module, and a\nuser encoder. To align the cross-modal item representations, we propose a novel\nnext-item enhanced cross-modal contrastive learning objective, which is\nequipped with both inter- and intra-modality negative samples and explicitly\nincorporates the transition patterns of user behaviors into the item encoders.\nTo ensure the robustness of user representations, we propose a novel noised\nitem detection objective and a robustness-aware contrastive learning objective,\nwhich work together to denoise user sequences in a self-supervised manner.\nPMMRec is designed to be loosely coupled, so after being pre-trained on the\nsource data, each component can be transferred alone, or in conjunction with\nother components, allowing PMMRec to achieve versatility under both\nmulti-modality and single-modality transfer learning settings. Extensive\nexperiments on 4 sources and 10 target datasets demonstrate that PMMRec\nsurpasses the state-of-the-art recommenders in both recommendation performance\nand transferability. Our code and dataset is available at:\nhttps://github.com/ICDE24/PMMRec.\n","authors":["Youhua Li","Hanwen Du","Yongxin Ni","Pengpeng Zhao","Qi Guo","Fajie Yuan","Xiaofang Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.09602v2.pdf","comment":"ICDE'24 Accepted"},{"id":"http://arxiv.org/abs/2312.06165v2","updated":"2023-12-18T15:13:23Z","published":"2023-12-11T07:10:50Z","title":"RecJPQ: Training Large-Catalogue Sequential Recommenders","summary":" Sequential Recommendation is a popular recommendation task that uses the\norder of user-item interaction to model evolving users' interests and\nsequential patterns in their behaviour. Current state-of-the-art\nTransformer-based models for sequential recommendation, such as BERT4Rec and\nSASRec, generate sequence embeddings and compute scores for catalogue items,\nbut the increasing catalogue size makes training these models costly. The Joint\nProduct Quantisation (JPQ) method, originally proposed for passage retrieval,\nmarkedly reduces the size of the retrieval index with minimal effect on model\neffectiveness, by replacing passage embeddings with a limited number of shared\nsub-embeddings. This paper introduces RecJPQ, a novel adaptation of JPQ for\nsequential recommendations, which takes the place of item embeddings tensor and\nreplaces item embeddings with a concatenation of a limited number of shared\nsub-embeddings and, therefore, limits the number of learnable model parameters.\nThe main idea of RecJPQ is to split items into sub-item entities before\ntraining the main recommendation model, which is inspired by splitting words\ninto tokens and training tokenisers in language models. We apply RecJPQ to\nSASRec, BERT4Rec, and GRU4rec models on three large-scale sequential datasets.\nOur results showed that RecJPQ could notably reduce the model size (e.g., 48%\nreduction for the Gowalla dataset with no effectiveness degradation). RecJPQ\ncan also improve model performance through a regularisation effect (e.g. +0.96%\nNDCG@10 improvement on the Booking.com dataset). Overall, RecJPQ allows the\ntraining of state-of-the-art transformer recommenders in industrial\napplications, where datasets with millions of items are common.\n","authors":["Aleksandr V. Petrov","Craig Macdonald"],"pdf_url":"https://arxiv.org/pdf/2312.06165v2.pdf","comment":"Accepted by ACM WSDM 2024"},{"id":"http://arxiv.org/abs/2312.06683v2","updated":"2023-12-18T10:53:15Z","published":"2023-12-09T08:05:20Z","title":"AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate\n Prediction","summary":" Click-through rate (CTR) prediction is a vital task in industrial\nrecommendation systems. Most existing methods focus on the network architecture\ndesign of the CTR model for better accuracy and suffer from the data sparsity\nproblem. Especially in industrial recommendation systems, the widely applied\nnegative sample down-sampling technique due to resource limitation worsens the\nproblem, resulting in a decline in performance. In this paper, we propose\n\\textbf{A}uxiliary Match \\textbf{T}asks for enhancing\n\\textbf{C}lick-\\textbf{T}hrough \\textbf{R}ate prediction accuracy (AT4CTR) by\nalleviating the data sparsity problem. Specifically, we design two match tasks\ninspired by collaborative filtering to enhance the relevance modeling between\nuser and item. As the \"click\" action is a strong signal which indicates the\nuser's preference towards the item directly, we make the first match task aim\nat pulling closer the representation between the user and the item regarding\nthe positive samples. Since the user's past click behaviors can also be treated\nas the user him/herself, we apply the next item prediction as the second match\ntask. For both the match tasks, we choose the InfoNCE as their loss function.\nThe two match tasks can provide meaningful training signals to speed up the\nmodel's convergence and alleviate the data sparsity. We conduct extensive\nexperiments on one public dataset and one large-scale industrial recommendation\ndataset. The result demonstrates the effectiveness of the proposed auxiliary\nmatch tasks. AT4CTR has been deployed in the real industrial advertising system\nand has gained remarkable revenue.\n","authors":["Qi Liu","Xuyang Hou","Defu Lian","Zhe Wang","Haoran Jin","Jia Cheng","Jun Lei"],"pdf_url":"https://arxiv.org/pdf/2312.06683v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01916v2","updated":"2023-12-18T02:55:44Z","published":"2023-12-04T14:20:16Z","title":"PEACE: Prototype lEarning Augmented transferable framework for\n Cross-domain rEcommendation","summary":" To help merchants/customers to provide/access a variety of services through\nminiapps, online service platforms have occupied a critical position in the\neffective content delivery, in which how to recommend items in the new domain\nlaunched by the service provider for customers has become more urgent. However,\nthe non-negligible gap between the source and diversified target domains poses\na considerable challenge to cross-domain recommendation systems, which often\nleads to performance bottlenecks in industrial settings. While entity graphs\nhave the potential to serve as a bridge between domains, rudimentary\nutilization still fail to distill useful knowledge and even induce the negative\ntransfer issue. To this end, we propose PEACE, a Prototype lEarning Augmented\ntransferable framework for Cross-domain rEcommendation. For domain gap\nbridging, PEACE is built upon a multi-interest and entity-oriented pre-training\narchitecture which could not only benefit the learning of generalized knowledge\nin a multi-granularity manner, but also help leverage more structural\ninformation in the entity graph. Then, we bring the prototype learning into the\npre-training over source domains, so that representations of users and items\nare greatly improved by the contrastive prototype learning module and the\nprototype enhanced attention mechanism for adaptive knowledge utilization. To\nease the pressure of online serving, PEACE is carefully deployed in a\nlightweight manner, and significant performance improvements are observed in\nboth online and offline environments.\n","authors":["Chunjing Gan","Bo Huang","Binbin Hu","Jian Ma","Ziqi Liu","Zhiqiang Zhang","Jun Zhou","Guannan Zhang","Wenliang Zhong"],"pdf_url":"https://arxiv.org/pdf/2312.01916v2.pdf","comment":"Accepted by WSDM 2024"},{"id":"http://arxiv.org/abs/2312.11703v1","updated":"2023-12-18T21:03:46Z","published":"2023-12-18T21:03:46Z","title":"Shaping Political Discourse using multi-source News Summarization","summary":" Multi-document summarization is the process of automatically generating a\nconcise summary of multiple documents related to the same topic. This summary\ncan help users quickly understand the key information from a large collection\nof documents. Multi-document summarization systems are more complex than\nsingle-document summarization systems due to the need to identify and combine\ninformation from multiple sources. In this paper, we have developed a machine\nlearning model that generates a concise summary of a topic from multiple news\ndocuments. The model is designed to be unbiased by sampling its input equally\nfrom all the different aspects of the topic, even if the majority of the news\nsources lean one way.\n","authors":["Charles Rajan","Nishit Asnani","Shreya Singh"],"pdf_url":"https://arxiv.org/pdf/2312.11703v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11361v1","updated":"2023-12-18T17:18:04Z","published":"2023-12-18T17:18:04Z","title":"NoMIRACL: Knowing When You Don't Know for Robust Multilingual\n Retrieval-Augmented Generation","summary":" Retrieval-augmented generation (RAG) grounds large language model (LLM)\noutput by leveraging external knowledge sources to reduce factual\nhallucinations. However, prior works lack a comprehensive evaluation of\ndifferent language families, making it challenging to evaluate LLM robustness\nagainst errors in external retrieved knowledge. To overcome this, we establish\nNoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across\n18 typologically diverse languages. NoMIRACL includes both a non-relevant and a\nrelevant subset. Queries in the non-relevant subset contain passages manually\njudged as non-relevant or noisy, whereas queries in the relevant subset include\nat least a single judged relevant passage. We measure LLM robustness using two\nmetrics: (i) hallucination rate, measuring model tendency to hallucinate an\nanswer, when the answer is not present in passages in the non-relevant subset,\nand (ii) error rate, measuring model inaccuracy to recognize relevant passages\nin the relevant subset. We build a GPT-4 baseline which achieves a 33.2%\nhallucination rate on the non-relevant and a 14.9% error rate on the relevant\nsubset on average. Our evaluation reveals that GPT-4 hallucinates frequently in\nhigh-resource languages, such as French or English. This work highlights an\nimportant avenue for future research to improve LLM robustness to learn how to\nbetter reject non-relevant information in RAG.\n","authors":["Nandan Thakur","Luiz Bonifacio","Xinyu Zhang","Odunayo Ogundepo","Ehsan Kamalloo","David Alfonso-Hermelo","Xiaoguang Li","Qun Liu","Boxing Chen","Mehdi Rezagholizadeh","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11361v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11356v1","updated":"2023-12-18T17:12:35Z","published":"2023-12-18T17:12:35Z","title":"The Problem of Coherence in Natural Language Explanations of\n Recommendations","summary":" Providing natural language explanations for recommendations is particularly\nuseful from the perspective of a non-expert user. Although several methods for\nproviding such explanations have recently been proposed, we argue that an\nimportant aspect of explanation quality has been overlooked in their\nexperimental evaluation. Specifically, the coherence between generated text and\npredicted rating, which is a necessary condition for an explanation to be\nuseful, is not properly captured by currently used evaluation measures. In this\npaper, we highlight the issue of explanation and prediction coherence by 1)\npresenting results from a manual verification of explanations generated by one\nof the state-of-the-art approaches 2) proposing a method of automatic coherence\nevaluation 3) introducing a new transformer-based method that aims to produce\nmore coherent explanations than the state-of-the-art approaches 4) performing\nan experimental evaluation which demonstrates that this method significantly\nimproves the explanation coherence without affecting the other aspects of\nrecommendation performance.\n","authors":["Jakub Raczyński","Mateusz Lango","Jerzy Stefanowski"],"pdf_url":"https://arxiv.org/pdf/2312.11356v1.pdf","comment":"ECAI 2023"},{"id":"http://arxiv.org/abs/2312.11336v1","updated":"2023-12-18T16:41:22Z","published":"2023-12-18T16:41:22Z","title":"DRDT: Dynamic Reflection with Divergent Thinking for LLM-based\n Sequential Recommendation","summary":" The rise of Large Language Models (LLMs) has sparked interest in their\napplication to sequential recommendation tasks as they can provide supportive\nitem information. However, due to the inherent complexities of sequential\nrecommendation, such as sequential patterns across datasets, noise within\nsequences, and the temporal evolution of user preferences, existing LLM\nreasoning strategies, such as in-context learning and chain-of-thought are not\nfully effective. To address these challenges, we introduce a novel reasoning\nprinciple: Dynamic Reflection with Divergent Thinking within a\nretriever-reranker framework. Our approach starts with a collaborative\nin-context demonstration retriever, which collects sequences exhibiting\ncollaborative behaviors as in-context examples. Following this, we abstract\nhigh-level user preferences across multiple aspects, providing a more nuanced\nunderstanding of user interests and circumventing the noise within the raw\nsequences. The cornerstone of our methodology is dynamic reflection, a process\nthat emulates human learning through probing, critiquing, and reflecting, using\nuser feedback to tailor the analysis more effectively to the target user in a\ntemporal manner. We evaluate our approach on three datasets using six\npre-trained LLMs. The superior performance observed across these models\ndemonstrates the efficacy of our reasoning strategy, notably achieved without\nthe need to fine-tune the LLMs. With our principle, we managed to outperform\nGPT-Turbo-3.5 on three datasets using 7b models e.g., Vicuna-7b and Openchat-7b\non NDCG@10. This research not only highlights the potential of LLMs in\nenhancing sequential recommendation systems but also underscores the importance\nof developing tailored reasoning strategies to fully harness their\ncapabilities.\n","authors":["Yu Wang","Zhiwei Liu","Jianguo Zhang","Weiran Yao","Shelby Heinecke","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2312.11336v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.10405v7","updated":"2023-12-18T16:09:49Z","published":"2023-01-25T04:45:06Z","title":"Editing Language Model-based Knowledge Graph Embeddings","summary":" Recently decades have witnessed the empirical success of framing Knowledge\nGraph (KG) embeddings via language models. However, language model-based KG\nembeddings are usually deployed as static artifacts, making them difficult to\nmodify post-deployment without re-training after deployment. To address this\nissue, we propose a new task of editing language model-based KG embeddings in\nthis paper. This task is designed to facilitate rapid, data-efficient updates\nto KG embeddings without compromising the performance of other aspects. We\nbuild four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and\nevaluate several knowledge editing baselines demonstrating the limited ability\nof previous models to handle the proposed challenging task. We further propose\na simple yet strong baseline dubbed KGEditor, which utilizes additional\nparametric layers of the hypernetwork to edit/add facts. Our comprehensive\nexperimental results reveal that KGEditor excels in updating specific facts\nwithout impacting the overall performance, even when faced with limited\ntraining resources. Code and datasets are available in\nhttps://github.com/zjunlp/PromptKG/tree/main/deltaKG.\n","authors":["Siyuan Cheng","Bozhong Tian","Xi Chen","Ningyu Zhang","Qingbing Liu","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2301.10405v7.pdf","comment":"AAAI 2024. The project website is\n https://zjunlp.github.io/project/KGE_Editing/"},{"id":"http://arxiv.org/abs/2210.05662v2","updated":"2023-12-18T14:13:03Z","published":"2022-10-11T17:56:55Z","title":"Understanding or Manipulation: Rethinking Online Performance Gains of\n Modern Recommender Systems","summary":" Recommender systems are expected to be assistants that help human users find\nrelevant information automatically without explicit queries. As recommender\nsystems evolve, increasingly sophisticated learning techniques are applied and\nhave achieved better performance in terms of user engagement metrics such as\nclicks and browsing time. The increase in the measured performance, however,\ncan have two possible attributions: a better understanding of user preferences,\nand a more proactive ability to utilize human bounded rationality to seduce\nuser over-consumption. A natural following question is whether current\nrecommendation algorithms are manipulating user preferences. If so, can we\nmeasure the manipulation level? In this paper, we present a general framework\nfor benchmarking the degree of manipulations of recommendation algorithms, in\nboth slate recommendation and sequential recommendation scenarios. The\nframework consists of four stages, initial preference calculation, training\ndata collection, algorithm training and interaction, and metrics calculation\nthat involves two proposed metrics. We benchmark some representative\nrecommendation algorithms in both synthetic and real-world datasets under the\nproposed framework. We have observed that a high online click-through rate does\nnot necessarily mean a better understanding of user initial preference, but\nends in prompting users to choose more documents they initially did not favor.\nMoreover, we find that the training data have notable impacts on the\nmanipulation degrees, and algorithms with more powerful modeling abilities are\nmore sensitive to such impacts. The experiments also verified the usefulness of\nthe proposed metrics for measuring the degree of manipulations. We advocate\nthat future recommendation algorithm studies should be treated as an\noptimization problem with constrained user preference manipulations.\n","authors":["Zhengbang Zhu","Rongjun Qin","Junjie Huang","Xinyi Dai","Yang Yu","Yong Yu","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2210.05662v2.pdf","comment":"33 pages, 11 figures, 4 tables, ACM Transactions on Information\n Systems"},{"id":"http://arxiv.org/abs/2306.02841v4","updated":"2023-12-18T12:06:56Z","published":"2023-06-05T12:46:40Z","title":"CTRL: Connect Collaborative and Language Model for CTR Prediction","summary":" Traditional click-through rate (CTR) prediction models convert the tabular\ndata into one-hot vectors and leverage the collaborative relations among\nfeatures for inferring the user's preference over items. This modeling paradigm\ndiscards essential semantic information. Though some works like P5 and CTR-BERT\nhave explored the potential of using Pre-trained Language Models (PLMs) to\nextract semantic signals for CTR prediction, they are computationally expensive\nand suffer from low efficiency. Besides, the beneficial collaborative relations\nare not considered, hindering the recommendation performance. To solve these\nproblems, in this paper, we propose a novel framework \\textbf{CTRL}, which is\nindustrial-friendly and model-agnostic with superior inference efficiency.\nSpecifically, the original tabular data is first converted into textual data.\nBoth tabular data and converted textual data are regarded as two different\nmodalities and are separately fed into the collaborative CTR model and\npre-trained language model. A cross-modal knowledge alignment procedure is\nperformed to fine-grained align and integrate the collaborative and semantic\nsignals, and the lightweight collaborative model can be deployed online for\nefficient serving after fine-tuned with supervised signals. Experimental\nresults on three public datasets show that CTRL outperforms the\nstate-of-the-art (SOTA) CTR models significantly. Moreover, we further verify\nits effectiveness on a large-scale industrial recommender system.\n","authors":["Xiangyang Li","Bo Chen","Lu Hou","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2306.02841v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11036v1","updated":"2023-12-18T09:13:41Z","published":"2023-12-18T09:13:41Z","title":"UniGen: A Unified Generative Framework for Retrieval and Question\n Answering with Large Language Models","summary":" Generative information retrieval, encompassing two major tasks of Generative\nDocument Retrieval (GDR) and Grounded Answer Generation (GAR), has gained\nsignificant attention in the area of information retrieval and natural language\nprocessing. Existing methods for GDR and GAR rely on separate retrieval and\nreader modules, which hinder simultaneous optimization. To overcome this, we\npresent \\textbf{UniGen}, a \\textbf{Uni}fied \\textbf{Gen}erative framework for\nretrieval and question answering that integrates both tasks into a single\ngenerative model leveraging the capabilities of large language models. UniGen\nemploys a shared encoder and two distinct decoders for generative retrieval and\nquestion answering. To facilitate the learning of both tasks, we introduce\nconnectors, generated by large language models, to bridge the gaps between\nquery inputs and generation targets, as well as between document identifiers\nand answers. Furthermore, we propose an iterative enhancement strategy that\nleverages generated answers and retrieved documents to iteratively improve both\ntasks. Through extensive experiments on the MS MARCO and NQ datasets, we\ndemonstrate the effectiveness of UniGen, showcasing its superior performance in\nboth the retrieval and the question answering tasks.\n","authors":["Xiaoxi Li","Yujia Zhou","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2312.11036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11018v1","updated":"2023-12-18T08:35:10Z","published":"2023-12-18T08:35:10Z","title":"Hypergrah-Enhanced Dual Convolutional Network for Bundle Recommendation","summary":" Bundle recommendations strive to offer users a set of items as a package\nnamed bundle, enhancing convenience and contributing to the seller's revenue.\nWhile previous approaches have demonstrated notable performance, we argue that\nthey may compromise the ternary relationship among users, items, and bundles.\nThis compromise can result in information loss, ultimately impacting the\noverall model performance. To address this gap, we develop a unified model for\nbundle recommendation, termed hypergraph-enhanced dual convolutional neural\nnetwork (HED). Our approach is characterized by two key aspects. Firstly, we\nconstruct a complete hypergraph to capture interaction dynamics among users,\nitems, and bundles. Secondly, we incorporate U-B interaction information to\nenhance the information representation derived from users and bundle embedding\nvectors. Extensive experimental results on the Youshu and Netease datasets have\ndemonstrated that HED surpasses state-of-the-art baselines, proving its\neffectiveness. In addition, various ablation studies and sensitivity analyses\nrevealed the working mechanism and proved our effectiveness. Codes and datasets\nare available at https://github.com/AAI-Lab/HED\n","authors":["Kangbo Liu","Yang Li","Yaoxin Wu","Zhaoxuan Wang","Xiaoxu Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11018v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10968v1","updated":"2023-12-18T06:45:31Z","published":"2023-12-18T06:45:31Z","title":"PARs: Predicate-based Association Rules for Efficient and Accurate\n Model-Agnostic Anomaly Explanation","summary":" While new and effective methods for anomaly detection are frequently\nintroduced, many studies prioritize the detection task without considering the\nneed for explainability. Yet, in real-world applications, anomaly explanation,\nwhich aims to provide explanation of why specific data instances are identified\nas anomalies, is an equally important task. In this work, we present a novel\napproach for efficient and accurate model-agnostic anomaly explanation for\ntabular data using Predicate-based Association Rules (PARs). PARs can provide\nintuitive explanations not only about which features of the anomaly instance\nare abnormal, but also the reasons behind their abnormality. Our user study\nindicates that the anomaly explanation form of PARs is better comprehended and\npreferred by regular users of anomaly detection systems as compared to existing\nmodel-agnostic explanation options. Furthermore, we conduct extensive\nexperiments on various benchmark datasets, demonstrating that PARs compare\nfavorably to state-of-the-art model-agnostic methods in terms of computing\nefficiency and explanation accuracy on anomaly explanation tasks. The code for\nPARs tool is available at https://github.com/NSIBF/PARs-EXAD.\n","authors":["Cheng Feng"],"pdf_url":"https://arxiv.org/pdf/2312.10968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04923v2","updated":"2023-12-18T06:43:02Z","published":"2023-07-10T22:14:56Z","title":"Ranking with Long-Term Constraints","summary":" The feedback that users provide through their choices (e.g., clicks,\npurchases) is one of the most common types of data readily available for\ntraining search and recommendation algorithms. However, myopically training\nsystems based on choice data may only improve short-term engagement, but not\nthe long-term sustainability of the platform and the long-term benefits to its\nusers, content providers, and other stakeholders. In this paper, we thus\ndevelop a new framework in which decision makers (e.g., platform operators,\nregulators, users) can express long-term goals for the behavior of the platform\n(e.g., fairness, revenue distribution, legal requirements). These goals take\nthe form of exposure or impact targets that go well beyond individual sessions,\nand we provide new control-based algorithms to achieve these goals. In\nparticular, the controllers are designed to achieve the stated long-term goals\nwith minimum impact on short-term engagement. Beyond the principled theoretical\nderivation of the controllers, we evaluate the algorithms on both synthetic and\nreal-world data. While all controllers perform well, we find that they provide\ninteresting trade-offs in efficiency, robustness, and the ability to plan\nahead.\n","authors":["Kianté Brantley","Zhichong Fang","Sarah Dean","Thorsten Joachims"],"pdf_url":"https://arxiv.org/pdf/2307.04923v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10967v1","updated":"2023-12-18T06:41:23Z","published":"2023-12-18T06:41:23Z","title":"Knowledge Graphs and Pre-trained Language Models enhanced Representation\n Learning for Conversational Recommender Systems","summary":" Conversational recommender systems (CRS) utilize natural language\ninteractions and dialogue history to infer user preferences and provide\naccurate recommendations. Due to the limited conversation context and\nbackground knowledge, existing CRSs rely on external sources such as knowledge\ngraphs to enrich the context and model entities based on their inter-relations.\nHowever, these methods ignore the rich intrinsic information within entities.\nTo address this, we introduce the Knowledge-Enhanced Entity Representation\nLearning (KERL) framework, which leverages both the knowledge graph and a\npre-trained language model to improve the semantic understanding of entities\nfor CRS. In our KERL framework, entity textual descriptions are encoded via a\npre-trained language model, while a knowledge graph helps reinforce the\nrepresentation of these entities. We also employ positional encoding to\neffectively capture the temporal information of entities in a conversation. The\nenhanced entity representation is then used to develop a recommender component\nthat fuses both entity and contextual representations for more informed\nrecommendations, as well as a dialogue component that generates informative\nentity-related information in the response text. A high-quality knowledge graph\nwith aligned entity descriptions is constructed to facilitate our study, namely\nthe Wiki Movie Knowledge Graph (WikiMKG). The experimental results show that\nKERL achieves state-of-the-art results in both recommendation and response\ngeneration tasks.\n","authors":["Zhangchi Qiu","Ye Tao","Shirui Pan","Alan Wee-Chung Liew"],"pdf_url":"https://arxiv.org/pdf/2312.10967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.00229v3","updated":"2023-12-18T06:33:57Z","published":"2022-12-01T02:26:52Z","title":"NIR-Prompt: A Multi-task Generalized Neural Information Retrieval\n Training Framework","summary":" Information retrieval aims to find information that meets users' needs from\nthe corpus. Different needs correspond to different IR tasks such as document\nretrieval, open-domain question answering, retrieval-based dialogue, etc.,\nwhile they share the same schema to estimate the relationship between texts. It\nindicates that a good IR model can generalize to different tasks and domains.\nHowever, previous studies indicate that state-of-the-art neural information\nretrieval (NIR) models, e.g, pre-trained language models (PLMs) are hard to\ngeneralize. Mainly because the end-to-end fine-tuning paradigm makes the model\noveremphasize task-specific signals and domain biases but loses the ability to\ncapture generalized essential signals. To address this problem, we propose a\nnovel NIR training framework named NIR-Prompt for retrieval and reranking\nstages based on the idea of decoupling signal capturing and combination.\nNIR-Prompt exploits Essential Matching Module (EMM) to capture the essential\nmatching signals and gets the description of tasks by Matching Description\nModule (MDM). The description is used as task-adaptation information to combine\nthe essential matching signals to adapt to different tasks. Experiments under\nin-domain multi-task, out-of-domain multi-task, and new task adaptation\nsettings show that NIR-Prompt can improve the generalization of PLMs in NIR for\nboth retrieval and reranking stages compared with baselines.\n","authors":["Shicheng Xu","Liang Pang","Huawei Shen","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2212.00229v3.pdf","comment":"This article is the extension of arXiv:2204.02725 and accepted by\n TOIS"},{"id":"http://arxiv.org/abs/2312.10947v1","updated":"2023-12-18T05:53:44Z","published":"2023-12-18T05:53:44Z","title":"LabelCraft: Empowering Short Video Recommendations with Automated Label\n Crafting","summary":" Short video recommendations often face limitations due to the quality of user\nfeedback, which may not accurately depict user interests. To tackle this\nchallenge, a new task has emerged: generating more dependable labels from\noriginal feedback. Existing label generation methods rely on manual rules,\ndemanding substantial human effort and potentially misaligning with the desired\nobjectives of the platform. To transcend these constraints, we introduce\nLabelCraft, a novel automated label generation method explicitly optimizing\npivotal operational metrics for platform success. By formulating label\ngeneration as a higher-level optimization problem above recommender model\noptimization, LabelCraft introduces a trainable labeling model for automatic\nlabel mechanism modeling. Through meta-learning techniques, LabelCraft\neffectively addresses the bi-level optimization hurdle posed by the recommender\nand labeling models, enabling the automatic acquisition of intricate label\ngeneration mechanisms.Extensive experiments on real-world datasets corroborate\nLabelCraft's excellence across varied operational metrics, encompassing usage\ntime, user engagement, and retention. Codes are available at\nhttps://github.com/baiyimeng/LabelCraft.\n","authors":["Yimeng Bai","Yang Zhang","Jing Lu","Jianxin Chang","Xiaoxue Zang","Yanan Niu","Yang Song","Fuli Feng"],"pdf_url":"https://arxiv.org/pdf/2312.10947v1.pdf","comment":"Accepted by WSDM'24"},{"id":"http://arxiv.org/abs/2312.11569v1","updated":"2023-12-18T04:37:45Z","published":"2023-12-18T04:37:45Z","title":"Application of AI in Nutrition","summary":" In healthcare, artificial intelligence (AI) has been changing the way doctors\nand health experts take care of people. This paper will cover how AI is making\nmajor changes in the health care system, especially with nutrition. Various\nmachine learning and deep learning algorithms have been developed to extract\nvaluable information from healthcare data which help doctors, nutritionists,\nand health experts to make better decisions and make our lifestyle healthy.\nThis paper provides an overview of the current state of AI applications in\nhealthcare with a focus on the utilization of AI-driven recommender systems in\nnutrition. It will discuss the positive outcomes and challenges that arise when\nAI is used in this field. This paper addresses the challenges to develop AI\nrecommender systems in healthcare, providing a well-rounded perspective on the\ncomplexities. Real-world examples and research findings are presented to\nunderscore the tangible and significant impact AI recommender systems have in\nthe field of healthcare, particularly in nutrition. The ongoing efforts of\napplying AI in nutrition lay the groundwork for a future where personalized\nrecommendations play a pivotal role in guiding individuals toward healthier\nlifestyles.\n","authors":["Ritu Ramakrishnan","Tianxiang Xing","Tianfeng Chen","Ming-Hao Lee","Jinzhu Gao"],"pdf_url":"https://arxiv.org/pdf/2312.11569v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.00899v2","updated":"2023-12-18T01:10:59Z","published":"2023-06-01T16:56:04Z","title":"Pitfalls in Link Prediction with Graph Neural Networks: Understanding\n the Impact of Target-link Inclusion & Better Practices","summary":" While Graph Neural Networks (GNNs) are remarkably successful in a variety of\nhigh-impact applications, we demonstrate that, in link prediction, the common\npractices of including the edges being predicted in the graph at training\nand/or test have outsized impact on the performance of low-degree nodes. We\ntheoretically and empirically investigate how these practices impact node-level\nperformance across different degrees. Specifically, we explore three issues\nthat arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test\nleakage. The former two issues lead to poor generalizability to the test data,\nwhile the latter leads to overestimation of the model's performance and\ndirectly impacts the deployment of GNNs. To address these issues in a\nsystematic way, we introduce an effective and efficient GNN training framework,\nSpotTarget, which leverages our insight on low-degree nodes: (1) at training\ntime, it excludes a (training) edge to be predicted if it is incident to at\nleast one low-degree node; and (2) at test time, it excludes all test edges to\nbe predicted (thus, mimicking real scenarios of using GNNs, where the test data\nis not included in the graph). SpotTarget helps researchers and practitioners\nadhere to best practices for learning from graph data, which are frequently\noverlooked even by the most widely-used frameworks. Our experiments on various\nreal-world datasets show that SpotTarget makes GNNs up to 15x more accurate in\nsparse graphs, and significantly improves their performance for low-degree\nnodes in dense graphs.\n","authors":["Jing Zhu","Yuhang Zhou","Vassilis N. Ioannidis","Shengyi Qian","Wei Ai","Xiang Song","Danai Koutra"],"pdf_url":"https://arxiv.org/pdf/2306.00899v2.pdf","comment":"Extended Version of our WSDM'24 paper. 8 pages, 2 page appendix"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2312.10884v1","updated":"2023-12-18T02:15:40Z","published":"2023-12-18T02:15:40Z","title":"Contextual Reinforcement Learning for Offshore Wind Farm Bidding","summary":" We propose a framework for applying reinforcement learning to contextual\ntwo-stage stochastic optimization and apply this framework to the problem of\nenergy market bidding of an off-shore wind farm. Reinforcement learning could\npotentially be used to learn close to optimal solutions for first stage\nvariables of a two-stage stochastic program under different contexts. Under the\nproposed framework, these solutions would be learned without having to solve\nthe full two-stage stochastic program. We present initial results of training\nusing the DDPG algorithm and present intended future steps to improve\nperformance.\n","authors":["David Cole","Himanshu Sharma","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2312.10884v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10879v1","updated":"2023-12-18T01:52:59Z","published":"2023-12-18T01:52:59Z","title":"Development and Evaluation of Ensemble Learning-based Environmental\n Methane Detection and Intensity Prediction Models","summary":" The environmental impacts of global warming driven by methane (CH4) emissions\nhave catalyzed significant research initiatives in developing novel\ntechnologies that enable proactive and rapid detection of CH4. Several\ndata-driven machine learning (ML) models were tested to determine how well they\nidentified fugitive CH4 and its related intensity in the affected areas.\nVarious meteorological characteristics, including wind speed, temperature,\npressure, relative humidity, water vapor, and heat flux, were included in the\nsimulation. We used the ensemble learning method to determine the\nbest-performing weighted ensemble ML models built upon several weaker\nlower-layer ML models to (i) detect the presence of CH4 as a classification\nproblem and (ii) predict the intensity of CH4 as a regression problem.\n","authors":["Reek Majumder","Jacquan Pollard","M Sabbir Salek","David Werth","Gurcan Comert","Adrian Gale","Sakib Mahmud Khan","Samuel Darko","Mashrur Chowdhury"],"pdf_url":"https://arxiv.org/pdf/2312.10879v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09323v2","updated":"2023-12-18T01:49:05Z","published":"2023-12-07T19:58:37Z","title":"Perspectives on the State and Future of Deep Learning -- 2023","summary":" The goal of this series is to chronicle opinions and issues in the field of\nmachine learning as they stand today and as they change over time. The plan is\nto host this survey periodically until the AI singularity\npaperclip-frenzy-driven doomsday, keeping an updated list of topical questions\nand interviewing new community members for each edition. In this issue, we\nprobed people's opinions on interpretable AI, the value of benchmarking in\nmodern NLP, the state of progress towards understanding deep learning, and the\nfuture of academia.\n","authors":["Micah Goldblum","Anima Anandkumar","Richard Baraniuk","Tom Goldstein","Kyunghyun Cho","Zachary C Lipton","Melanie Mitchell","Preetum Nakkiran","Max Welling","Andrew Gordon Wilson"],"pdf_url":"https://arxiv.org/pdf/2312.09323v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03360v2","updated":"2023-12-18T01:43:56Z","published":"2023-12-06T08:55:55Z","title":"Teaching Specific Scientific Knowledge into Large Language Models\n through Additional Training","summary":" Through additional training, we explore embedding specialized scientific\nknowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that\neffective knowledge integration requires reading texts from multiple\nperspectives, especially in instructional formats. We utilize text augmentation\nto tackle the scarcity of specialized texts, including style conversions and\ntranslations. Hyperparameter optimization proves crucial, with different size\nmodels (7b, 13b, and 70b) reasonably undergoing additional training. Validating\nour methods, we construct a dataset of 65,000 scientific papers. Although we\nhave succeeded in partially embedding knowledge, the study highlights the\ncomplexities and limitations of incorporating specialized information into\nLLMs, suggesting areas for further improvement.\n","authors":["Kan Hatakeyama-Sato","Yasuhiko Igarashi","Shun Katakami","Yuta Nabae","Teruaki Hayakawa"],"pdf_url":"https://arxiv.org/pdf/2312.03360v2.pdf","comment":"added token information for some texts, and fixed typo"},{"id":"http://arxiv.org/abs/2310.17658v3","updated":"2023-12-18T01:42:26Z","published":"2023-10-18T15:24:34Z","title":"Is Channel Independent strategy optimal for Time Series Forecasting?","summary":" There has been an emergence of various models for long-term time series\nforecasting. Recent studies have demonstrated that a single linear layer, using\nChannel Dependent (CD) or Channel Independent (CI) modeling, can even\noutperform a large number of sophisticated models. However, current research\nprimarily considers CD and CI as two complementary yet mutually exclusive\napproaches, unable to harness these two extremes simultaneously. And it is also\na challenging issue that both CD and CI are static strategies that cannot be\ndetermined to be optimal for a specific dataset without extensive experiments.\nIn this paper, we reconsider whether the current CI strategy is the best\nsolution for time series forecasting. First, we propose a simple yet effective\nstrategy called CSC, which stands for $\\mathbf{C}$hannel\n$\\mathbf{S}$elf-$\\mathbf{C}$lustering strategy, for linear models. Our Channel\nSelf-Clustering (CSC) enhances CI strategy's performance improvements while\nreducing parameter size, for exmpale by over 10 times on electricity dataset,\nand significantly cutting training time. Second, we further propose Channel\nRearrangement (CR), a method for deep models inspired by the self-clustering.\nCR attains competitive performance against baselines. Finally, we also discuss\nwhether it is best to forecast the future values using the historical values of\nthe same channel as inputs. We hope our findings and methods could inspire new\nsolutions beyond CD/CI.\n","authors":["Yuan Peiwen","Zhu Changsheng"],"pdf_url":"https://arxiv.org/pdf/2310.17658v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05429v2","updated":"2023-12-18T01:29:42Z","published":"2023-12-09T01:26:22Z","title":"Mitigating Nonlinear Algorithmic Bias in Binary Classification","summary":" This paper proposes the use of causal modeling to detect and mitigate\nalgorithmic bias that is nonlinear in the protected attribute. We provide a\ngeneral overview of our approach. We use the German Credit data set, which is\navailable for download from the UC Irvine Machine Learning Repository, to\ndevelop (1) a prediction model, which is treated as a black box, and (2) a\ncausal model for bias mitigation. In this paper, we focus on age bias and the\nproblem of binary classification. We show that the probability of getting\ncorrectly classified as \"low risk\" is lowest among young people. The\nprobability increases with age nonlinearly. To incorporate the nonlinearity\ninto the causal model, we introduce a higher order polynomial term. Based on\nthe fitted causal model, the de-biased probability estimates are computed,\nshowing improved fairness with little impact on overall classification\naccuracy. Causal modeling is intuitive and, hence, its use can enhance\nexplicability and promotes trust among different stakeholders of AI.\n","authors":["Wendy Hui","Wai Kwong Lau"],"pdf_url":"https://arxiv.org/pdf/2312.05429v2.pdf","comment":"5 pages, 3 figures, 12 tables. arXiv admin note: text overlap with\n arXiv:2310.12421"},{"id":"http://arxiv.org/abs/2312.10858v1","updated":"2023-12-18T00:21:47Z","published":"2023-12-18T00:21:47Z","title":"Variable Importance in High-Dimensional Settings Requires Grouping","summary":" Explaining the decision process of machine learning algorithms is nowadays\ncrucial for both model's performance enhancement and human comprehension. This\ncan be achieved by assessing the variable importance of single variables, even\nfor high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While\nonly removal-based approaches, such as Permutation Importance (PI), can bring\nstatistical validity, they return misleading results when variables are\ncorrelated. Conditional Permutation Importance (CPI) bypasses PI's limitations\nin such cases. However, in high-dimensional settings, where high correlations\nbetween the variables cancel their conditional importance, the use of CPI as\nwell as other methods leads to unreliable results, besides prohibitive\ncomputation costs. Grouping variables statistically via clustering or some\nprior knowledge gains some power back and leads to better interpretations. In\nthis work, we introduce BCPI (Block-Based Conditional Permutation Importance),\na new generic framework for variable importance computation with statistical\nguarantees handling both single and group cases. Furthermore, as handling\ngroups with high cardinality (such as a set of observations of a given\nmodality) are both time-consuming and resource-intensive, we also introduce a\nnew stacking approach extending the DNN architecture with sub-linear layers\nadapted to the group structure. We show that the ensuing approach extended with\nstacking controls the type-I error even with highly-correlated groups and shows\ntop accuracy across benchmarks. Furthermore, we perform a real-world data\nanalysis in a large-scale medical dataset where we aim to show the consistency\nbetween our results and the literature for a biomarker prediction.\n","authors":["Ahmad Chamma","Bertrand Thirion","Denis A. Engemann"],"pdf_url":"https://arxiv.org/pdf/2312.10858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10854v1","updated":"2023-12-18T00:05:28Z","published":"2023-12-18T00:05:28Z","title":"The Right Losses for the Right Gains: Improving the Semantic Consistency\n of Deep Text-to-Image Generation with Distribution-Sensitive Losses","summary":" One of the major challenges in training deep neural networks for\ntext-to-image generation is the significant linguistic discrepancy between\nground-truth captions of each image in most popular datasets. The large\ndifference in the choice of words in such captions results in synthesizing\nimages that are semantically dissimilar to each other and to their ground-truth\ncounterparts. Moreover, existing models either fail to generate the\nfine-grained details of the image or require a huge number of parameters that\nrenders them inefficient for text-to-image synthesis. To fill this gap in the\nliterature, we propose using the contrastive learning approach with a novel\ncombination of two loss functions: fake-to-fake loss to increase the semantic\nconsistency between generated images of the same caption, and fake-to-real loss\nto reduce the gap between the distributions of real images and fake ones. We\ntest this approach on two baseline models: SSAGAN and AttnGAN (with style\nblocks to enhance the fine-grained details of the images.) Results show that\nour approach improves the qualitative results on AttnGAN with style blocks on\nthe CUB dataset. Additionally, on the challenging COCO dataset, our approach\nachieves competitive results against the state-of-the-art Lafite model,\noutperforms the FID score of SSAGAN model by 44.\n","authors":["Mahmoud Ahmed","Omer Moussa","Ismail Shaheen","Mohamed Abdelfattah","Amr Abdalla","Marwan Eid","Hesham Eraqi","Mohamed Moustafa"],"pdf_url":"https://arxiv.org/pdf/2312.10854v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2312.00347v2","updated":"2023-12-18T04:59:01Z","published":"2023-12-01T04:51:01Z","title":"RTQ: Rethinking Video-language Understanding Based on Image-text Model","summary":" Recent advancements in video-language understanding have been established on\nthe foundation of image-text models, resulting in promising outcomes due to the\nshared knowledge between images and videos. However, video-language\nunderstanding presents unique challenges due to the inclusion of highly complex\nsemantic details, which result in information redundancy, temporal dependency,\nand scene complexity. Current techniques have only partially tackled these\nissues, and our quantitative analysis indicates that some of these methods are\ncomplementary. In light of this, we propose a novel framework called RTQ\n(Refine, Temporal model, and Query), which addresses these challenges\nsimultaneously. The approach involves refining redundant information within\nframes, modeling temporal relations among frames, and querying task-specific\ninformation from the videos. Remarkably, our model demonstrates outstanding\nperformance even in the absence of video-language pre-training, and the results\nare comparable with or superior to those achieved by state-of-the-art\npre-training methods. Code is available at\nhttps://github.com/SCZwangxiao/RTQ-MM2023.\n","authors":["Xiao Wang","Yaoyu Li","Tian Gan","Zheng Zhang","Jingjing Lv","Liqiang Nie"],"pdf_url":"https://arxiv.org/pdf/2312.00347v2.pdf","comment":"Accepted by ACM MM 2023 as Oral representation"},{"id":"http://arxiv.org/abs/2310.17796v3","updated":"2023-12-18T15:09:20Z","published":"2023-10-26T21:57:21Z","title":"ControlLLM: Augment Language Models with Tools by Searching on Graphs","summary":" We present ControlLLM, a novel framework that enables large language models\n(LLMs) to utilize multi-modal tools for solving complex real-world tasks.\nDespite the remarkable performance of LLMs, they still struggle with tool\ninvocation due to ambiguous user prompts, inaccurate tool selection and\nparameterization, and inefficient tool scheduling. To overcome these\nchallenges, our framework comprises three key components: (1) a \\textit{task\ndecomposer} that breaks down a complex task into clear subtasks with\nwell-defined inputs and outputs; (2) a \\textit{Thoughts-on-Graph (ToG)\nparadigm} that searches the optimal solution path on a pre-built tool graph,\nwhich specifies the parameter and dependency relations among different tools;\nand (3) an \\textit{execution engine with a rich toolbox} that interprets the\nsolution path and runs the tools efficiently on different computational\ndevices. We evaluate our framework on diverse tasks involving image, audio, and\nvideo processing, demonstrating its superior accuracy, efficiency, and\nversatility compared to existing methods. The code is at\nhttps://github.com/OpenGVLab/ControlLLM.\n","authors":["Zhaoyang Liu","Zeqiang Lai","Zhangwei Gao","Erfei Cui","Ziheng Li","Xizhou Zhu","Lewei Lu","Qifeng Chen","Yu Qiao","Jifeng Dai","Wenhai Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17796v3.pdf","comment":"24 pages, 9 figures, 12 tables"},{"id":"http://arxiv.org/abs/2312.11576v1","updated":"2023-12-18T09:24:35Z","published":"2023-12-18T09:24:35Z","title":"Emotion Based Prediction in the Context of Optimized Trajectory Planning\n for Immersive Learning","summary":" In the virtual elements of immersive learning, the use of Google Expedition\nand touch-screen-based emotion are examined. The objective is to investigate\npossible ways to combine these technologies to enhance virtual learning\nenvironments and learners emotional engagement. Pedagogical application,\naffordances, and cognitive load are the corresponding measures that are\ninvolved. Students will gain insight into the reason behind their significantly\nhigher post-assessment Prediction Systems scores compared to preassessment\nscores through this work that leverages technology. This suggests that it is\neffective to include emotional elements in immersive learning scenarios. The\nresults of this study may help develop new strategies by leveraging the\nfeatures of immersive learning technology in educational technologies to\nimprove virtual reality and augmented reality experiences. Furthermore, the\neffectiveness of immersive learning environments can be raised by utilizing\nmagnetic, optical, or hybrid trackers that considerably improve object\ntracking.\n","authors":["Akey Sungheetha","Rajesh Sharma R","Chinnaiyan R"],"pdf_url":"https://arxiv.org/pdf/2312.11576v1.pdf","comment":"5 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.11023v1","updated":"2023-12-18T08:55:42Z","published":"2023-12-18T08:55:42Z","title":"Frequency Spectrum is More Effective for Multimodal Representation and\n Fusion: A Multimodal Spectrum Rumor Detector","summary":" Multimodal content, such as mixing text with images, presents significant\nchallenges to rumor detection in social media. Existing multimodal rumor\ndetection has focused on mixing tokens among spatial and sequential locations\nfor unimodal representation or fusing clues of rumor veracity across\nmodalities. However, they suffer from less discriminative unimodal\nrepresentation and are vulnerable to intricate location dependencies in the\ntime-consuming fusion of spatial and sequential tokens. This work makes the\nfirst attempt at multimodal rumor detection in the frequency domain, which\nefficiently transforms spatial features into the frequency spectrum and obtains\nhighly discriminative spectrum features for multimodal representation and\nfusion. A novel Frequency Spectrum Representation and fUsion network (FSRU)\nwith dual contrastive learning reveals the frequency spectrum is more effective\nfor multimodal representation and fusion, extracting the informative components\nfor rumor detection. FSRU involves three novel mechanisms: utilizing the\nFourier transform to convert features in the spatial domain to the frequency\ndomain, the unimodal spectrum compression, and the cross-modal spectrum\nco-selection module in the frequency domain. Substantial experiments show that\nFSRU achieves satisfactory multimodal rumor detection performance.\n","authors":["An Lao","Qi Zhang","Chongyang Shi","Longbing Cao","Kun Yi","Liang Hu","Duoqian Miao"],"pdf_url":"https://arxiv.org/pdf/2312.11023v1.pdf","comment":"12 pages, AAAI-2024"},{"id":"http://arxiv.org/abs/2312.10980v1","updated":"2023-12-18T07:03:35Z","published":"2023-12-18T07:03:35Z","title":"Liquid Leak Detection Using Thermal Images","summary":" This paper presents a comprehensive solution to address the critical\nchallenge of liquid leaks in the oil and gas industry, leveraging advanced\ncomputer vision and deep learning methodologies. Employing You Only Look Once\n(YOLO) and Real-Time Detection Transformer (RT DETR) models, our project\nfocuses on enhancing early identification of liquid leaks in key infrastructure\ncomponents such as pipelines, pumps, and tanks. Through the integration of\nsurveillance thermal cameras and sensors, the combined YOLO and RT DETR models\ndemonstrate remarkable efficacy in the continuous monitoring and analysis of\nvisual data within oil and gas facilities. YOLO's real-time object detection\ncapabilities swiftly recognize leaks and their patterns, while RT DETR excels\nin discerning specific leak-related features, particularly in thermal images.\nThis approach significantly improves the accuracy and speed of leak detection,\nultimately mitigating environmental and financial risks associated with liquid\nleaks.\n","authors":["Kalpak Bansod","Yanshan Wan","Yugesh Rai"],"pdf_url":"https://arxiv.org/pdf/2312.10980v1.pdf","comment":"13 pages, 9 figures"},{"id":"http://arxiv.org/abs/2309.11082v2","updated":"2023-12-18T06:47:29Z","published":"2023-09-20T06:08:11Z","title":"Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial\n Margin Contrastive Learning","summary":" In recent years, the explosion of web videos makes text-video retrieval\nincreasingly essential and popular for video filtering, recommendation, and\nsearch. Text-video retrieval aims to rank relevant text/video higher than\nirrelevant ones. The core of this task is to precisely measure the cross-modal\nsimilarity between texts and videos. Recently, contrastive learning methods\nhave shown promising results for text-video retrieval, most of which focus on\nthe construction of positive and negative pairs to learn text and video\nrepresentations. Nevertheless, they do not pay enough attention to hard\nnegative pairs and lack the ability to model different levels of semantic\nsimilarity. To address these two issues, this paper improves contrastive\nlearning using two novel techniques. First, to exploit hard examples for robust\ndiscriminative power, we propose a novel Dual-Modal Attention-Enhanced Module\n(DMAE) to mine hard negative pairs from textual and visual clues. By further\nintroducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively\nidentify all these hard negatives and explicitly highlight their impacts in the\ntraining loss. Second, our work argues that triplet samples can better model\nfine-grained semantic similarity compared to pairwise samples. We thereby\npresent a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to\nconstruct partial order triplet samples by automatically generating\nfine-grained hard negatives for matched text-video pairs. The proposed TPM-CL\ndesigns an adaptive token masking strategy with cross-modal interaction to\nmodel subtle semantic differences. Extensive experiments demonstrate that the\nproposed approach outperforms existing methods on four widely-used text-video\nretrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.\n","authors":["Chen Jiang","Hong Liu","Xuzheng Yu","Qing Wang","Yuan Cheng","Jia Xu","Zhongyi Liu","Qingpei Guo","Wei Chu","Ming Yang","Yuan Qi"],"pdf_url":"https://arxiv.org/pdf/2309.11082v2.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2312.10949v1","updated":"2023-12-18T05:55:46Z","published":"2023-12-18T05:55:46Z","title":"Leveraged Mel spectrograms using Harmonic and Percussive Components in\n Speech Emotion Recognition","summary":" Speech Emotion Recognition (SER) affective technology enables the intelligent\nembedded devices to interact with sensitivity. Similarly, call centre employees\nrecognise customers' emotions from their pitch, energy, and tone of voice so as\nto modify their speech for a high-quality interaction with customers. This work\nexplores, for the first time, the effects of the harmonic and percussive\ncomponents of Mel spectrograms in SER. We attempt to leverage the Mel\nspectrogram by decomposing distinguishable acoustic features for exploitation\nin our proposed architecture, which includes a novel feature map generator\nalgorithm, a CNN-based network feature extractor and a multi-layer perceptron\n(MLP) classifier. This study specifically focuses on effective data\naugmentation techniques for building an enriched hybrid-based feature map. This\nprocess results in a function that outputs a 2D image so that it can be used as\ninput data for a pre-trained CNN-VGG16 feature extractor. Furthermore, we also\ninvestigate other acoustic features such as MFCCs, chromagram, spectral\ncontrast, and the tonnetz to assess our proposed framework. A test accuracy of\n92.79% on the Berlin EMO-DB database is achieved. Our result is higher than\nprevious works using CNN-VGG16.\n","authors":["David Hason Rudd","Huan Huo","Guandong Xu"],"pdf_url":"https://arxiv.org/pdf/2312.10949v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2312.10937v1","updated":"2023-12-18T05:24:03Z","published":"2023-12-18T05:24:03Z","title":"An Extended Variational Mode Decomposition Algorithm Developed Speech\n Emotion Recognition Performance","summary":" Emotion recognition (ER) from speech signals is a robust approach since it\ncannot be imitated like facial expression or text based sentiment analysis.\nValuable information underlying the emotions are significant for human-computer\ninteractions enabling intelligent machines to interact with sensitivity in the\nreal world. Previous ER studies through speech signal processing have focused\nexclusively on associations between different signal mode decomposition methods\nand hidden informative features. However, improper decomposition parameter\nselections lead to informative signal component losses due to mode duplicating\nand mixing. In contrast, the current study proposes VGG-optiVMD, an empowered\nvariational mode decomposition algorithm, to distinguish meaningful speech\nfeatures and automatically select the number of decomposed modes and optimum\nbalancing parameter for the data fidelity constraint by assessing their effects\non the VGG16 flattening output layer. Various feature vectors were employed to\ntrain the VGG16 network on different databases and assess VGG-optiVMD\nreproducibility and reliability. One, two, and three-dimensional feature\nvectors were constructed by concatenating Mel-frequency cepstral coefficients,\nChromagram, Mel spectrograms, Tonnetz diagrams, and spectral centroids. Results\nconfirmed a synergistic relationship between the fine-tuning of the signal\nsample rate and decomposition parameters with classification accuracy,\nachieving state-of-the-art 96.09% accuracy in predicting seven emotions on the\nBerlin EMO-DB database.\n","authors":["David Hason Rudd","Huan Huo","Guandong Xu"],"pdf_url":"https://arxiv.org/pdf/2312.10937v1.pdf","comment":"12 pages"}]},"2023-12-19T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2312.12436v1","updated":"2023-12-19T18:59:22Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Haojia Lin","Zihan Wang","Timin Gao","Yongdong Luo","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Xiawu Zheng","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Xing Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2312.12436v1.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2312.12430v1","updated":"2023-12-19T18:56:52Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Yang Jun","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12364v1","updated":"2023-12-19T17:48:26Z","published":"2023-12-19T17:48:26Z","title":"SpokesBiz -- an Open Corpus of Conversational Polish","summary":" This paper announces the early release of SpokesBiz, a freely available\ncorpus of conversational Polish developed within the CLARIN-BIZ project and\ncomprising over 650 hours of recordings. The transcribed recordings have been\ndiarized and manually annotated for punctuation and casing. We outline the\ngeneral structure and content of the corpus, showcasing selected applications\nin linguistic research, evaluation and improvement of automatic speech\nrecognition (ASR) systems\n","authors":["Piotr Pęzik","Sylwia Karasińska","Anna Cichosz","Łukasz Jałowiecki","Konrad Kaczyński","Małgorzata Krawentek","Karolina Walkusz","Paweł Wilk","Mariusz Kleć","Krzysztof Szklanny","Szymon Marszałkowski"],"pdf_url":"https://arxiv.org/pdf/2312.12364v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08456v3","updated":"2023-12-19T17:42:02Z","published":"2023-06-14T11:57:31Z","title":"PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in\n Poetry Generation","summary":" Controllable text generation is a challenging and meaningful field in natural\nlanguage generation (NLG). Especially, poetry generation is a typical one with\nwell-defined and strict conditions for text generation which is an ideal\nplayground for the assessment of current methodologies. While prior works\nsucceeded in controlling either semantic or metrical aspects of poetry\ngeneration, simultaneously addressing both remains a challenge. In this paper,\nwe pioneer the use of the Diffusion model for generating sonnets and Chinese\nSongCi poetry to tackle such challenges. In terms of semantics, our\nPoetryDiffusion model, built upon the Diffusion model, generates entire\nsentences or poetry by comprehensively considering the entirety of sentence\ninformation. This approach enhances semantic expression, distinguishing it from\nautoregressive and large language models (LLMs). For metrical control, the\nseparation feature of diffusion generation and its constraint control module\nenable us to flexibly incorporate a novel metrical controller to manipulate and\nevaluate metrics (format and rhythm). The denoising process in PoetryDiffusion\nallows for gradual enhancement of semantics and flexible integration of the\nmetrical controller which can calculate and impose penalties on states that\nstray significantly from the target control distribution. Experimental results\non two datasets demonstrate that our model outperforms existing models in\nautomatic evaluation of semantic, metrical, and overall performance as well as\nhuman evaluation.\n","authors":["Zhiyuan Hu","Chumin Liu","Yue Feng","Anh Tuan Luu","Bryan Hooi"],"pdf_url":"https://arxiv.org/pdf/2306.08456v3.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12343v1","updated":"2023-12-19T17:16:43Z","published":"2023-12-19T17:16:43Z","title":"Avoiding Data Contamination in Language Model Evaluation: Dynamic Test\n Construction with Latest Materials","summary":" Data contamination in evaluation is getting increasingly prevalent with the\nemerge of language models pre-trained on super large, automatically-crawled\ncorpora. This problem leads to significant challenges in accurate assessment of\nmodel capabilities and generalisations. In this paper, we propose LatestEval,\nan automatic method leverages the most recent texts to create uncontaminated\nreading comprehension evaluations. LatestEval avoids data contamination by only\nusing texts published within a recent time window, ensuring no overlap with the\ntraining corpora of pre-trained language models. We develop LatestEval\nautomated pipeline to 1) gather latest texts; 2) identify key information, and\n3) construct questions targeting the information while removing the existing\nanswers from the context. This encourages models to infer the answers\nthemselves based on the remaining context, rather than just copy-paste. Our\nexperiments demonstrate that language models exhibit negligible memorisation\nbehaviours on LatestEval as opposed to previous benchmarks, suggesting a\nsignificantly reduced risk of data contamination and leading to a more robust\nevaluation. Data and code are publicly available at:\nhttps://github.com/liyucheng09/LatestEval.\n","authors":["Yucheng Li","Frank Geurin","Chenghua Lin"],"pdf_url":"https://arxiv.org/pdf/2312.12343v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12334v1","updated":"2023-12-19T17:01:58Z","published":"2023-12-19T17:01:58Z","title":"PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis","summary":" Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to\ninterpret the complex nature of human sentiments. Despite significant progress\nin multimodal architecture design, the field lacks comprehensive regularization\nmethods. This paper introduces PowMix, a versatile embedding space regularizer\nthat builds upon the strengths of unimodal mixing-based regularization\napproaches and introduces novel algorithmic components that are specifically\ntailored to multimodal tasks. PowMix is integrated before the fusion stage of\nmultimodal architectures and facilitates intra-modal mixing, such as mixing\ntext with text, to act as a regularizer. PowMix consists of five components: 1)\na varying number of generated mixed examples, 2) mixing factor reweighting, 3)\nanisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing.\nExtensive experimentation across benchmark MSA datasets and a broad spectrum of\ndiverse architectural designs demonstrate the efficacy of PowMix, as evidenced\nby consistent performance improvements over baselines and existing mixing\nmethods. An in-depth ablation study highlights the critical contribution of\neach PowMix component and how they synergistically enhance performance.\nFurthermore, algorithmic analysis demonstrates how PowMix behaves in different\nscenarios, particularly comparing early versus late fusion architectures.\nNotably, PowMix enhances overall performance without sacrificing model\nrobustness or magnifying text dominance. It also retains its strong performance\nin situations of limited data. Our findings position PowMix as a promising\nversatile regularization strategy for MSA. Code will be made available.\n","authors":["Efthymios Georgiou","Yannis Avrithis","Alexandros Potamianos"],"pdf_url":"https://arxiv.org/pdf/2312.12334v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2312.12321v1","updated":"2023-12-19T16:47:12Z","published":"2023-12-19T16:47:12Z","title":"Bypassing the Safety Training of Open-Source LLMs with Priming Attacks","summary":" With the recent surge in popularity of LLMs has come an ever-increasing need\nfor LLM safety training. In this paper, we show that SOTA open-source LLMs are\nvulnerable to simple, optimization-free attacks we refer to as $\\textit{priming\nattacks}$, which are easy to execute and effectively bypass alignment from\nsafety training. Our proposed attack improves the Attack Success Rate on\nHarmful Behaviors, as measured by Llama Guard, by up to $3.3\\times$ compared to\nbaselines. Source code and data are available at\nhttps://github.com/uiuc-focal-lab/llm-priming-attacks .\n","authors":["Jason Vega","Isha Chaudhary","Changming Xu","Gagandeep Singh"],"pdf_url":"https://arxiv.org/pdf/2312.12321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12299v1","updated":"2023-12-19T16:20:49Z","published":"2023-12-19T16:20:49Z","title":"Instruct-SCTG: Guiding Sequential Controlled Text Generation through\n Instructions","summary":" Instruction-tuned large language models have shown remarkable performance in\naligning generated text with user intentions across various tasks. However,\nmaintaining human-like discourse structure in the generated text remains a\nchallenging research question. In this paper, we propose Instruct-SCTG, a\nflexible and effective sequential framework that harnesses instruction-tuned\nlanguage models to generate structurally coherent text in both fine-tuned and\nzero-shot setups. Our framework generates articles in a section-by-section\nmanner, aligned with the desired human structure using natural language\ninstructions. Furthermore, we introduce a new automatic metric that measures\ndiscourse divergence in a fuzzy manner. Extensive experiments on three datasets\nfrom representative domains of news and recipes demonstrate the\nstate-of-the-art performance of our framework in imposing discourse structure\nduring text generation, as verified by both automatic and human evaluation. Our\ncode will be available on Github.\n","authors":["Yinhong Liu","Yixuan Su","Ehsan Shareghi","Nigel Collier"],"pdf_url":"https://arxiv.org/pdf/2312.12299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.14743v5","updated":"2023-12-19T16:05:51Z","published":"2023-11-21T18:41:26Z","title":"A Baseline Analysis of Reward Models' Ability To Accurately Analyze\n Foundation Models Under Distribution Shift","summary":" Foundation models, specifically Large Language Models (LLM's), have lately\ngained wide-spread attention and adoption. Reinforcement Learning with Human\nFeedback (RLHF) involves training a reward model to capture desired behaviors,\nwhich is then used to align LLM's. These reward models are additionally used at\ninference-time to estimate LLM responses' adherence to those desired behaviors.\nHowever, there is little work measuring how robust these reward models are to\ndistribution shifts. In this work, we evaluate how reward model performance -\nmeasured via accuracy and calibration (i.e. alignment between accuracy and\nconfidence) - is affected by distribution shift. We show novel calibration\npatterns and accuracy drops due to OOD prompts and responses, and that the\nreward model is more sensitive to shifts in responses than prompts.\nAdditionally, we adapt an OOD detection technique commonly used in\nclassification to the reward model setting to detect these distribution shifts\nin prompts and responses.\n","authors":["Will LeVine","Ben Pikus","Tony Chen","Sean Hendryx"],"pdf_url":"https://arxiv.org/pdf/2311.14743v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10493v2","updated":"2023-12-19T15:55:23Z","published":"2023-12-16T16:14:50Z","title":"Debiasing Multimodal Sarcasm Detection with Contrastive Learning","summary":" Despite commendable achievements made by existing work, prevailing multimodal\nsarcasm detection studies rely more on textual content over visual information.\nIt unavoidably induces spurious correlations between textual words and labels,\nthereby significantly hindering the models' generalization capability. To\naddress this problem, we define the task of out-of-distribution (OOD)\nmultimodal sarcasm detection, which aims to evaluate models' generalizability\nwhen the word distribution is different in training and testing settings.\nMoreover, we propose a novel debiasing multimodal sarcasm detection framework\nwith contrastive learning, which aims to mitigate the harmful effect of biased\ntextual factors for robust OOD generalization. In particular, we first design\ncounterfactual data augmentation to construct the positive samples with\ndissimilar word biases and negative samples with similar word biases.\nSubsequently, we devise an adapted debiasing contrastive learning mechanism to\nempower the model to learn robust task-relevant features and alleviate the\nadverse effect of biased words. Extensive experiments show the superiority of\nthe proposed framework.\n","authors":["Mengzhao Jia","Can Xie","Liqiang Jing"],"pdf_url":"https://arxiv.org/pdf/2312.10493v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12269v1","updated":"2023-12-19T15:51:49Z","published":"2023-12-19T15:51:49Z","title":"Automated speech audiometry: Can it work using open-source pre-trained\n Kaldi-NL automatic speech recognition?","summary":" A practical speech audiometry tool is the digits-in-noise (DIN) test for\nhearing screening of populations of varying ages and hearing status. The test\nis usually conducted by a human supervisor (e.g., clinician), who scores the\nresponses spoken by the listener, or online, where a software scores the\nresponses entered by the listener. The test has 24 digit-triplets presented in\nan adaptive staircase procedure, resulting in a speech reception threshold\n(SRT). We propose an alternative automated DIN test setup that can evaluate\nspoken responses whilst conducted without a human supervisor, using the\nopen-source automatic speech recognition toolkit, Kaldi-NL. Thirty\nself-reported normal-hearing Dutch adults (19-64 years) completed one\nDIN+Kaldi-NL test. Their spoken responses were recorded, and used for\nevaluating the transcript of decoded responses by Kaldi-NL. Study 1 evaluated\nthe Kaldi-NL performance through its word error rate (WER), percentage of\nsummed decoding errors regarding only digits found in the transcript compared\nto the total number of digits present in the spoken responses. Average WER\nacross participants was 5.0% (range 0 - 48%, SD = 8.8%), with average decoding\nerrors in three triplets per participant. Study 2 analysed the effect that\ntriplets with decoding errors from Kaldi-NL had on the DIN test output (SRT),\nusing bootstrapping simulations. Previous research indicated 0.70 dB as the\ntypical within-subject SRT variability for normal-hearing adults. Study 2\nshowed that up to four triplets with decoding errors produce SRT variations\nwithin this range, suggesting that our proposed setup could be feasible for\nclinical applications.\n","authors":["Gloria Araiza-Illan","Luke Meyer","Khiet P. Truong","Deniz Baskent"],"pdf_url":"https://arxiv.org/pdf/2312.12269v1.pdf","comment":"25 pages (double spaced), 5 figures, 3 tables, 54 references"},{"id":"http://arxiv.org/abs/2312.12253v1","updated":"2023-12-19T15:37:27Z","published":"2023-12-19T15:37:27Z","title":"Geo-located Aspect Based Sentiment Analysis (ABSA) for Crowdsourced\n Evaluation of Urban Environments","summary":" Sentiment analysis methods are rapidly being adopted by the field of Urban\nDesign and Planning, for the crowdsourced evaluation of urban environments.\nHowever, most models used within this domain are able to identify positive or\nnegative sentiment associated with a textual appraisal as a whole, without\ninferring information about specific urban aspects contained within it, or the\nsentiment associated with them. While Aspect Based Sentiment Analysis (ABSA) is\nbecoming increasingly popular, most existing ABSA models are trained on\nnon-urban themes such as restaurants, electronics, consumer goods and the like.\nThis body of research develops an ABSA model capable of extracting urban\naspects contained within geo-located textual urban appraisals, along with\ncorresponding aspect sentiment classification. We annotate a dataset of 2500\ncrowdsourced reviews of public parks, and train a Bidirectional Encoder\nRepresentations from Transformers (BERT) model with Local Context Focus (LCF)\non this data. Our model achieves significant improvement in prediction accuracy\non urban reviews, for both Aspect Term Extraction (ATE) and Aspect Sentiment\nClassification (ASC) tasks. For demonstrative analysis, positive and negative\nurban aspects across Boston are spatially visualized. We hope that this model\nis useful for designers and planners for fine-grained urban sentiment\nevaluation.\n","authors":["Demircan Tas","Rohit Priyadarshi Sanatani"],"pdf_url":"https://arxiv.org/pdf/2312.12253v1.pdf","comment":"Created for 6.8610, Quantitative Methods for Natural Language\n Processing at MIT Fall 2022. 5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.12241v1","updated":"2023-12-19T15:25:39Z","published":"2023-12-19T15:25:39Z","title":"GeomVerse: A Systematic Evaluation of Large Models for Geometric\n Reasoning","summary":" Large language models have shown impressive results for multi-hop\nmathematical reasoning when the input question is only textual. Many\nmathematical reasoning problems, however, contain both text and image. With the\never-increasing adoption of vision language models (VLMs), understanding their\nreasoning abilities for such problems is crucial. In this paper, we evaluate\nthe reasoning capabilities of VLMs along various axes through the lens of\ngeometry problems. We procedurally create a synthetic dataset of geometry\nquestions with controllable difficulty levels along multiple axes, thus\nenabling a systematic evaluation. The empirical results obtained using our\nbenchmark for state-of-the-art VLMs indicate that these models are not as\ncapable in subjects like geometry (and, by generalization, other topics\nrequiring similar reasoning) as suggested by previous benchmarks. This is made\nespecially clear by the construction of our benchmark at various depth levels,\nsince solving higher-depth problems requires long chains of reasoning rather\nthan additional memorized knowledge. We release the dataset for further\nresearch in this area.\n","authors":["Mehran Kazemi","Hamidreza Alvari","Ankit Anand","Jialin Wu","Xi Chen","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2312.12241v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14160v4","updated":"2023-12-19T15:13:52Z","published":"2023-05-23T15:26:20Z","title":"Label Words are Anchors: An Information Flow Perspective for\n Understanding In-Context Learning","summary":" In-context learning (ICL) emerges as a promising capability of large language\nmodels (LLMs) by providing them with demonstration examples to perform diverse\ntasks. However, the underlying mechanism of how LLMs learn from the provided\ncontext remains under-explored. In this paper, we investigate the working\nmechanism of ICL through an information flow lens. Our findings reveal that\nlabel words in the demonstration examples function as anchors: (1) semantic\ninformation aggregates into label word representations during the shallow\ncomputation layers' processing; (2) the consolidated information in label words\nserves as a reference for LLMs' final predictions. Based on these insights, we\nintroduce an anchor re-weighting method to improve ICL performance, a\ndemonstration compression technique to expedite inference, and an analysis\nframework for diagnosing ICL errors in GPT2-XL. The promising applications of\nour findings again validate the uncovered ICL working mechanism and pave the\nway for future studies.\n","authors":["Lean Wang","Lei Li","Damai Dai","Deli Chen","Hao Zhou","Fandong Meng","Jie Zhou","Xu Sun"],"pdf_url":"https://arxiv.org/pdf/2305.14160v4.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14901v2","updated":"2023-12-19T14:12:04Z","published":"2023-05-24T08:55:08Z","title":"Chain-of-Questions Training with Latent Answers for Robust Multistep\n Question Answering","summary":" We train a language model (LM) to robustly answer multistep questions by\ngenerating and answering sub-questions. We propose Chain-of-Questions, a\nframework that trains a model to generate sub-questions and sub-answers one at\na time by leveraging human annotated question decomposition meaning\nrepresentation (QDMR). The key technical challenge is that QDMR only contains\nsub-questions but not answers to those sub-questions, so we treat sub-answers\nas latent variables and optimize them using a novel dynamic mixture of Hard-EM\nand MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods\nby 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA\nadversarial set, thus demonstrating the effectiveness and robustness of our\nframework.\n","authors":["Wang Zhu","Jesse Thomason","Robin Jia"],"pdf_url":"https://arxiv.org/pdf/2305.14901v2.pdf","comment":"Accepted by the EMNLP 2023"},{"id":"http://arxiv.org/abs/2311.17280v3","updated":"2023-12-19T14:04:33Z","published":"2023-11-28T23:40:13Z","title":"Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?","summary":" Data augmentation via back-translation is common when pretraining\nVision-and-Language Navigation (VLN) models, even though the generated\ninstructions are noisy. But: does that noise matter? We find that nonsensical\nor irrelevant language instructions during pretraining can have little effect\non downstream performance for both HAMT and VLN-BERT on R2R, and is still\nbetter than only using clean, human data. To underscore these results, we\nconcoct an efficient augmentation method, Unigram + Object, which generates\nnonsensical instructions that nonetheless improve downstream performance. Our\nfindings suggest that what matters for VLN R2R pretraining is the quantity of\nvisual trajectories, not the quality of instructions.\n","authors":["Wang Zhu","Ishika Singh","Yuan Huang","Robin Jia","Jesse Thomason"],"pdf_url":"https://arxiv.org/pdf/2311.17280v3.pdf","comment":"Accepted by O-DRUM @ CVPR 2023"},{"id":"http://arxiv.org/abs/2312.12148v1","updated":"2023-12-19T13:31:24Z","published":"2023-12-19T13:31:24Z","title":"Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models:\n A Critical Review and Assessment","summary":" With the continuous growth in the number of parameters of transformer-based\npretrained language models (PLMs), particularly the emergence of large language\nmodels (LLMs) with billions of parameters, many natural language processing\n(NLP) tasks have demonstrated remarkable success. However, the enormous size\nand computational demands of these models pose significant challenges for\nadapting them to specific downstream tasks, especially in environments with\nlimited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers\nan effective solution by reducing the number of fine-tuning parameters and\nmemory usage while achieving comparable performance to full fine-tuning. The\ndemands for fine-tuning PLMs, especially LLMs, have led to a surge in the\ndevelopment of PEFT methods, as depicted in Fig. 1. In this paper, we present a\ncomprehensive and systematic review of PEFT methods for PLMs. We summarize\nthese PEFT methods, discuss their applications, and outline future directions.\nFurthermore, we conduct experiments using several representative PEFT methods\nto better understand their effectiveness in parameter efficiency and memory\nefficiency. By offering insights into the latest advancements and practical\napplications, this survey serves as an invaluable resource for researchers and\npractitioners seeking to navigate the challenges and opportunities presented by\nPEFT in the context of PLMs.\n","authors":["Lingling Xu","Haoran Xie","Si-Zhao Joe Qin","Xiaohui Tao","Fu Lee Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12148v1.pdf","comment":"20 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.11193v2","updated":"2023-12-19T13:24:26Z","published":"2023-12-18T13:40:16Z","title":"\"Paraphrasing The Original Text\" Makes High Accuracy Long-Context QA","summary":" Although LLMs continue to iterate and improve, most open-source models still\nhave a context window of no more than 4k, limiting their ability to handle\nlong-context problems. Most existing open-source models for long-context chat\nstill lack satisfactory accuracy. To address this issue, I approach it from the\nperspective of training data and theoretically prove that training the\ncapability to handle long contexts requires \"effective\" rather than \"long\"\ndata. Based on this, I propose using the \"original text paraphrase\" task, and\nsuccessfully extend the context window of the existing model to 32k by a\nlow-cost and effective method, achieving extremely high accuracy in\nmulti-document-QA and surpassing all existing open-source models of the same\nscale. The model and training data have been open-sourced on\nHuggingFace(https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k) and\nWiseModel(https://wisemodel.cn/models/yuyijiong/Qwen-14b-chat-yarn-32k).\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2312.11193v2.pdf","comment":"Chinese version of this paper can be downloaded from\n (https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)"},{"id":"http://arxiv.org/abs/2310.13023v2","updated":"2023-12-19T13:23:56Z","published":"2023-10-19T06:17:46Z","title":"GraphGPT: Graph Instruction Tuning for Large Language Models","summary":" Graph Neural Networks (GNNs) have advanced graph structure understanding via\nrecursive information exchange and aggregation among graph nodes. To improve\nmodel robustness, self-supervised learning (SSL) has emerged as a promising\napproach for data augmentation. However, existing methods for generating\npre-trained graph embeddings often rely on fine-tuning with specific downstream\ntask labels, which limits their usability in scenarios where labeled data is\nscarce or unavailable. To address this, our research focuses on advancing the\ngeneralization capabilities of graph models in challenging zero-shot learning\nscenarios. Inspired by the success of large language models (LLMs), we aim to\ndevelop a graph-oriented LLM that can achieve high generalization across\ndiverse downstream datasets and tasks, even without any information available\nfrom the downstream graph data. In this work, we present the GraphGPT framework\nthat aligns LLMs with graph structural knowledge with a graph instruction\ntuning paradigm. Our framework incorporates a text-graph grounding component to\nestablish a connection between textual information and graph structures.\nAdditionally, we propose a dual-stage instruction tuning paradigm, accompanied\nby a lightweight graph-text alignment projector. This paradigm explores\nself-supervised graph structural signals and task-specific graph instructions,\nto guide LLMs in understanding complex graph structures and improving their\nadaptability across different downstream tasks. Our framework is evaluated on\nsupervised and zero-shot graph learning tasks, demonstrating superior\ngeneralization and outperforming state-of-the-art baselines.\n","authors":["Jiabin Tang","Yuhao Yang","Wei Wei","Lei Shi","Lixin Su","Suqi Cheng","Dawei Yin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.13023v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12141v1","updated":"2023-12-19T13:23:18Z","published":"2023-12-19T13:23:18Z","title":"Exploring the Residual Stream of Transformers","summary":" Transformer-based models have achieved great breakthroughs in recent years.\nHowever, there are many significant questions that have not been answered in\nthe field of explaining the reason why the models have powerful outputs. We do\nnot know how to locate the models' important parameters storing the knowledge\nfor predicting the next word, and whether these parameters are stored on the\nsame layer/module or different ones. Moreover, we do not understand the\nmechanism to merge the knowledge into the final embedding for next word\nprediction. In this paper, we explore the residual stream of transformers to\nincrease the interpretability. We find the mechanism behind residual connection\nis a direct addition function on before-softmax values, so the probabilities of\ntokens with larger before-softmax values will increase. Moreover, we prove that\nusing log probability increase as contribution scores is reasonable, and based\non this we can locate important parameters. Besides, we propose a method to\nanalyze how previous layers affect upper layers by comparing the inner\nproducts. The experimental results and case study show that our research can\nincrease the interpretability of transformer-based models. We will release our\ncode on https://github.com/zepingyu0512/residualstream.\n","authors":["Zeping Yu","Kailai Yang","Zhiwei Liu","Sophia Ananiadou"],"pdf_url":"https://arxiv.org/pdf/2312.12141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09897v2","updated":"2023-12-19T13:05:12Z","published":"2022-12-19T22:37:46Z","title":"Inducing Character-level Structure in Subword-based Language Models with\n Type-level Interchange Intervention Training","summary":" Language tasks involving character-level manipulations (e.g., spelling\ncorrections, arithmetic operations, word games) are challenging for models\noperating on subword units. To address this, we develop a causal intervention\nframework to learn robust and interpretable character representations inside\nsubword-based language models. Our method treats each character as a typed\nvariable in a causal model and learns such causal structures by adapting the\ninterchange intervention training method of Geiger et al. (2021). We\nadditionally introduce a suite of character-level tasks that systematically\nvary in their dependence on meaning and sequence-level context. While\ncharacter-level models still perform best on purely form-based tasks like\nstring reversal, our method outperforms character-level models on more complex\ntasks that blend form, meaning, and context, such as spelling correction in\ncontext and word search games. Compared with standard subword-based models, our\napproach also significantly improves robustness on unseen token sequences and\nleads to human-interpretable internal representations of characters.\n","authors":["Jing Huang","Zhengxuan Wu","Kyle Mahowald","Christopher Potts"],"pdf_url":"https://arxiv.org/pdf/2212.09897v2.pdf","comment":"Findings of the Association for Computational Linguistics: ACL 2023"},{"id":"http://arxiv.org/abs/2310.09767v2","updated":"2023-12-19T13:01:50Z","published":"2023-10-15T07:58:52Z","title":"VLIS: Unimodal Language Models Guide Multimodal Language Generation","summary":" Multimodal language generation, which leverages the synergy of language and\nvision, is a rapidly expanding field. However, existing vision-language models\nface challenges in tasks that require complex linguistic understanding. To\naddress this issue, we introduce Visual-Language models as Importance Sampling\nweights (VLIS), a novel framework that combines the visual conditioning\ncapability of vision-language models with the language understanding of\nunimodal text-only language models without further training. It extracts\npointwise mutual information of each image and text from a visual-language\nmodel and uses the value as an importance sampling weight to adjust the token\nlikelihood from a text-only model. VLIS improves vision-language models on\ndiverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and\nScienceQA) and complex text generation (Concadia, Image Paragraph Captioning,\nand ROCStories). Our results suggest that VLIS represents a promising new\ndirection for multimodal language generation.\n","authors":["Jiwan Chung","Youngjae Yu"],"pdf_url":"https://arxiv.org/pdf/2310.09767v2.pdf","comment":"Accepted as main paper in EMNLP 2023"},{"id":"http://arxiv.org/abs/2101.00153v3","updated":"2023-12-19T12:57:23Z","published":"2021-01-01T03:29:21Z","title":"Graphmax for Text Generation","summary":" In text generation, a large language model (LM) makes a choice of each new\nword based only on the former selection of its context using the softmax\nfunction. Nevertheless, the link statistics information of concurrent words\nbased on a scene-specific corpus is valuable in choosing the next word, which\ncan help to ensure the topic of the generated text to be aligned with the\ncurrent task. To fully explore the co-occurrence information,we propose a\ngraphmax function for task-specific text generation. Using the graph-based\nregularization, graphmax enables the final word choice to be determined by both\nthe global knowledge from the LM and the local knowledge from the\nscene-specific corpus. The traditional softmax function is regularized with a\ngraph total variation (GTV) term, which incorporates the local knowledge into\nthe LM and encourages the model to consider the statistical relationships\nbetween words in a scene-specific corpus. The proposed graphmax is versatile\nand can be readily plugged into any large pre-trained LM for text generation\nand machine translation. Through extensive experiments, we demonstrate that the\nnew GTV-based regularization can improve performances in various natural\nlanguage processing tasks in comparison with existing methods. Moreover,\nthrough human experiments, we observe that participants can easily distinguish\nthe text generated by graphmax or softmax.\n","authors":["Liu Bin","Yin Guosheng"],"pdf_url":"https://arxiv.org/pdf/2101.00153v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07924v4","updated":"2023-12-19T12:56:13Z","published":"2023-07-16T02:11:34Z","title":"Communicative Agents for Software Development","summary":" Software engineering is a domain characterized by intricate decision-making\nprocesses, often relying on nuanced intuition and consultation. Recent\nadvancements in deep learning have started to revolutionize software\nengineering practices through elaborate designs implemented at various stages\nof software development. In this paper, we present an innovative paradigm that\nleverages large language models (LLMs) throughout the entire software\ndevelopment process, streamlining and unifying key processes through natural\nlanguage communication, thereby eliminating the need for specialized models at\neach phase. At the core of this paradigm lies ChatDev, a virtual chat-powered\nsoftware development company that mirrors the established waterfall model,\nmeticulously dividing the development process into four distinct chronological\nstages: designing, coding, testing, and documenting. Each stage engages a team\nof \"software agents\", such as programmers, code reviewers, and test engineers,\nfostering collaborative dialogue and facilitating a seamless workflow. The chat\nchain acts as a facilitator, breaking down each stage into atomic subtasks.\nThis enables dual roles, allowing for proposing and validating solutions\nthrough context-aware communication, leading to efficient resolution of\nspecific subtasks. The instrumental analysis of ChatDev highlights its\nremarkable efficacy in software generation, enabling the completion of the\nentire software development process in under seven minutes at a cost of less\nthan one dollar. It not only identifies and alleviates potential\nvulnerabilities but also rectifies potential hallucinations while maintaining\ncommendable efficiency and cost-effectiveness. The potential of ChatDev unveils\nfresh possibilities for integrating LLMs into the realm of software\ndevelopment. Our code is available at https://github.com/OpenBMB/ChatDev.\n","authors":["Chen Qian","Xin Cong","Wei Liu","Cheng Yang","Weize Chen","Yusheng Su","Yufan Dang","Jiahao Li","Juyuan Xu","Dahai Li","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2307.07924v4.pdf","comment":"https://github.com/OpenBMB/ChatDev"},{"id":"http://arxiv.org/abs/2312.12108v1","updated":"2023-12-19T12:32:27Z","published":"2023-12-19T12:32:27Z","title":"Knowledge Graph Error Detection with Contrastive Confidence Adaption","summary":" Knowledge graphs (KGs) often contain various errors. Previous works on\ndetecting errors in KGs mainly rely on triplet embedding from graph structure.\nWe conduct an empirical study and find that these works struggle to\ndiscriminate noise from semantically-similar correct triplets. In this paper,\nwe propose a KG error detection model CCA to integrate both textual and graph\nstructural information from triplet reconstruction for better distinguishing\nsemantics. We design interactive contrastive learning to capture the\ndifferences between textual and structural patterns. Furthermore, we construct\nrealistic datasets with semantically-similar noise and adversarial noise.\nExperimental results demonstrate that CCA outperforms state-of-the-art\nbaselines, especially in detecting semantically-similar noise and adversarial\nnoise.\n","authors":["Xiangyu Liu","Yang Liu","Wei Hu"],"pdf_url":"https://arxiv.org/pdf/2312.12108v1.pdf","comment":"Accepted in the 38th AAAI Conference on Artificial Intelligence (AAAI\n 2024)"},{"id":"http://arxiv.org/abs/2310.18313v2","updated":"2023-12-19T12:27:58Z","published":"2023-10-27T17:59:51Z","title":"FP8-LM: Training FP8 Large Language Models","summary":" In this paper, we explore FP8 low-bit data formats for efficient training of\nlarge language models (LLMs). Our key insight is that most variables, such as\ngradients and optimizer states, in LLM training can employ low-precision data\nformats without compromising model accuracy and requiring no changes to\nhyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision\nframework for training LLMs. This framework offers three levels of FP8\nutilization to streamline mixed-precision and distributed parallel training for\nLLMs. It gradually incorporates 8-bit gradients, optimizer states, and\ndistributed learning in an incremental manner. Experiment results show that,\nduring the training of GPT-175B model on H100 GPU platform, our FP8\nmixed-precision training framework not only achieved a remarkable 39% reduction\nin real memory usage but also ran 75% faster than the widely adopted BF16\nframework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer\nEngine by 37%. This largely reduces the training costs for large foundation\nmodels. Furthermore, our FP8 mixed-precision training methodology is generic.\nIt can be seamlessly applied to other tasks such as LLM instruction tuning and\nreinforcement learning with human feedback, offering savings in fine-tuning\nexpenses. Our FP8 low-precision training framework is open-sourced at\n{https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.\n","authors":["Houwen Peng","Kan Wu","Yixuan Wei","Guoshuai Zhao","Yuxiang Yang","Ze Liu","Yifan Xiong","Ziyue Yang","Bolin Ni","Jingcheng Hu","Ruihang Li","Miaosen Zhang","Chen Li","Jia Ning","Ruizhe Wang","Zheng Zhang","Shuguang Liu","Joe Chau","Han Hu","Peng Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.18313v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.06453v2","updated":"2023-12-19T12:13:25Z","published":"2023-09-12T08:16:58Z","title":"Narrowing the Gap between Supervised and Unsupervised Sentence\n Representation Learning with Large Language Model","summary":" Sentence Representation Learning (SRL) is a fundamental task in Natural\nLanguage Processing (NLP), with the Contrastive Learning of Sentence Embeddings\n(CSE) being the mainstream technique due to its superior performance. An\nintriguing phenomenon in CSE is the significant performance gap between\nsupervised and unsupervised methods, with their only difference lying in the\ntraining data. Previous works attribute this performance gap to differences in\ntwo representation properties (alignment and uniformity). However, since\nalignment and uniformity only measure the results, they fail to answer \"What\naspects of the training data contribute to the performance gap?\" and \"How can\nthe performance gap be narrowed?\", In this paper, we conduct empirical\nexperiments to answer these \"What\" and \"How\" questions. We first answer the\n\"What\" question by thoroughly comparing the behavior of supervised and\nunsupervised CSE during their respective training processes. From the\ncomparison, we identify the similarity pattern as a key factor to the\nperformance gap, and introduce a metric, called Relative Fitting Difficulty\n(RFD), to measure the complexity of the similarity pattern. Then, based on the\ninsights gained from the \"What\" question, we tackle the \"How\" question by\nincreasing the pattern complexity of the training data. We achieve this by\nleveraging the In-Context Learning (ICL) capability of the Large Language Model\n(LLM) to generate data that simulates complex patterns. By utilizing the\nhierarchical patterns in the LLM-generated data, we effectively narrow the gap\nbetween supervised and unsupervised CSE. We release our codes and appendix at\nhttps://github.com/BDBC-KG-NLP/NGCSE.\n","authors":["Mingxin Li","Richong Zhang","Zhijie Nie","Yongyi Mao"],"pdf_url":"https://arxiv.org/pdf/2309.06453v2.pdf","comment":"Accepted at AAAI24"},{"id":"http://arxiv.org/abs/2312.12037v1","updated":"2023-12-19T10:46:13Z","published":"2023-12-19T10:46:13Z","title":"Founder-GPT: Self-play to evaluate the Founder-Idea fit","summary":" This research introduces an innovative evaluation method for the\n\"founder-idea\" fit in early-stage startups, utilizing advanced large language\nmodel techniques to assess founders' profiles against their startup ideas to\nenhance decision-making. Embeddings, self-play, tree-of-thought, and\ncritique-based refinement techniques show early promising results that each\nidea's success patterns are unique and they should be evaluated based on the\ncontext of the founder's background.\n","authors":["Sichao Xiong","Yigit Ihlamur"],"pdf_url":"https://arxiv.org/pdf/2312.12037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12021v1","updated":"2023-12-19T10:16:24Z","published":"2023-12-19T10:16:24Z","title":"Synergistic Anchored Contrastive Pre-training for Few-Shot Relation\n Extraction","summary":" Few-shot Relation Extraction (FSRE) aims to extract relational facts from a\nsparse set of labeled corpora. Recent studies have shown promising results in\nFSRE by employing Pre-trained Language Models (PLMs) within the framework of\nsupervised contrastive learning, which considers both instances and label\nfacts. However, how to effectively harness massive instance-label pairs to\nencompass the learned representation with semantic richness in this learning\nparadigm is not fully explored. To address this gap, we introduce a novel\nsynergistic anchored contrastive pre-training framework. This framework is\nmotivated by the insight that the diverse viewpoints conveyed through\ninstance-label pairs capture incomplete yet complementary intrinsic textual\nsemantics. Specifically, our framework involves a symmetrical contrastive\nobjective that encompasses both sentence-anchored and label-anchored\ncontrastive losses. By combining these two losses, the model establishes a\nrobust and uniform representation space. This space effectively captures the\nreciprocal alignment of feature distributions among instances and relational\nfacts, simultaneously enhancing the maximization of mutual information across\ndiverse perspectives within the same relation. Experimental results demonstrate\nthat our framework achieves significant performance enhancements compared to\nbaseline models in downstream FSRE tasks. Furthermore, our approach exhibits\nsuperior adaptability to handle the challenges of domain shift and zero-shot\nrelation extraction. Our code is available online at\nhttps://github.com/AONE-NLP/FSRE-SaCon.\n","authors":[" DaLuo","Yanglei Gan","Rui Hou","Run Lin","Qiao Liu","Yuxiang Cai","Wannian Gao"],"pdf_url":"https://arxiv.org/pdf/2312.12021v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05161v4","updated":"2023-12-19T10:13:33Z","published":"2023-10-08T13:36:05Z","title":"Recurrent Neural Language Models as Probabilistic Finite-state Automata","summary":" Studying language models (LMs) in terms of well-understood formalisms allows\nus to precisely characterize their abilities and limitations. Previous work has\ninvestigated the representational capacity of recurrent neural network (RNN)\nLMs in terms of their capacity to recognize unweighted formal languages.\nHowever, LMs do not describe unweighted formal languages -- rather, they define\n\\emph{probability distributions} over strings. In this work, we study what\nclasses of such probability distributions RNN LMs can represent, which allows\nus to make more direct statements about their capabilities. We show that simple\nRNNs are equivalent to a subclass of probabilistic finite-state automata, and\ncan thus model a strict subset of probability distributions expressible by\nfinite-state models. Furthermore, we study the space complexity of representing\nfinite-state LMs with RNNs. We show that, to represent an arbitrary\ndeterministic finite-state LM with $N$ states over an alphabet $\\alphabet$, an\nRNN requires $\\Omega\\left(N |\\Sigma|\\right)$ neurons. These results present a\nfirst step towards characterizing the classes of distributions RNN LMs can\nrepresent and thus help us understand their capabilities and limitations.\n","authors":["Anej Svete","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.05161v4.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2312.12009v1","updated":"2023-12-19T09:58:54Z","published":"2023-12-19T09:58:54Z","title":"Active Preference Inference using Language Models and Probabilistic\n Reasoning","summary":" Actively inferring user preferences, for example by asking good questions, is\nimportant for any human-facing decision-making system. Active inference allows\nsuch systems to adapt and personalize themselves to nuanced individual\npreferences. To enable this ability for instruction-tuned large language models\n(LLMs), one may prompt them to ask users questions to infer their preferences,\ntransforming the language models into more robust, interactive systems.\nHowever, out of the box, these models are not efficient at extracting\npreferences: the questions they generate are not informative, requiring a high\nnumber of user interactions and impeding the usability of the downstream\nsystem. In this work, we introduce an inference-time algorithm that helps LLMs\nquickly infer preferences by using more informative questions. Our algorithm\nuses a probabilistic model whose conditional distributions are defined by\nprompting an LLM, and returns questions that optimize expected entropy and\nexpected model change. Results in a simplified interactive web shopping setting\nwith real product items show that an LLM equipped with our entropy reduction\nalgorithm outperforms baselines with the same underlying LLM on task\nperformance while using fewer user interactions.\n","authors":["Top Piriyakulkij","Volodymyr Kuleshov","Kevin Ellis"],"pdf_url":"https://arxiv.org/pdf/2312.12009v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12006v1","updated":"2023-12-19T09:54:27Z","published":"2023-12-19T09:54:27Z","title":"Can ChatGPT be Your Personal Medical Assistant?","summary":" The advanced large language model (LLM) ChatGPT has shown its potential in\ndifferent domains and remains unbeaten due to its characteristics compared to\nother LLMs. This study aims to evaluate the potential of using a fine-tuned\nChatGPT model as a personal medical assistant in the Arabic language. To do so,\nthis study uses publicly available online questions and answering datasets in\nArabic language. There are almost 430K questions and answers for 20\ndisease-specific categories. GPT-3.5-turbo model was fine-tuned with a portion\nof this dataset. The performance of this fine-tuned model was evaluated through\nautomated and human evaluation. The automated evaluations include perplexity,\ncoherence, similarity, and token count. Native Arabic speakers with medical\nknowledge evaluated the generated text by calculating relevance, accuracy,\nprecision, logic, and originality. The overall result shows that ChatGPT has a\nbright future in medical assistance.\n","authors":["Md. Rafiul Biswas","Ashhadul Islam","Zubair Shah","Wajdi Zaghouani","Samir Brahim Belhaouari"],"pdf_url":"https://arxiv.org/pdf/2312.12006v1.pdf","comment":"5 pages, 7 figures, two tables, Accepted on The International\n Symposium on Foundation and Large Language Models (FLLM2023)"},{"id":"http://arxiv.org/abs/2312.11997v1","updated":"2023-12-19T09:39:27Z","published":"2023-12-19T09:39:27Z","title":"Coreference Graph Guidance for Mind-Map Generation","summary":" Mind-map generation aims to process a document into a hierarchical structure\nto show its central idea and branches. Such a manner is more conducive to\nunderstanding the logic and semantics of the document than plain text.\nRecently, a state-of-the-art method encodes the sentences of a document\nsequentially and converts them to a relation graph via sequence-to-graph.\nThough this method is efficient to generate mind-maps in parallel, its\nmechanism focuses more on sequential features while hardly capturing structural\ninformation. Moreover, it's difficult to model long-range semantic relations.\nIn this work, we propose a coreference-guided mind-map generation network\n(CMGN) to incorporate external structure knowledge. Specifically, we construct\na coreference graph based on the coreference semantic relationship to introduce\nthe graph structure information. Then we employ a coreference graph encoder to\nmine the potential governing relations between sentences. In order to exclude\nnoise and better utilize the information of the coreference graph, we adopt a\ngraph enhancement module in a contrastive learning manner. Experimental results\ndemonstrate that our model outperforms all the existing methods. The case study\nfurther proves that our model can more accurately and concisely reveal the\nstructure and semantics of a document. Code and data are available at\nhttps://github.com/Cyno2232/CMGN.\n","authors":["Zhuowei Zhang","Mengting Hu","Yinhao Bai","Zhen Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.11997v1.pdf","comment":"9 pages, 6 figures. Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11985v1","updated":"2023-12-19T09:26:46Z","published":"2023-12-19T09:26:46Z","title":"Climate Change from Large Language Models","summary":" Climate change presents significant challenges to the global community, and\nit is imperative to raise widespread awareness of the climate crisis and\neducate users about low-carbon living. Artificial intelligence, particularly\nlarge language models (LLMs), have emerged as powerful tools in mitigating the\nclimate crisis, leveraging their extensive knowledge, broad user base, and\nnatural language interaction capabilities. However, despite the growing body of\nresearch on climate change, there is a lack of comprehensive assessments of\nclimate crisis knowledge within LLMs. This paper aims to resolve this gap by\nproposing an automatic evaluation framework. We employ a hybrid approach to\ndata acquisition that combines data synthesis and manual collection to compile\na diverse set of questions related to the climate crisis. These questions cover\nvarious aspects of climate change, including its causes, impacts, mitigation\nstrategies, and adaptation measures. We then evaluate the model knowledge\nthrough prompt engineering based on the collected questions and generated\nanswers. We propose a set of comprehensive metrics to evaluate the climate\ncrisis knowledge, incorporating indicators from 10 different perspectives.\nExperimental results show that our method is effective in evaluating the\nknowledge of LLMs regarding the climate crisis. We evaluate several\nstate-of-the-art LLMs and find that their knowledge falls short in terms of\ntimeliness.\n","authors":["Hongyin Zhu","Prayag Tiwari"],"pdf_url":"https://arxiv.org/pdf/2312.11985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11983v1","updated":"2023-12-19T09:23:48Z","published":"2023-12-19T09:23:48Z","title":"Fluctuation-based Adaptive Structured Pruning for Large Language Models","summary":" Network Pruning is a promising way to address the huge computing resource\ndemands of the deployment and inference of Large Language Models (LLMs).\nRetraining-free is important for LLMs' pruning methods. However, almost all of\nthe existing retraining-free pruning approaches for LLMs focus on unstructured\npruning, which requires specific hardware support for acceleration. In this\npaper, we propose a novel retraining-free structured pruning framework for\nLLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is\nhardware-friendly by effectively reducing storage and enhancing inference\nspeed. For effective structured pruning of LLMs, we highlight three critical\nelements that demand the utmost attention: formulating structured importance\nmetrics, adaptively searching the global compressed model, and implementing\ncompensation mechanisms to mitigate performance loss. First, FLAP determines\nwhether the output feature map is easily recoverable when a column of weight is\nremoved, based on the fluctuation pruning metric. Then it standardizes the\nimportance scores to adaptively determine the global compressed model\nstructure. At last, FLAP adds additional bias terms to recover the output\nfeature maps using the baseline values. We thoroughly evaluate our approach on\na variety of language benchmarks. Without any retraining, our method\nsignificantly outperforms the state-of-the-art methods, including LLM-Pruner\nand the extension of Wanda in structured pruning. The code is released at\nhttps://github.com/CASIA-IVA-Lab/FLAP.\n","authors":["Yongqi An","Xu Zhao","Tao Yu","Ming Tang","Jinqiao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11983v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2301.04312v5","updated":"2023-12-19T09:12:01Z","published":"2023-01-11T05:21:00Z","title":"Word-Graph2vec: An efficient word embedding approach on word\n co-occurrence graph using random walk sampling","summary":" Word embedding has become ubiquitous and is widely used in various text\nmining and natural language processing (NLP) tasks, such as information\nretrieval, semantic analysis, and machine translation, among many others.\nUnfortunately, it is prohibitively expensive to train the word embedding in a\nrelatively large corpus. We propose a graph-based word embedding algorithm,\ncalled Word-Graph2vec, which converts the large corpus into a word\nco-occurrence graph, then takes the word sequence samples from this graph by\nrandomly traveling and trains the word embedding on this sampling corpus in the\nend. We posit that because of the stable vocabulary, relative idioms, and fixed\nexpressions in English, the size and density of the word co-occurrence graph\nchange slightly with the increase in the training corpus. So that\nWord-Graph2vec has stable runtime on the large scale data set, and its\nperformance advantage becomes more and more obvious with the growth of the\ntraining corpus. Extensive experiments conducted on real-world datasets show\nthat the proposed algorithm outperforms traditional Skip-Gram by four-five\ntimes in terms of efficiency, while the error generated by the random walk\nsampling is small.\n","authors":["Wenting Li","Jiahong Xue","Xi Zhang","Huacan Chen","Zeyu Chen","Yuanzhe Cai"],"pdf_url":"https://arxiv.org/pdf/2301.04312v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11970v1","updated":"2023-12-19T09:06:45Z","published":"2023-12-19T09:06:45Z","title":"Large Language Models Empowered Agent-based Modeling and Simulation: A\n Survey and Perspectives","summary":" Agent-based modeling and simulation has evolved as a powerful tool for\nmodeling complex systems, offering insights into emergent behaviors and\ninteractions among diverse agents. Integrating large language models into\nagent-based modeling and simulation presents a promising avenue for enhancing\nsimulation capabilities. This paper surveys the landscape of utilizing large\nlanguage models in agent-based modeling and simulation, examining their\nchallenges and promising future directions. In this survey, since this is an\ninterdisciplinary field, we first introduce the background of agent-based\nmodeling and simulation and large language model-empowered agents. We then\ndiscuss the motivation for applying large language models to agent-based\nsimulation and systematically analyze the challenges in environment perception,\nhuman alignment, action generation, and evaluation. Most importantly, we\nprovide a comprehensive overview of the recent works of large language\nmodel-empowered agent-based modeling and simulation in multiple scenarios,\nwhich can be divided into four domains: cyber, physical, social, and hybrid,\ncovering simulation of both real-world and virtual environments. Finally, since\nthis area is new and quickly evolving, we discuss the open problems and\npromising future directions.\n","authors":["Chen Gao","Xiaochong Lan","Nian Li","Yuan Yuan","Jingtao Ding","Zhilun Zhou","Fengli Xu","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2312.11970v1.pdf","comment":"37 pages"},{"id":"http://arxiv.org/abs/2207.08012v5","updated":"2023-12-19T09:05:55Z","published":"2022-07-16T20:37:46Z","title":"Meta-Referential Games to Learn Compositional Learning Behaviours","summary":" Human beings use compositionality to generalise from past experiences to\nnovel experiences. We assume a separation of our experiences into fundamental\natomic components that can be recombined in novel ways to support our ability\nto engage with novel experiences. We frame this as the ability to learn to\ngeneralise compositionally, and we will refer to behaviours making use of this\nability as compositional learning behaviours (CLBs). A central problem to\nlearning CLBs is the resolution of a binding problem (BP). While it is another\nfeat of intelligence that human beings perform with ease, it is not the case\nfor state-of-the-art artificial agents. Thus, in order to build artificial\nagents able to collaborate with human beings, we propose to develop a novel\nbenchmark to investigate agents' abilities to exhibit CLBs by solving a\ndomain-agnostic version of the BP. We take inspiration from the language\nemergence and grounding framework of referential games and propose a\nmeta-learning extension of referential games, entitled Meta-Referential Games,\nand use this framework to build our benchmark, the Symbolic Behaviour Benchmark\n(S2B). We provide baseline results and error analysis showing that our\nbenchmark is a compelling challenge that we hope will spur the research\ncommunity towards developing more capable artificial agents.\n","authors":["Kevin Denamganaï","Sondess Missaoui","James Alfred Walker"],"pdf_url":"https://arxiv.org/pdf/2207.08012v5.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2312.04877v2","updated":"2023-12-19T08:58:19Z","published":"2023-12-08T07:27:26Z","title":"Generating Explanations to Understand and Repair Embedding-based Entity\n Alignment","summary":" Entity alignment (EA) seeks identical entities in different knowledge graphs,\nwhich is a long-standing task in the database research. Recent work leverages\ndeep learning to embed entities in vector space and align them via nearest\nneighbor search. Although embedding-based EA has gained marked success in\nrecent years, it lacks explanations for alignment decisions. In this paper, we\npresent the first framework that can generate explanations for understanding\nand repairing embedding-based EA results. Given an EA pair produced by an\nembedding model, we first compare its neighbor entities and relations to build\na matching subgraph as a local explanation. We then construct an alignment\ndependency graph to understand the pair from an abstract perspective. Finally,\nwe repair the pair by resolving three types of alignment conflicts based on\ndependency graphs. Experiments on a variety of EA datasets demonstrate the\neffectiveness, generalization, and robustness of our framework in explaining\nand repairing embedding-based EA results.\n","authors":["Xiaobin Tian","Zequn Sun","Wei Hu"],"pdf_url":"https://arxiv.org/pdf/2312.04877v2.pdf","comment":"Accepted in the 40th IEEE International Conference on Data\n Engineering (ICDE 2024)"},{"id":"http://arxiv.org/abs/2312.11947v1","updated":"2023-12-19T08:47:50Z","published":"2023-12-19T08:47:50Z","title":"Emotion Rendering for Conversational Speech Synthesis with Heterogeneous\n Graph-Based Context Modeling","summary":" Conversational Speech Synthesis (CSS) aims to accurately express an utterance\nwith the appropriate prosody and emotional inflection within a conversational\nsetting. While recognising the significance of CSS task, the prior studies have\nnot thoroughly investigated the emotional expressiveness problems due to the\nscarcity of emotional conversational datasets and the difficulty of stateful\nemotion modeling. In this paper, we propose a novel emotional CSS model, termed\nECSS, that includes two main components: 1) to enhance emotion understanding,\nwe introduce a heterogeneous graph-based emotional context modeling mechanism,\nwhich takes the multi-source dialogue history as input to model the dialogue\ncontext and learn the emotion cues from the context; 2) to achieve emotion\nrendering, we employ a contrastive learning-based emotion renderer module to\ninfer the accurate emotion style for the target utterance. To address the issue\nof data scarcity, we meticulously create emotional labels in terms of category\nand intensity, and annotate additional emotional information on the existing\nconversational dataset (DailyTalk). Both objective and subjective evaluations\nsuggest that our model outperforms the baseline models in understanding and\nrendering emotions. These evaluations also underscore the importance of\ncomprehensive emotional annotations. Code and audio samples can be found at:\nhttps://github.com/walker-hyf/ECSS.\n","authors":["Rui Liu","Yifan Hu","Yi Ren","Xiang Yin","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2312.11947v1.pdf","comment":"9 pages, 4 figures, Accepted by AAAI'2024, Code and audio samples:\n https://github.com/walker-hyf/ECSS"},{"id":"http://arxiv.org/abs/2312.11945v1","updated":"2023-12-19T08:43:02Z","published":"2023-12-19T08:43:02Z","title":"Multi-Granularity Information Interaction Framework for Incomplete\n Utterance Rewriting","summary":" Recent approaches in Incomplete Utterance Rewriting (IUR) fail to capture the\nsource of important words, which is crucial to edit the incomplete utterance,\nand introduce words from irrelevant utterances. We propose a novel and\neffective multi-task information interaction framework including context\nselection, edit matrix construction, and relevance merging to capture the\nmulti-granularity of semantic information. Benefiting from fetching the\nrelevant utterance and figuring out the important words, our approach\noutperforms existing state-of-the-art models on two benchmark datasets\nRestoration-200K and CANAND in this field. Code will be provided on\n\\url{https://github.com/yanmenxue/QR}.\n","authors":["Haowei Du","Dingyu Zhang","Chen Li","Yang Li","Dongyan Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.11945v1.pdf","comment":"Findings of EMNLP2023 (short)"},{"id":"http://arxiv.org/abs/2309.04766v2","updated":"2023-12-19T08:25:22Z","published":"2023-09-09T11:42:22Z","title":"SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment\n to Cultural Reasoning","summary":" We present SeaEval, a benchmark for multilingual foundation models. In\naddition to characterizing how these models understand and reason with natural\nlanguage, we also investigate how well they comprehend cultural practices,\nnuances, and values. Alongside standard accuracy metrics, we investigate the\nbrittleness of foundation models in the dimensions of semantics and\nmultilinguality. Our analyses span both open-sourced and closed models, leading\nto empirical results across classic NLP tasks, reasoning, and cultural\ncomprehension. Key findings indicate (1) Most models exhibit varied behavior\nwhen given paraphrased instructions. (2) Many models still suffer from exposure\nbias (e.g., positional bias, majority label bias). (3) For questions rooted in\nfactual, scientific, and commonsense knowledge, consistent responses are\nexpected across multilingual queries that are semantically equivalent. Yet,\nmost models surprisingly demonstrate inconsistent performance on these queries.\n(4) Multilingually-trained models have not attained \"balanced multilingual\"\ncapabilities. Our endeavors underscore the need for more generalizable semantic\nrepresentations and enhanced multilingual contextualization. SeaEval can serve\nas a launchpad for more thorough investigations and evaluations for\nmultilingual and multicultural scenarios.\n","authors":["Bin Wang","Zhengyuan Liu","Xin Huang","Fangkai Jiao","Yang Ding","Ai Ti Aw","Nancy F. Chen"],"pdf_url":"https://arxiv.org/pdf/2309.04766v2.pdf","comment":"20 pages. More datasets (2 on Cross-Lingual Consistency and 4 on\n Cultural Understanding) and more supported languages. Code:\n https://github.com/SeaEval/SeaEval"},{"id":"http://arxiv.org/abs/2307.10156v2","updated":"2023-12-19T08:02:03Z","published":"2023-07-19T17:37:03Z","title":"Exploring Transformer Extrapolation","summary":" Length extrapolation has attracted considerable attention recently since it\nallows transformers to be tested on longer sequences than those used in\ntraining. Previous research has shown that this property can be attained by\nusing carefully designed Relative Positional Encodings (RPEs). While these\nmethods perform well on a variety of corpora, the conditions for length\nextrapolation have yet to be investigated. This paper attempts to determine\nwhat types of RPEs allow for length extrapolation through a thorough\nmathematical and empirical analysis. We discover that a transformer is certain\nto possess this property as long as the series that corresponds to the RPE's\nexponential converges. Two practices are derived from the conditions and\nexamined in language modeling tasks on a variety of corpora. As a bonus from\nthe conditions, we derive a new Theoretical Receptive Field (TRF) to measure\nthe receptive field of RPEs without taking any training steps. Extensive\nexperiments are conducted on the Wikitext-103, Books, Github, and WikiBook\ndatasets to demonstrate the viability of our discovered conditions. We also\ncompare TRF to Empirical Receptive Field (ERF) across different models, showing\nconsistently matched trends on the aforementioned datasets. The code is\navailable at https://github.com/OpenNLPLab/Rpe.\n","authors":["Zhen Qin","Yiran Zhong","Hui Deng"],"pdf_url":"https://arxiv.org/pdf/2307.10156v2.pdf","comment":"AAAI Camera Ready. Zhen Qin and Yiran Zhong contribute equally to\n this paper; Yiran Zhong is the corresponding author. The code is available at\n https://github.com/OpenNLPLab/Rpe"},{"id":"http://arxiv.org/abs/2312.11922v1","updated":"2023-12-19T08:01:48Z","published":"2023-12-19T08:01:48Z","title":"Relation-Aware Question Answering for Heterogeneous Knowledge Graphs","summary":" Multi-hop Knowledge Base Question Answering(KBQA) aims to find the answer\nentity in a knowledge graph (KG), which requires multiple steps of reasoning.\nExisting retrieval-based approaches solve this task by concentrating on the\nspecific relation at different hops and predicting the intermediate entity\nwithin the reasoning path. During the reasoning process of these methods, the\nrepresentation of relations are fixed but the initial relation representation\nmay not be optimal. We claim they fail to utilize information from head-tail\nentities and the semantic connection between relations to enhance the current\nrelation representation, which undermines the ability to capture information of\nrelations in KGs. To address this issue, we construct a \\textbf{dual relation\ngraph} where each node denotes a relation in the original KG (\\textbf{primal\nentity graph}) and edges are constructed between relations sharing same head or\ntail entities. Then we iteratively do primal entity graph reasoning, dual\nrelation graph information propagation, and interaction between these two\ngraphs. In this way, the interaction between entity and relation is enhanced,\nand we derive better entity and relation representations. Experiments on two\npublic datasets, WebQSP and CWQ, show that our approach achieves a significant\nperformance gain over the prior state-of-the-art. Our code is available on\n\\url{https://github.com/yanmenxue/RAH-KBQA}.\n","authors":["Haowei Du","Quzhe Huang","Chen Li","Chen Zhang","Yang Li","Dongyan Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.11922v1.pdf","comment":"Findings of EMNLP2023 (Long)"},{"id":"http://arxiv.org/abs/2312.11920v1","updated":"2023-12-19T08:00:10Z","published":"2023-12-19T08:00:10Z","title":"External Knowledge Augmented Polyphone Disambiguation Using Large\n Language Model","summary":" One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is\npolyphone disambiguation when doing grapheme-to-phoneme (G2P) conversion. In\nthis paper, we introduce a novel method to solve the problem as a generation\ntask. Following the trending research of large language models (LLM) and prompt\nlearning, the proposed method consists of three modules. Retrieval module\nincorporates external knowledge which is a multi-level semantic dictionary of\nChinese polyphonic characters to format the sentence into a prompt. Generation\nmodule adopts the decoder-only Transformer architecture to induce the target\ntext. Postprocess module corrects the generated text into a valid result if\nneeded. Experimental results show that our method outperforms the existing\nmethods on a public dataset called CPP. We also empirically study the impacts\nof different templates of the prompt, different sizes of training data, and\nwhether to incorporate external knowledge.\n","authors":["Chen Li"],"pdf_url":"https://arxiv.org/pdf/2312.11920v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11075v2","updated":"2023-12-19T07:47:09Z","published":"2023-12-18T10:16:37Z","title":"Split and Rephrase with Large Language Models","summary":" The Split and Rephrase task, which consists in splitting complex sentences\ninto a sequence of shorter grammatical sentences, while preserving the original\nmeaning, can facilitate the processing of complex texts for humans and machines\nalike. In this work, we describe an approach based on large language models,\nwhich improves over the state of the art by large margins on all the major\nmetrics for the task, on publicly available datasets. We also describe results\nfrom two human evaluations that further establish the significant improvements\nobtained with large language models and the viability of the approach. We\nevaluate different strategies, including fine-tuning pretrained language models\nof varying parameter size, and applying both zero-shot and few-shot in-context\nlearning on instruction-tuned language models. Although the latter were\nmarkedly outperformed by fine-tuned models, they still achieved promising\nresults overall. Our results thus demonstrate the strong potential of different\nvariants of large language models for the Split and Rephrase task, using\nrelatively small amounts of training samples and model parameters overall.\n","authors":["David Ponce","Thierry Etchegoyhen","Jesús Calleja Pérez","Harritxu Gete"],"pdf_url":"https://arxiv.org/pdf/2312.11075v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16583v5","updated":"2023-12-19T07:41:50Z","published":"2023-09-28T16:43:35Z","title":"GPT-Fathom: Benchmarking Large Language Models to Decipher the\n Evolutionary Path towards GPT-4 and Beyond","summary":" With the rapid advancement of large language models (LLMs), there is a\npressing need for a comprehensive evaluation suite to assess their capabilities\nand limitations. Existing LLM leaderboards often reference scores reported in\nother papers without consistent settings and prompts, which may inadvertently\nencourage cherry-picking favored settings and prompts for better results. In\nthis work, we introduce GPT-Fathom, an open-source and reproducible LLM\nevaluation suite built on top of OpenAI Evals. We systematically evaluate 10+\nleading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across\n7 capability categories, all under aligned settings. Our retrospective study on\nOpenAI's earlier models offers valuable insights into the evolutionary path\nfrom GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3\nprogressively improves to GPT-4, including technical details like whether\nadding code data improves LLM's reasoning capability, which aspects of LLM\ncapability can be improved by SFT and RLHF, how much is the alignment tax, etc.\nOur analysis sheds light on many of these questions, aiming to improve the\ntransparency of advanced LLMs.\n","authors":["Shen Zheng","Yuyu Zhang","Yijie Zhu","Chenguang Xi","Pengyang Gao","Xun Zhou","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2309.16583v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.11608v2","updated":"2023-12-19T07:18:24Z","published":"2023-11-20T08:51:30Z","title":"Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse\n Biomedical Tasks","summary":" Objective: Most existing fine-tuned biomedical large language models (LLMs)\nfocus on enhancing performance in monolingual biomedical question answering and\nconversation tasks. To investigate the effectiveness of the fine-tuned LLMs on\ndiverse biomedical NLP tasks in different languages, We present Taiyi, a\nbilingual fine-tuned LLM for diverse biomedical tasks. Materials and Methods:\nWe first curated a comprehensive collection of 140 existing biomedical text\nmining datasets (102 English and 38 Chinese datasets) across over 10 task\ntypes. Subsequently, a two-stage strategy is proposed for supervised\nfine-tuning to optimize the model performance across varied tasks. Results:\nExperimental results on 13 test sets covering named entity recognition,\nrelation extraction, text classification, question answering tasks demonstrate\nthat Taiyi achieves superior performance compared to general LLMs. The case\nstudy involving additional biomedical NLP tasks further shows Taiyi's\nconsiderable potential for bilingual biomedical multi-tasking. Conclusion:\nLeveraging rich high-quality biomedical corpora and developing effective\nfine-tuning strategies can significantly improve the performance of LLMs within\nthe biomedical domain. Taiyi shows the bilingual multi-tasking capability\nthrough supervised fine-tuning. However, those tasks such as information\nextraction that are not generation tasks in nature remain challenging for\nLLM-based generative approaches, and they still underperform the conventional\ndiscriminative approaches of smaller language models.\n","authors":["Ling Luo","Jinzhong Ning","Yingwen Zhao","Zhijun Wang","Zeyuan Ding","Peng Chen","Weiru Fu","Qinyu Han","Guangtao Xu","Yunzhi Qiu","Dinghao Pan","Jiru Li","Hao Li","Wenduo Feng","Senbo Tu","Yuqi Liu","Zhihao Yang","Jian Wang","Yuanyuan Sun","Hongfei Lin"],"pdf_url":"https://arxiv.org/pdf/2311.11608v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11895v1","updated":"2023-12-19T06:39:38Z","published":"2023-12-19T06:39:38Z","title":"Analyzing Public Reactions, Perceptions, and Attitudes during the MPox\n Outbreak: Findings from Topic Modeling of Tweets","summary":" The recent outbreak of the MPox virus has resulted in a tremendous increase\nin the usage of Twitter. Prior works in this area of research have primarily\nfocused on the sentiment analysis and content analysis of these Tweets, and the\nfew works that have focused on topic modeling have multiple limitations. This\npaper aims to address this research gap and makes two scientific contributions\nto this field. First, it presents the results of performing Topic Modeling on\n601,432 Tweets about the 2022 Mpox outbreak that were posted on Twitter between\n7 May 2022 and 3 March 2023. The results indicate that the conversations on\nTwitter related to Mpox during this time range may be broadly categorized into\nfour distinct themes - Views and Perspectives about Mpox, Updates on Cases and\nInvestigations about Mpox, Mpox and the LGBTQIA+ Community, and Mpox and\nCOVID-19. Second, the paper presents the findings from the analysis of these\nTweets. The results show that the theme that was most popular on Twitter (in\nterms of the number of Tweets posted) during this time range was Views and\nPerspectives about Mpox. This was followed by the theme of Mpox and the\nLGBTQIA+ Community, which was followed by the themes of Mpox and COVID-19 and\nUpdates on Cases and Investigations about Mpox, respectively. Finally, a\ncomparison with related studies in this area of research is also presented to\nhighlight the novelty and significance of this research work.\n","authors":["Nirmalya Thakur","Yuvraj Nihal Duggal","Zihui Liu"],"pdf_url":"https://arxiv.org/pdf/2312.11895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07490v4","updated":"2023-12-19T06:27:45Z","published":"2023-05-12T14:04:30Z","title":"ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models\n with Enhanced Adapter","summary":" In recent years, advancements in large language models have been remarkable,\nwith models such as ChatGPT demonstrating exceptional proficiency in diverse\nlinguistic tasks. The pre-training of large models with billions of parameters,\nposes a formidable challenge, primarily due to the scarcity of datasets of a\ncommensurate scale for effective training. Nevertheless, innovative strategies\nhave emerged, including methods to fine-tune these pre-trained models using\nfewer parameters set, as evidenced by models like MiniGPT-4 and LLaVA. Despite\ntheir potential in various domains, these models remain limited in their\nunderstanding of artistic imagery. They have yet to fully grasp the intricate\nnuances of art images or to provide an objective articulation of the emotions\nthey evoke, in a manner akin to human perception. This work introduces\nArtGPT-4, a pioneering large vision-language model tailored to address the\ndeficiencies of contemporary models in artistic comprehension. ArtGPT-4\nunderwent training on image-text pairs utilizing a Tesla A100 device in a mere\n2 hours, with a dataset comprising approximately 0.52M entries. Impressively,\nthe model can render images with an artistic-understanding and convey the\nemotions they inspire, mirroring human interpretation. Additionally, this work\npresents a unique dataset designed to evaluate the efficacy of vision-language\nmodels. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art\nperformance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the\nestablished benchmarks introduced in This study, lagging behind professional\nartists' descriptions by a negligible 0.15 points on a 6-point scale. The code\nand the pre-trained model are accessible in\nhttps://huggingface.co/Tyrannosaurus/ArtGPT-4.\n","authors":["Zhengqing Yuan","Xinyi Wang","Kun Wang","Lichao Sun","Yanfang Ye"],"pdf_url":"https://arxiv.org/pdf/2305.07490v4.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2312.11890v1","updated":"2023-12-19T06:26:25Z","published":"2023-12-19T06:26:25Z","title":"Difficulty-Focused Contrastive Learning for Knowledge Tracing with a\n Large Language Model-Based Difficulty Prediction","summary":" This paper presents novel techniques for enhancing the performance of\nknowledge tracing (KT) models by focusing on the crucial factor of question and\nconcept difficulty level. Despite the acknowledged significance of difficulty,\nprevious KT research has yet to exploit its potential for model optimization\nand has struggled to predict difficulty from unseen data. To address these\nproblems, we propose a difficulty-centered contrastive learning method for KT\nmodels and a Large Language Model (LLM)-based framework for difficulty\nprediction. These innovative methods seek to improve the performance of KT\nmodels and provide accurate difficulty estimates for unseen data. Our ablation\nstudy demonstrates the efficacy of these techniques by demonstrating enhanced\nKT model performance. Nonetheless, the complex relationship between language\nand difficulty merits further investigation.\n","authors":["Unggi Lee","Sungjun Yoon","Joon Seo Yun","Kyoungsoo Park","YoungHoon Jung","Damji Stratton","Hyeoncheol Kim"],"pdf_url":"https://arxiv.org/pdf/2312.11890v1.pdf","comment":"10 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.11882v1","updated":"2023-12-19T06:16:13Z","published":"2023-12-19T06:16:13Z","title":"ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for\n Accelerating Language Models Inference","summary":" Early Exiting is one of the most popular methods to achieve efficient\ninference. Current early exiting methods adopt the (weighted) sum of the cross\nentropy loss of all internal classifiers during training, imposing all these\nclassifiers to predict all instances correctly. However, during inference, as\nlong as one internal classifier predicts an instance correctly, it can\naccelerate without losing accuracy. Thus, there is a notable gap between\ntraining and inference. We propose ConsistentEE, an early exiting method that\nis consistent in training and inference. ConsistentEE formulates the early\nexiting process as a reinforcement learning problem. A policy network is added\nto decide whether an instance should exit or continue. The training objective\nof ConsistentEE only require each instance to be predicted correctly by one\ninternal classifier. Additionally, we introduce the concept Memorize Layer to\nmeasure the hardness of an instance. We incorporate memorized layer into reward\nfunction design, which allows ``easy'' instances to focus more on acceleration\nwhile ``hard'' instances to focus more on accuracy. Experimental results show\nthat our method outperforms other baselines on various natural language\nunderstanding and generation tasks.\n","authors":["Ziqian Zeng","Yihuai Hong","Hongliang Dai","Huiping Zhuang","Cen Chen"],"pdf_url":"https://arxiv.org/pdf/2312.11882v1.pdf","comment":"Accepted in AAAI24"},{"id":"http://arxiv.org/abs/2312.11881v1","updated":"2023-12-19T06:15:52Z","published":"2023-12-19T06:15:52Z","title":"Punctuation restoration Model and Spacing Model for Korean Ancient\n Document","summary":" In Korean ancient documents, there is no spacing or punctuation, and they are\nwritten in classical Chinese characters. This makes it challenging for modern\nindividuals and translation models to accurately interpret and translate them.\nWhile China has models predicting punctuation and spacing, applying them\ndirectly to Korean texts is problematic due to data differences. Therefore, we\ndeveloped the first models which predict punctuation and spacing for Korean\nhistorical texts and evaluated their performance. Our punctuation restoration\nmodel achieved an F1 score of 0.84, and Spacing model achieved a score of 0.96.\nIt has the advantage of enabling inference on low-performance GPUs with less\nVRAM while maintaining quite high accuracy.\n","authors":["Taehong Jang","Joonmo Ahn","Sojung Lucia Kim"],"pdf_url":"https://arxiv.org/pdf/2312.11881v1.pdf","comment":"5 Pages, 2 Figures"},{"id":"http://arxiv.org/abs/2312.11875v1","updated":"2023-12-19T06:06:30Z","published":"2023-12-19T06:06:30Z","title":"Sparse is Enough in Fine-tuning Pre-trained Large Language Model","summary":" With the prevalence of pre-training-fine-tuning paradigm, how to efficiently\nadapt the pre-trained model to the downstream tasks has been an intriguing\nissue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for\nlow-cost adaptation, including Adapters, Bia-only, and the recently widely used\nLow-Rank Adaptation. Although these methods have demonstrated their\neffectiveness to some extent and have been widely applied, the underlying\nprinciples are still unclear. In this paper, we reveal the transition of loss\nlandscape in the downstream domain from random initialization to pre-trained\ninitialization, that is, from low-amplitude oscillation to high-amplitude\noscillation. The parameter gradients exhibit a property akin to sparsity, where\na small fraction of components dominate the total gradient norm, for instance,\n1% of the components account for 99% of the gradient. This property ensures\nthat the pre-trained model can easily find a flat minimizer which guarantees\nthe model's ability to generalize even with a low number of trainable\nparameters. Based on this, we propose a gradient-based sparse fine-tuning\nalgorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its\neffectiveness on a range of tasks including the GLUE Benchmark and\nInstruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.\n","authors":["Weixi Song","Zuchao Li","Lefei Zhang","Hai Zhao","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2312.11875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11870v1","updated":"2023-12-19T05:46:11Z","published":"2023-12-19T05:46:11Z","title":"A Revisit of Fake News Dataset with Augmented Fact-checking by ChatGPT","summary":" The proliferation of fake news has emerged as a critical issue in recent\nyears, requiring significant efforts to detect it. However, the existing fake\nnews detection datasets are sourced from human journalists, which are likely to\nhave inherent bias limitations due to the highly subjective nature of this\ntask. In this paper, we revisit the existing fake news dataset verified by\nhuman journalists with augmented fact-checking by large language models\n(ChatGPT), and we name the augmented fake news dataset ChatGPT-FC. We\nquantitatively analyze the distinctions and resemblances between human\njournalists and LLM in assessing news subject credibility, news creator\ncredibility, time-sensitive, and political framing. Our findings highlight\nLLM's potential to serve as a preliminary screening method, offering a\npromising avenue to mitigate the inherent biases of human journalists and\nenhance fake news detection.\n","authors":["Zizhong Li","Haopeng Zhang","Jiawei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.11870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11276v2","updated":"2023-12-19T05:29:28Z","published":"2023-12-18T15:18:57Z","title":"Compositional Generalization for Multi-label Text Classification: A\n Data-Augmentation Approach","summary":" Despite significant advancements in multi-label text classification, the\nability of existing models to generalize to novel and seldom-encountered\ncomplex concepts, which are compositions of elementary ones, remains\nunderexplored. This research addresses this gap. By creating unique data splits\nacross three benchmarks, we assess the compositional generalization ability of\nexisting multi-label text classification models. Our results show that these\nmodels often fail to generalize to compositional concepts encountered\ninfrequently during training, leading to inferior performance on tests with\nthese new combinations. To address this, we introduce a data augmentation\nmethod that leverages two innovative text generation models designed to enhance\nthe classification models' capacity for compositional generalization. Our\nexperiments show that this data augmentation approach significantly improves\nthe compositional generalization capabilities of classification models on our\nbenchmarks, with both generation models surpassing other text generation\nbaselines.\n","authors":["Yuyang Chai","Zhuang Li","Jiahui Liu","Lei Chen","Fei Li","Donghong Ji","Chong Teng"],"pdf_url":"https://arxiv.org/pdf/2312.11276v2.pdf","comment":"Accepted by AAAI'24"},{"id":"http://arxiv.org/abs/2312.10793v2","updated":"2023-12-19T04:52:40Z","published":"2023-12-17T18:44:26Z","title":"Understanding the Instruction Mixture for Large Language Model\n Fine-tuning","summary":" While instructions fine-tuning of large language models (LLMs) has been\nproven to enhance performance across various applications, the influence of the\ninstruction dataset mixture on LLMs has not been thoroughly explored. In this\nstudy, we classify instructions into three main types: NLP downstream tasks,\ncoding, and general chatting, and investigate their impact on LLMs. Our\nfindings reveal that specific types of instructions are more beneficial for\nparticular uses, while it may cause harms to other aspects, emphasizing the\nimportance of meticulously designing the instruction mixture to maximize model\nperformance. This study sheds light on the instruction mixture and paves the\nway for future research.\n","authors":["Renxi Wang","Minghao Wu","Yuxia Wang","Xudong Han","Chiyu Zhang","Haonan Li"],"pdf_url":"https://arxiv.org/pdf/2312.10793v2.pdf","comment":"Instruction Tuning, Large Language Model, Alignment"},{"id":"http://arxiv.org/abs/2312.11852v1","updated":"2023-12-19T04:42:56Z","published":"2023-12-19T04:42:56Z","title":"Predicting Human Translation Difficulty with Neural Machine Translation","summary":" Human translators linger on some words and phrases more than others, and\npredicting this variation is a step towards explaining the underlying cognitive\nprocesses. Using data from the CRITT Translation Process Research Database, we\nevaluate the extent to which surprisal and attentional features derived from a\nNeural Machine Translation (NMT) model account for reading and production times\nof human translators. We find that surprisal and attention are complementary\npredictors of translation difficulty, and that surprisal derived from a NMT\nmodel is the single most successful predictor of production duration. Our\nanalyses draw on data from hundreds of translators operating across 13 language\npairs, and represent the most comprehensive investigation of human translation\ndifficulty to date.\n","authors":["Zheng Wei Lim","Ekaterina Vylomova","Charles Kemp","Trevor Cohn"],"pdf_url":"https://arxiv.org/pdf/2312.11852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11111v2","updated":"2023-12-19T04:27:47Z","published":"2023-12-18T11:19:45Z","title":"The Good, The Bad, and Why: Unveiling Emotions in Generative AI","summary":" Emotion significantly impacts our daily behaviors and interactions. While\nrecent generative AI models, such as large language models, have shown\nimpressive performance in various tasks, it remains unclear whether they truly\ncomprehend emotions. This paper aims to address this gap by incorporating\npsychological theories to gain a holistic understanding of emotions in\ngenerative AI models. Specifically, we propose three approaches: 1)\nEmotionPrompt to enhance AI model performance, 2) EmotionAttack to impair AI\nmodel performance, and 3) EmotionDecode to explain the effects of emotional\nstimuli, both benign and malignant. Through extensive experiments involving\nlanguage and multi-modal models on semantic understanding, logical reasoning,\nand generation tasks, we demonstrate that both textual and visual EmotionPrompt\ncan boost the performance of AI models while EmotionAttack can hinder it.\nAdditionally, EmotionDecode reveals that AI models can comprehend emotional\nstimuli akin to the mechanism of dopamine in the human brain. Our work heralds\na novel avenue for exploring psychology to enhance our understanding of\ngenerative AI models. This paper is an extended version of our previous work\nEmotionPrompt (arXiv:2307.11760).\n","authors":["Cheng Li","Jindong Wang","Yixuan Zhang","Kaijie Zhu","Xinyi Wang","Wenxin Hou","Jianxun Lian","Fang Luo","Qiang Yang","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2312.11111v2.pdf","comment":"Technical report; an extension to EmotionPrompt (arXiv:2307.11760);\n 34 pages"},{"id":"http://arxiv.org/abs/2312.10302v2","updated":"2023-12-19T03:48:21Z","published":"2023-12-16T03:33:12Z","title":"One Shot Learning as Instruction Data Prospector for Large Language\n Models","summary":" Aligning large language models(LLMs) with human is a critical step in\neffectively utilizing their pre-trained capabilities across a wide array of\nlanguage tasks. Current instruction tuning practices often rely on expanding\ndataset size without a clear strategy for ensuring data quality, which can\ninadvertently introduce noise and degrade model performance. To address this\nchallenge, we introduce Nuggets, a novel and efficient methodology that employs\none shot learning to select high-quality instruction data from expansive\ndatasets. Nuggets assesses the potential of individual instruction examples to\nact as effective one shot examples, thereby identifying those that can\nsignificantly enhance diverse task performance. Nuggets utilizes a scoring\nsystem based on the impact of candidate examples on the perplexity of a diverse\nanchor set, facilitating the selection of the most beneficial data for\ninstruction tuning. Through rigorous testing on two benchmarks, including\nMT-Bench and Alpaca-Eval, we demonstrate that instruction tuning with the top\n1% of Nuggets-curated examples substantially outperforms conventional methods\nthat use the full dataset. These findings advocate for a data selection\nparadigm that prioritizes quality, offering a more efficient pathway to align\nLLMs with humans.\n","authors":["Yunshui Li","Binyuan Hui","Xiaobo Xia","Jiaxi Yang","Min Yang","Lei Zhang","Shuzheng Si","Junhao Liu","Tongliang Liu","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2312.10302v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07594v2","updated":"2023-12-19T03:44:25Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: A Comprehensive Survey on\n Multimodal Large Language Model","summary":" This review paper explores Multimodal Large Language Models (MLLMs), which\nintegrate Large Language Models (LLMs) like GPT-4 to handle multimodal data\nsuch as text and vision. MLLMs demonstrate capabilities like generating image\nnarratives and answering image-based questions, bridging the gap towards\nreal-world human-computer interactions and hinting at a potential pathway to\nartificial general intelligence. However, MLLMs still face challenges in\nprocessing the semantic gap in multimodality, which may lead to erroneous\ngeneration, posing potential risks to society. Choosing the appropriate\nmodality alignment method is crucial, as improper methods might require more\nparameters with limited performance improvement. This paper aims to explore\nmodality alignment methods for LLMs and their existing capabilities.\nImplementing modality alignment allows LLMs to address environmental issues and\nenhance accessibility. The study surveys existing modal alignment methods in\nMLLMs into four groups: (1) Multimodal Converters that change data into\nsomething LLMs can understand; (2) Multimodal Perceivers to improve how LLMs\nperceive different types of data; (3) Tools Assistance for changing data into\none common format, usually text; and (4) Data-Driven methods that teach LLMs to\nunderstand specific types of data in a dataset. This field is still in a phase\nof exploration and experimentation, and we will organize and update various\nexisting research methods for multimodal information alignment.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11828v1","updated":"2023-12-19T03:39:23Z","published":"2023-12-19T03:39:23Z","title":"TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with\n Decentralized Natural Language Understanding Models","summary":" Chatbots have become one of the main pathways for the delivery of business\nautomation tools. Multi-agent systems offer a framework for designing chatbots\nat scale, making it easier to support complex conversations that span across\nmultiple domains as well as enabling developers to maintain and expand their\ncapabilities incrementally over time. However, multi-agent systems complicate\nthe natural language understanding (NLU) of user intents, especially when they\nrely on decentralized NLU models: some utterances (termed single intent) may\ninvoke a single agent while others (termed multi-intent) may explicitly invoke\nmultiple agents. Without correctly parsing multi-intent inputs, decentralized\nNLU approaches will not achieve high prediction accuracy. In this paper, we\npropose an efficient parsing and orchestration pipeline algorithm to service\nmulti-intent utterances from the user in the context of a multi-agent system.\nOur proposed approach achieved comparable performance to competitive deep\nlearning models on three different datasets while being up to 48 times faster.\n","authors":["Burak Aksar","Yara Rizk","Tathagata Chakraborti"],"pdf_url":"https://arxiv.org/pdf/2312.11828v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2208.11790v2","updated":"2023-12-19T03:32:26Z","published":"2022-08-24T22:44:09Z","title":"Addressing Token Uniformity in Transformers via Singular Value\n Transformation","summary":" Token uniformity is commonly observed in transformer-based models, in which\ndifferent tokens share a large proportion of similar information after going\nthrough stacked multiple self-attention layers in a transformer. In this paper,\nwe propose to use the distribution of singular values of outputs of each\ntransformer layer to characterise the phenomenon of token uniformity and\nempirically illustrate that a less skewed singular value distribution can\nalleviate the `token uniformity' problem. Base on our observations, we define\nseveral desirable properties of singular value distributions and propose a\nnovel transformation function for updating the singular values. We show that\napart from alleviating token uniformity, the transformation function should\npreserve the local neighbourhood structure in the original embedding space. Our\nproposed singular value transformation function is applied to a range of\ntransformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT,\nand improved performance is observed in semantic textual similarity evaluation\nand a range of GLUE tasks. Our source code is available at\nhttps://github.com/hanqi-qi/tokenUni.git.\n","authors":["Hanqi Yan","Lin Gui","Wenjie Li","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2208.11790v2.pdf","comment":"UAI2022 Main Conference, Spotlight, combined with supplementary files"},{"id":"http://arxiv.org/abs/2312.11819v1","updated":"2023-12-19T03:24:55Z","published":"2023-12-19T03:24:55Z","title":"An Adaptive Placement and Parallelism Framework for Accelerating RLHF\n Training","summary":" Recently, ChatGPT or InstructGPT like large language models (LLM) has made a\nsignificant impact in the AI world. These models are incredibly versatile,\ncapable of performing language tasks on par or even exceeding the capabilities\nof human experts. Many works have attempted to reproduce the complex\nInstructGPT's RLHF (Reinforcement Learning with Human Feedback) training\npipeline. However, the mainstream distributed RLHF training methods typically\nadopt a fixed model placement strategy, referred to as the Flattening strategy.\nThis strategy treats all four models involved in RLHF as a single entity and\nplaces them on all devices, regardless of their differences. Unfortunately,\nthis strategy exacerbates the generation bottlenecks in the RLHF training and\ndegrades the overall training efficiency. To address these issues, we propose\nan adaptive model placement framework that offers two flexible model placement\nstrategies. These strategies allow for the agile allocation of models across\ndevices in a fine-grained manner. The Interleaving strategy helps reduce memory\nredundancy and communication costs during RLHF training. On the other hand, the\nSeparation strategy improves the throughput of model training by separating the\ntraining and generation stages of the RLHF pipeline. Notably, this framework\nseamlessly integrates with other mainstream techniques for acceleration and\nenables automatic hyperparameter search. Extensive experiments have\ndemonstrated that our Interleaving and Separation strategies can achieve\nnotable improvements up to 11x, compared to the current state-of-the-art (SOTA)\napproaches. These experiments encompassed a wide range of training scenarios,\ninvolving models of varying sizes and devices of different scales. The results\nhighlight the effectiveness and superiority of our approaches in accelerating\nthe training of distributed RLHF.\n","authors":["Youshao Xiao","Weichang Wu","Zhenglei Zhou","Fagui Mao","Shangchun Zhao","Lin Ju","Lei Liang","Xiaolu Zhang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.11819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.03518v3","updated":"2023-12-19T02:57:31Z","published":"2021-06-07T11:14:58Z","title":"Position Bias Mitigation: A Knowledge-Aware Graph Model for Emotion\n Cause Extraction","summary":" The Emotion Cause Extraction (ECE)} task aims to identify clauses which\ncontain emotion-evoking information for a particular emotion expressed in text.\nWe observe that a widely-used ECE dataset exhibits a bias that the majority of\nannotated cause clauses are either directly before their associated emotion\nclauses or are the emotion clauses themselves. Existing models for ECE tend to\nexplore such relative position information and suffer from the dataset bias. To\ninvestigate the degree of reliance of existing ECE models on clause relative\npositions, we propose a novel strategy to generate adversarial examples in\nwhich the relative position information is no longer the indicative feature of\ncause clauses. We test the performance of existing models on such adversarial\nexamples and observe a significant performance drop. To address the dataset\nbias, we propose a novel graph-based method to explicitly model the emotion\ntriggering paths by leveraging the commonsense knowledge to enhance the\nsemantic dependencies between a candidate clause and an emotion clause.\nExperimental results show that our proposed approach performs on par with the\nexisting state-of-the-art methods on the original ECE dataset, and is more\nrobust against adversarial attacks compared to existing models.\n","authors":["Hanqi Yan","Lin Gui","Gabriele Pergola","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2106.03518v3.pdf","comment":"ACL2021 Main Conference, Oral paper"},{"id":"http://arxiv.org/abs/2312.11805v1","updated":"2023-12-19T02:39:27Z","published":"2023-12-19T02:39:27Z","title":"Gemini: A Family of Highly Capable Multimodal Models","summary":" This report introduces a new family of multimodal models, Gemini, that\nexhibit remarkable capabilities across image, audio, video, and text\nunderstanding. The Gemini family consists of Ultra, Pro, and Nano sizes,\nsuitable for applications ranging from complex reasoning tasks to on-device\nmemory-constrained use-cases. Evaluation on a broad range of benchmarks shows\nthat our most-capable Gemini Ultra model advances the state of the art in 30 of\n32 of these benchmarks - notably being the first model to achieve human-expert\nperformance on the well-studied exam benchmark MMLU, and improving the state of\nthe art in every one of the 20 multimodal benchmarks we examined. We believe\nthat the new capabilities of Gemini models in cross-modal reasoning and\nlanguage understanding will enable a wide variety of use cases and we discuss\nour approach toward deploying them responsibly to users.\n","authors":[" Gemini Team","Rohan Anil","Sebastian Borgeaud","Yonghui Wu","Jean-Baptiste Alayrac","Jiahui Yu","Radu Soricut","Johan Schalkwyk","Andrew M. Dai","Anja Hauth","Katie Millican","David Silver","Slav Petrov","Melvin Johnson","Ioannis Antonoglou","Julian Schrittwieser","Amelia Glaese","Jilin Chen","Emily Pitler","Timothy Lillicrap","Angeliki Lazaridou","Orhan Firat","James Molloy","Michael Isard","Paul R. Barham","Tom Hennigan","Benjamin Lee","Fabio Viola","Malcolm Reynolds","Yuanzhong Xu","Ryan Doherty","Eli Collins","Clemens Meyer","Eliza Rutherford","Erica Moreira","Kareem Ayoub","Megha Goel","George Tucker","Enrique Piqueras","Maxim Krikun","Iain Barr","Nikolay Savinov","Ivo Danihelka","Becca Roelofs","Anaïs White","Anders Andreassen","Tamara von Glehn","Lakshman Yagati","Mehran Kazemi","Lucas Gonzalez","Misha Khalman","Jakub Sygnowski","Alexandre Frechette","Charlotte Smith","Laura Culp","Lev Proleev","Yi Luan","Xi Chen","James Lottes","Nathan Schucher","Federico Lebron","Alban Rrustemi","Natalie Clay","Phil Crone","Tomas Kocisky","Jeffrey Zhao","Bartek Perz","Dian Yu","Heidi Howard","Adam Bloniarz","Jack W. Rae","Han Lu","Laurent Sifre","Marcello Maggioni","Fred Alcober","Dan Garrette","Megan Barnes","Shantanu Thakoor","Jacob Austin","Gabriel Barth-Maron","William Wong","Rishabh Joshi","Rahma Chaabouni","Deeni Fatiha","Arun Ahuja","Ruibo Liu","Yunxuan Li","Sarah Cogan","Jeremy Chen","Chao Jia","Chenjie Gu","Qiao Zhang","Jordan Grimstad","Ale Jakse Hartman","Martin Chadwick","Gaurav Singh Tomar","Xavier Garcia","Evan Senter","Emanuel Taropa","Thanumalayan Sankaranarayana Pillai","Jacob Devlin","Michael Laskin","Diego de Las Casas","Dasha Valter","Connie Tao","Lorenzo Blanco","Adrià Puigdomènech Badia","David Reitter","Mianna Chen","Jenny Brennan","Clara Rivera","Sergey Brin","Shariq Iqbal","Gabriela Surita","Jane Labanowski","Abhi Rao","Stephanie Winkler","Emilio Parisotto","Yiming Gu","Kate Olszewska","Yujing Zhang","Ravi Addanki","Antoine Miech","Annie Louis","Laurent El Shafey","Denis Teplyashin","Geoff Brown","Elliot Catt","Nithya Attaluri","Jan Balaguer","Jackie Xiang","Pidong Wang","Zoe Ashwood","Anton Briukhov","Albert Webson","Sanjay Ganapathy","Smit Sanghavi","Ajay Kannan","Ming-Wei Chang","Axel Stjerngren","Josip Djolonga","Yuting Sun","Ankur Bapna","Matthew Aitchison","Pedram Pejman","Henryk Michalewski","Tianhe Yu","Cindy Wang","Juliette Love","Junwhan Ahn","Dawn Bloxwich","Kehang Han","Peter Humphreys","Thibault Sellam","James Bradbury","Varun Godbole","Sina Samangooei","Bogdan Damoc","Alex Kaskasoli","Sébastien M. R. Arnold","Vijay Vasudevan","Shubham Agrawal","Jason Riesa","Dmitry Lepikhin","Richard Tanburn","Srivatsan Srinivasan","Hyeontaek Lim","Sarah Hodkinson","Pranav Shyam","Johan Ferret","Steven Hand","Ankush Garg","Tom Le Paine","Jian Li","Yujia Li","Minh Giang","Alexander Neitz","Zaheer Abbas","Sarah York","Machel Reid","Elizabeth Cole","Aakanksha Chowdhery","Dipanjan Das","Dominika Rogozińska","Vitaly Nikolaev","Pablo Sprechmann","Zachary Nado","Lukas Zilka","Flavien Prost","Luheng He","Marianne Monteiro","Gaurav Mishra","Chris Welty","Josh Newlan","Dawei Jia","Miltiadis Allamanis","Clara Huiyi Hu","Raoul de Liedekerke","Justin Gilmer","Carl Saroufim","Shruti Rijhwani","Shaobo Hou","Disha Shrivastava","Anirudh Baddepudi","Alex Goldin","Adnan Ozturel","Albin Cassirer","Yunhan Xu","Daniel Sohn","Devendra Sachan","Reinald Kim Amplayo","Craig Swanson","Dessie Petrova","Shashi Narayan","Arthur Guez","Siddhartha Brahma","Jessica Landon","Miteyan Patel","Ruizhe Zhao","Kevin Villela","Luyu Wang","Wenhao Jia","Matthew Rahtz","Mai Giménez","Legg Yeung","Hanzhao Lin","James Keeling","Petko Georgiev","Diana Mincu","Boxi Wu","Salem Haykal","Rachel Saputro","Kiran Vodrahalli","James Qin","Zeynep Cankara","Abhanshu Sharma","Nick Fernando","Will Hawkins","Behnam Neyshabur","Solomon Kim","Adrian Hutter","Priyanka Agrawal","Alex Castro-Ros","George van den Driessche","Tao Wang","Fan Yang","Shuo-yiin Chang","Paul Komarek","Ross McIlroy","Mario Lučić","Guodong Zhang","Wael Farhan","Michael Sharman","Paul Natsev","Paul Michel","Yong Cheng","Yamini Bansal","Siyuan Qiao","Kris Cao","Siamak Shakeri","Christina Butterfield","Justin Chung","Paul Kishan Rubenstein","Shivani Agrawal","Arthur Mensch","Kedar Soparkar","Karel Lenc","Timothy Chung","Aedan Pope","Loren Maggiore","Jackie Kay","Priya Jhakra","Shibo Wang","Joshua Maynez","Mary Phuong","Taylor Tobin","Andrea Tacchetti","Maja Trebacz","Kevin Robinson","Yash Katariya","Sebastian Riedel","Paige Bailey","Kefan Xiao","Nimesh Ghelani","Lora Aroyo","Ambrose Slone","Neil Houlsby","Xuehan Xiong","Zhen Yang","Elena Gribovskaya","Jonas Adler","Mateo Wirth","Lisa Lee","Music Li","Thais Kagohara","Jay Pavagadhi","Sophie Bridgers","Anna Bortsova","Sanjay Ghemawat","Zafarali Ahmed","Tianqi Liu","Richard Powell","Vijay Bolina","Mariko Iinuma","Polina Zablotskaia","James Besley","Da-Woon Chung","Timothy Dozat","Ramona Comanescu","Xiance Si","Jeremy Greer","Guolong Su","Martin Polacek","Raphaël Lopez Kaufman","Simon Tokumine","Hexiang Hu","Elena Buchatskaya","Yingjie Miao","Mohamed Elhawaty","Aditya Siddhant","Nenad Tomasev","Jinwei Xing","Christina Greer","Helen Miller","Shereen Ashraf","Aurko Roy","Zizhao Zhang","Ada Ma","Angelos Filos","Milos Besta","Rory Blevins","Ted Klimenko","Chih-Kuan Yeh","Soravit Changpinyo","Jiaqi Mu","Oscar Chang","Mantas Pajarskas","Carrie Muir","Vered Cohen","Charline Le Lan","Krishna Haridasan","Amit Marathe","Steven Hansen","Sholto Douglas","Rajkumar Samuel","Mingqiu Wang","Sophia Austin","Chang Lan","Jiepu Jiang","Justin Chiu","Jaime Alonso Lorenzo","Lars Lowe Sjösund","Sébastien Cevey","Zach Gleicher","Thi Avrahami","Anudhyan Boral","Hansa Srinivasan","Vittorio Selo","Rhys May","Konstantinos Aisopos","Léonard Hussenot","Livio Baldini Soares","Kate Baumli","Michael B. Chang","Adrià Recasens","Ben Caine","Alexander Pritzel","Filip Pavetic","Fabio Pardo","Anita Gergely","Justin Frye","Vinay Ramasesh","Dan Horgan","Kartikeya Badola","Nora Kassner","Subhrajit Roy","Ethan Dyer","Víctor Campos","Alex Tomala","Yunhao Tang","Dalia El Badawy","Elspeth White","Basil Mustafa","Oran Lang","Abhishek Jindal","Sharad Vikram","Zhitao Gong","Sergi Caelles","Ross Hemsley","Gregory Thornton","Fangxiaoyu Feng","Wojciech Stokowiec","Ce Zheng","Phoebe Thacker","Çağlar Ünlü","Zhishuai Zhang","Mohammad Saleh","James Svensson","Max Bileschi","Piyush Patil","Ankesh Anand","Roman Ring","Katerina Tsihlas","Arpi Vezer","Marco Selvi","Toby Shevlane","Mikel Rodriguez","Tom Kwiatkowski","Samira Daruki","Keran Rong","Allan Dafoe","Nicholas FitzGerald","Keren Gu-Lemberg","Mina Khan","Lisa Anne Hendricks","Marie Pellat","Vladimir Feinberg","James Cobon-Kerr","Tara Sainath","Maribeth Rauh","Sayed Hadi Hashemi","Richard Ives","Yana Hasson","YaGuang Li","Eric Noland","Yuan Cao","Nathan Byrd","Le Hou","Qingze Wang","Thibault Sottiaux","Michela Paganini","Jean-Baptiste Lespiau","Alexandre Moufarek","Samer Hassan","Kaushik Shivakumar","Joost van Amersfoort","Amol Mandhane","Pratik Joshi","Anirudh Goyal","Matthew Tung","Andrew Brock","Hannah Sheahan","Vedant Misra","Cheng Li","Nemanja Rakićević","Mostafa Dehghani","Fangyu Liu","Sid Mittal","Junhyuk Oh","Seb Noury","Eren Sezener","Fantine Huot","Matthew Lamm","Nicola De Cao","Charlie Chen","Gamaleldin Elsayed","Ed Chi","Mahdis Mahdieh","Ian Tenney","Nan Hua","Ivan Petrychenko","Patrick Kane","Dylan Scandinaro","Rishub Jain","Jonathan Uesato","Romina Datta","Adam Sadovsky","Oskar Bunyan","Dominik Rabiej","Shimu Wu","John Zhang","Gautam Vasudevan","Edouard Leurent","Mahmoud Alnahlawi","Ionut Georgescu","Nan Wei","Ivy Zheng","Betty Chan","Pam G Rabinovitch","Piotr Stanczyk","Ye Zhang","David Steiner","Subhajit Naskar","Michael Azzam","Matthew Johnson","Adam Paszke","Chung-Cheng Chiu","Jaume Sanchez Elias","Afroz Mohiuddin","Faizan Muhammad","Jin Miao","Andrew Lee","Nino Vieillard","Sahitya Potluri","Jane Park","Elnaz Davoodi","Jiageng Zhang","Jeff Stanway","Drew Garmon","Abhijit Karmarkar","Zhe Dong","Jong Lee","Aviral Kumar","Luowei Zhou","Jonathan Evens","William Isaac","Zhe Chen","Johnson Jia","Anselm Levskaya","Zhenkai Zhu","Chris Gorgolewski","Peter Grabowski","Yu Mao","Alberto Magni","Kaisheng Yao","Javier Snaider","Norman Casagrande","Paul Suganthan","Evan Palmer","Geoffrey Irving","Edward Loper","Manaal Faruqui","Isha Arkatkar","Nanxin Chen","Izhak Shafran","Michael Fink","Alfonso Castaño","Irene Giannoumis","Wooyeol Kim","Mikołaj Rybiński","Ashwin Sreevatsa","Jennifer Prendki","David Soergel","Adrian Goedeckemeyer","Willi Gierke","Mohsen Jafari","Meenu Gaba","Jeremy Wiesner","Diana Gage Wright","Yawen Wei","Harsha Vashisht","Yana Kulizhskaya","Jay Hoover","Maigo Le","Lu Li","Chimezie Iwuanyanwu","Lu Liu","Kevin Ramirez","Andrey Khorlin","Albert Cui","Tian LIN","Marin Georgiev","Marcus Wu","Ricardo Aguilar","Keith Pallo","Abhishek Chakladar","Alena Repina","Xihui Wu","Tom van der Weide","Priya Ponnapalli","Caroline Kaplan","Jiri Simsa","Shuangfeng Li","Olivier Dousse","Fan Yang","Jeff Piper","Nathan Ie","Minnie Lui","Rama Pasumarthi","Nathan Lintz","Anitha Vijayakumar","Lam Nguyen Thiet","Daniel Andor","Pedro Valenzuela","Cosmin Paduraru","Daiyi Peng","Katherine Lee","Shuyuan Zhang","Somer Greene","Duc Dung Nguyen","Paula Kurylowicz","Sarmishta Velury","Sebastian Krause","Cassidy Hardin","Lucas Dixon","Lili Janzer","Kiam Choo","Ziqiang Feng","Biao Zhang","Achintya Singhal","Tejasi Latkar","Mingyang Zhang","Quoc Le","Elena Allica Abellan","Dayou Du","Dan McKinnon","Natasha Antropova","Tolga Bolukbasi","Orgad Keller","David Reid","Daniel Finchelstein","Maria Abi Raad","Remi Crocker","Peter Hawkins","Robert Dadashi","Colin Gaffney","Sid Lall","Ken Franko","Egor Filonov","Anna Bulanova","Rémi Leblond","Vikas Yadav","Shirley Chung","Harry Askham","Luis C. Cobo","Kelvin Xu","Felix Fischer","Jun Xu","Christina Sorokin","Chris Alberti","Chu-Cheng Lin","Colin Evans","Hao Zhou","Alek Dimitriev","Hannah Forbes","Dylan Banarse","Zora Tung","Jeremiah Liu","Mark Omernick","Colton Bishop","Chintu Kumar","Rachel Sterneck","Ryan Foley","Rohan Jain","Swaroop Mishra","Jiawei Xia","Taylor Bos","Geoffrey Cideron","Ehsan Amid","Francesco Piccinno","Xingyu Wang","Praseem Banzal","Petru Gurita","Hila Noga","Premal Shah","Daniel J. Mankowitz","Alex Polozov","Nate Kushman","Victoria Krakovna","Sasha Brown","MohammadHossein Bateni","Dennis Duan","Vlad Firoiu","Meghana Thotakuri","Tom Natan","Anhad Mohananey","Matthieu Geist","Sidharth Mudgal","Sertan Girgin","Hui Li","Jiayu Ye","Ofir Roval","Reiko Tojo","Michael Kwong","James Lee-Thorp","Christopher Yew","Quan Yuan","Sumit Bagri","Danila Sinopalnikov","Sabela Ramos","John Mellor","Abhishek Sharma","Aliaksei Severyn","Jonathan Lai","Kathy Wu","Heng-Tze Cheng","David Miller","Nicolas Sonnerat","Denis Vnukov","Rory Greig","Jennifer Beattie","Emily Caveness","Libin Bai","Julian Eisenschlos","Alex Korchemniy","Tomy Tsai","Mimi Jasarevic","Weize Kong","Phuong Dao","Zeyu Zheng","Frederick Liu","Fan Yang","Rui Zhu","Mark Geller","Tian Huey Teh","Jason Sanmiya","Evgeny Gladchenko","Nejc Trdin","Andrei Sozanschi","Daniel Toyama","Evan Rosen","Sasan Tavakkol","Linting Xue","Chen Elkind","Oliver Woodman","John Carpenter","George Papamakarios","Rupert Kemp","Sushant Kafle","Tanya Grunina","Rishika Sinha","Alice Talbert","Abhimanyu Goyal","Diane Wu","Denese Owusu-Afriyie","Cosmo Du","Chloe Thornton","Jordi Pont-Tuset","Pradyumna Narayana","Jing Li","Sabaer Fatehi","John Wieting","Omar Ajmeri","Benigno Uria","Tao Zhu","Yeongil Ko","Laura Knight","Amélie Héliou","Ning Niu","Shane Gu","Chenxi Pang","Dustin Tran","Yeqing Li","Nir Levine","Ariel Stolovich","Norbert Kalb","Rebeca Santamaria-Fernandez","Sonam Goenka","Wenny Yustalim","Robin Strudel","Ali Elqursh","Balaji Lakshminarayanan","Charlie Deck","Shyam Upadhyay","Hyo Lee","Mike Dusenberry","Zonglin Li","Xuezhi Wang","Kyle Levin","Raphael Hoffmann","Dan Holtmann-Rice","Olivier Bachem","Summer Yue","Sho Arora","Eric Malmi","Daniil Mirylenka","Qijun Tan","Christy Koh","Soheil Hassas Yeganeh","Siim Põder","Steven Zheng","Francesco Pongetti","Mukarram Tariq","Yanhua Sun","Lucian Ionita","Mojtaba Seyedhosseini","Pouya Tafti","Ragha Kotikalapudi","Zhiyu Liu","Anmol Gulati","Jasmine Liu","Xinyu Ye","Bart Chrzaszcz","Lily Wang","Nikhil Sethi","Tianrun Li","Ben Brown","Shreya Singh","Wei Fan","Aaron Parisi","Joe Stanton","Chenkai Kuang","Vinod Koverkathu","Christopher A. Choquette-Choo","Yunjie Li","TJ Lu","Abe Ittycheriah","Prakash Shroff","Pei Sun","Mani Varadarajan","Sanaz Bahargam","Rob Willoughby","David Gaddy","Ishita Dasgupta","Guillaume Desjardins","Marco Cornero","Brona Robenek","Bhavishya Mittal","Ben Albrecht","Ashish Shenoy","Fedor Moiseev","Henrik Jacobsson","Alireza Ghaffarkhah","Morgane Rivière","Alanna Walton","Clément Crepy","Alicia Parrish","Yuan Liu","Zongwei Zhou","Clement Farabet","Carey Radebaugh","Praveen Srinivasan","Claudia van der Salm","Andreas Fidjeland","Salvatore Scellato","Eri Latorre-Chimoto","Hanna Klimczak-Plucińska","David Bridson","Dario de Cesare","Tom Hudson","Piermaria Mendolicchio","Lexi Walker","Alex Morris","Ivo Penchev","Matthew Mauger","Alexey Guseynov","Alison Reid","Seth Odoom","Lucia Loher","Victor Cotruta","Madhavi Yenugula","Dominik Grewe","Anastasia Petrushkina","Tom Duerig","Antonio Sanchez","Steve Yadlowsky","Amy Shen","Amir Globerson","Adam Kurzrok","Lynette Webb","Sahil Dua","Dong Li","Preethi Lahoti","Surya Bhupatiraju","Dan Hurt","Haroon Qureshi","Ananth Agarwal","Tomer Shani","Matan Eyal","Anuj Khare","Shreyas Rammohan Belle","Lei Wang","Chetan Tekur","Mihir Sanjay Kale","Jinliang Wei","Ruoxin Sang","Brennan Saeta","Tyler Liechty","Yi Sun","Yao Zhao","Stephan Lee","Pandu Nayak","Doug Fritz","Manish Reddy Vuyyuru","John Aslanides","Nidhi Vyas","Martin Wicke","Xiao Ma","Taylan Bilal","Evgenii Eltyshev","Daniel Balle","Nina Martin","Hardie Cate","James Manyika","Keyvan Amiri","Yelin Kim","Xi Xiong","Kai Kang","Florian Luisier","Nilesh Tripuraneni","David Madras","Mandy Guo","Austin Waters","Oliver Wang","Joshua Ainslie","Jason Baldridge","Han Zhang","Garima Pruthi","Jakob Bauer","Feng Yang","Riham Mansour","Jason Gelman","Yang Xu","George Polovets","Ji Liu","Honglong Cai","Warren Chen","XiangHai Sheng","Emily Xue","Sherjil Ozair","Adams Yu","Christof Angermueller","Xiaowei Li","Weiren Wang","Julia Wiesinger","Emmanouil Koukoumidis","Yuan Tian","Anand Iyer","Madhu Gurumurthy","Mark Goldenson","Parashar Shah","MK Blake","Hongkun Yu","Anthony Urbanowicz","Jennimaria Palomaki","Chrisantha Fernando","Kevin Brooks","Ken Durden","Harsh Mehta","Nikola Momchev","Elahe Rahimtoroghi","Maria Georgaki","Amit Raul","Sebastian Ruder","Morgan Redshaw","Jinhyuk Lee","Komal Jalan","Dinghua Li","Ginger Perng","Blake Hechtman","Parker Schuh","Milad Nasr","Mia Chen","Kieran Milan","Vladimir Mikulik","Trevor Strohman","Juliana Franco","Tim Green","Demis Hassabis","Koray Kavukcuoglu","Jeffrey Dean","Oriol Vinyals"],"pdf_url":"https://arxiv.org/pdf/2312.11805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11803v1","updated":"2023-12-19T02:35:13Z","published":"2023-12-19T02:35:13Z","title":"Designing Guiding Principles for NLP for Healthcare: A Case Study of\n Maternal Health","summary":" Objective: An ethical framework for the use of large language models (LLMs)\nis urgently needed to shape how natural language processing (NLP) tools are\nused for healthcare applications. Drawing directly from the voices of those\nmost affected, we propose a set of guiding principles for the use of NLP in\nhealthcare, with examples based on applications in maternal health.\n Materials and Methods: We led an interactive session centered on an LLM-based\nchatbot demonstration during a full-day workshop with 39 participants, and\nadditionally surveyed 30 healthcare workers and 30 birthing people about their\nvalues, needs, and perceptions of AI and LLMs. We conducted quantitative and\nqualitative analyses of the interactive discussions to consolidate our findings\ninto a set of guiding principles.\n Results: Using the case study of maternal health, we propose nine principles\nfor ethical use of LLMs, grouped into three categories: (i) contextual\nsignificance, (ii) measurements, and (iii) who/what is valued. We describe\nrationales underlying these principles and provide practical advice.\n Discussion: Healthcare faces existing challenges including the balance of\npower in clinician-patient relationships, systemic health disparities,\nhistorical injustices, and economic constraints. Our principles serve as a\nframework for surfacing key considerations when deploying LLMs in medicine, as\nwell as providing a methodological pattern for other researchers to follow.\n Conclusion: This set of principles can serve as a resource to practitioners\nworking on maternal health and other healthcare fields to emphasize the\nimportance of technical nuance, historical context, and inclusive design when\ndeveloping LLMs for use in clinical settings.\n","authors":["Maria Antoniak","Aakanksha Naik","Carla S. Alvarado","Lucy Lu Wang","Irene Y. Chen"],"pdf_url":"https://arxiv.org/pdf/2312.11803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11795v1","updated":"2023-12-19T02:11:01Z","published":"2023-12-19T02:11:01Z","title":"MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA","summary":" Large language models (LLMs) have shown great success in various Natural\nLanguage Processing (NLP) tasks, whist they still need updates after deployment\nto fix errors or keep pace with the changing knowledge in the world.\nResearchers formulate such problem as Model Editing and have developed various\neditors focusing on different axes of editing properties. However, current\neditors can hardly support all properties and rely on heavy computational\nresources. In this paper, we propose a plug-in Model Editing method based on\nneuron-indexed dynamic LoRA (MELO), which alters the behavior of language\nmodels by dynamically activating certain LoRA blocks according to the index\nbuilt in an inner vector database. Our method satisfies various editing\nproperties with high efficiency and can be easily integrated into multiple LLM\nbackbones. Experimental results show that our proposed MELO achieves\nstate-of-the-art editing performance on three sequential editing tasks\n(document classification, question answering and hallucination correction),\nwhile requires the least trainable parameters and computational cost.\n","authors":["Lang Yu","Qin Chen","Jie Zhou","Liang He"],"pdf_url":"https://arxiv.org/pdf/2312.11795v1.pdf","comment":"In Proceedings of The 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2312.11792v1","updated":"2023-12-19T02:07:42Z","published":"2023-12-19T02:07:42Z","title":"COOPER: Coordinating Specialized Agents towards a Complex Dialogue Goal","summary":" In recent years, there has been a growing interest in exploring dialogues\nwith more complex goals, such as negotiation, persuasion, and emotional\nsupport, which go beyond traditional service-focused dialogue systems. Apart\nfrom the requirement for much more sophisticated strategic reasoning and\ncommunication skills, a significant challenge of these tasks lies in the\ndifficulty of objectively measuring the achievement of their goals in a\nquantifiable way, making it difficult for existing research to directly\noptimize the dialogue procedure towards them. In our work, we emphasize the\nmultifaceted nature of complex dialogue goals and argue that it is more\nfeasible to accomplish them by comprehensively considering and jointly\npromoting their different aspects. To this end, we propose a novel dialogue\nframework, Cooper, which coordinates multiple specialized agents, each\ndedicated to a specific dialogue goal aspect separately, to approach the\ncomplex objective. Through this divide-and-conquer manner, we make complex\ndialogue goals more approachable and elicit greater intelligence via the\ncollaboration of individual agents. Experiments on persuasion and emotional\nsupport dialogues demonstrate the superiority of our method over a set of\ncompetitive baselines.\n","authors":["Yi Cheng","Wenge Liu","Jian Wang","Chak Tou Leong","Yi Ouyang","Wenjie Li","Xian Wu","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2312.11792v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2309.12276v2","updated":"2023-12-19T01:49:45Z","published":"2023-09-21T17:37:01Z","title":"LLMR: Real-time Prompting of Interactive Worlds using Large Language\n Models","summary":" We present Large Language Model for Mixed Reality (LLMR), a framework for the\nreal-time creation and modification of interactive Mixed Reality experiences\nusing LLMs. LLMR leverages novel strategies to tackle difficult cases where\nideal training data is scarce, or where the design goal requires the synthesis\nof internal dynamics, intuitive analysis, or advanced interactivity. Our\nframework relies on text interaction and the Unity game engine. By\nincorporating techniques for scene understanding, task planning,\nself-debugging, and memory management, LLMR outperforms the standard GPT-4 by\n4x in average error rate. We demonstrate LLMR's cross-platform interoperability\nwith several example worlds, and evaluate it on a variety of creation and\nmodification tasks to show that it can produce and edit diverse objects, tools,\nand scenes. Finally, we conducted a usability study (N=11) with a diverse set\nthat revealed participants had positive experiences with the system and would\nuse it again.\n","authors":["Fernanda De La Torre","Cathy Mengying Fang","Han Huang","Andrzej Banburski-Fahey","Judith Amores Fernandez","Jaron Lanier"],"pdf_url":"https://arxiv.org/pdf/2309.12276v2.pdf","comment":"60 pages, 18 figures; Expanded discussion of experiments and the\n influence of various modules"},{"id":"http://arxiv.org/abs/2312.11785v1","updated":"2023-12-19T01:48:31Z","published":"2023-12-19T01:48:31Z","title":"Zero-Shot Fact-Checking with Semantic Triples and Knowledge Graphs","summary":" Despite progress in automated fact-checking, most systems require a\nsignificant amount of labeled training data, which is expensive. In this paper,\nwe propose a novel zero-shot method, which instead of operating directly on the\nclaim and evidence sentences, decomposes them into semantic triples augmented\nusing external knowledge graphs, and uses large language models trained for\nnatural language inference. This allows it to generalize to adversarial\ndatasets and domains that supervised models require specific training data for.\nOur empirical results show that our approach outperforms previous zero-shot\napproaches on FEVER, FEVER-Symmetric, FEVER 2.0, and Climate-FEVER, while being\ncomparable or better than supervised models on the adversarial and the\nout-of-domain datasets.\n","authors":["Zhangdie Yuan","Andreas Vlachos"],"pdf_url":"https://arxiv.org/pdf/2312.11785v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11779v1","updated":"2023-12-19T01:28:46Z","published":"2023-12-19T01:28:46Z","title":"Are you talking to ['xem'] or ['x', 'em']? On Tokenization and\n Addressing Misgendering in LLMs with Pronoun Tokenization Parity","summary":" A large body of NLP research has documented the ways gender biases manifest\nand amplify within large language models (LLMs), though this research has\npredominantly operated within a gender binary-centric context. A growing body\nof work has identified the harmful limitations of this gender-exclusive\nframing; many LLMs cannot correctly and consistently refer to persons outside\nthe gender binary, especially if they use neopronouns. While data scarcity has\nbeen identified as a possible culprit, the precise mechanisms through which it\ninfluences LLM misgendering remain underexplored. Our work addresses this gap\nby studying data scarcity's role in subword tokenization and, consequently, the\nformation of LLM word representations. We uncover how the Byte-Pair Encoding\n(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun\nmisgendering through out-of-vocabulary behavior. We introduce pronoun\ntokenization parity (PTP), a novel approach to reduce LLM neopronoun\nmisgendering by preserving a token's functional structure. We evaluate PTP's\nefficacy using pronoun consistency-based metrics and a novel syntax-based\nmetric. Through several controlled experiments, finetuning LLMs with PTP\nimproves neopronoun consistency from 14.5% to 58.4%, highlighting the\nsignificant role tokenization plays in LLM pronoun consistency.\n","authors":["Anaelia Ovalle","Ninareh Mehrabi","Palash Goyal","Jwala Dhamala","Kai-Wei Chang","Richard Zemel","Aram Galstyan","Yuval Pinter","Rahul Gupta"],"pdf_url":"https://arxiv.org/pdf/2312.11779v1.pdf","comment":"Accepted to 2023 Neurips Queer in AI workshop"},{"id":"http://arxiv.org/abs/2303.08774v4","updated":"2023-12-19T00:34:40Z","published":"2023-03-15T17:15:04Z","title":"GPT-4 Technical Report","summary":" We report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance on\nvarious professional and academic benchmarks, including passing a simulated bar\nexam with a score around the top 10% of test takers. GPT-4 is a\nTransformer-based model pre-trained to predict the next token in a document.\nThe post-training alignment process results in improved performance on measures\nof factuality and adherence to desired behavior. A core component of this\nproject was developing infrastructure and optimization methods that behave\npredictably across a wide range of scales. This allowed us to accurately\npredict some aspects of GPT-4's performance based on models trained with no\nmore than 1/1,000th the compute of GPT-4.\n","authors":[" OpenAI"," :","Josh Achiam","Steven Adler","Sandhini Agarwal","Lama Ahmad","Ilge Akkaya","Florencia Leoni Aleman","Diogo Almeida","Janko Altenschmidt","Sam Altman","Shyamal Anadkat","Red Avila","Igor Babuschkin","Suchir Balaji","Valerie Balcom","Paul Baltescu","Haiming Bao","Mo Bavarian","Jeff Belgum","Irwan Bello","Jake Berdine","Gabriel Bernadett-Shapiro","Christopher Berner","Lenny Bogdonoff","Oleg Boiko","Madelaine Boyd","Anna-Luisa Brakman","Greg Brockman","Tim Brooks","Miles Brundage","Kevin Button","Trevor Cai","Rosie Campbell","Andrew Cann","Brittany Carey","Chelsea Carlson","Rory Carmichael","Brooke Chan","Che Chang","Fotis Chantzis","Derek Chen","Sully Chen","Ruby Chen","Jason Chen","Mark Chen","Ben Chess","Chester Cho","Casey Chu","Hyung Won Chung","Dave Cummings","Jeremiah Currier","Yunxing Dai","Cory Decareaux","Thomas Degry","Noah Deutsch","Damien Deville","Arka Dhar","David Dohan","Steve Dowling","Sheila Dunning","Adrien Ecoffet","Atty Eleti","Tyna Eloundou","David Farhi","Liam Fedus","Niko Felix","Simón Posada Fishman","Juston Forte","Isabella Fulford","Leo Gao","Elie Georges","Christian Gibson","Vik Goel","Tarun Gogineni","Gabriel Goh","Rapha Gontijo-Lopes","Jonathan Gordon","Morgan Grafstein","Scott Gray","Ryan Greene","Joshua Gross","Shixiang Shane Gu","Yufei Guo","Chris Hallacy","Jesse Han","Jeff Harris","Yuchen He","Mike Heaton","Johannes Heidecke","Chris Hesse","Alan Hickey","Wade Hickey","Peter Hoeschele","Brandon Houghton","Kenny Hsu","Shengli Hu","Xin Hu","Joost Huizinga","Shantanu Jain","Shawn Jain","Joanne Jang","Angela Jiang","Roger Jiang","Haozhun Jin","Denny Jin","Shino Jomoto","Billie Jonn","Heewoo Jun","Tomer Kaftan","Łukasz Kaiser","Ali Kamali","Ingmar Kanitscheider","Nitish Shirish Keskar","Tabarak Khan","Logan Kilpatrick","Jong Wook Kim","Christina Kim","Yongjik Kim","Hendrik Kirchner","Jamie Kiros","Matt Knight","Daniel Kokotajlo","Łukasz Kondraciuk","Andrew Kondrich","Aris Konstantinidis","Kyle Kosic","Gretchen Krueger","Vishal Kuo","Michael Lampe","Ikai Lan","Teddy Lee","Jan Leike","Jade Leung","Daniel Levy","Chak Ming Li","Rachel Lim","Molly Lin","Stephanie Lin","Mateusz Litwin","Theresa Lopez","Ryan Lowe","Patricia Lue","Anna Makanju","Kim Malfacini","Sam Manning","Todor Markov","Yaniv Markovski","Bianca Martin","Katie Mayer","Andrew Mayne","Bob McGrew","Scott Mayer McKinney","Christine McLeavey","Paul McMillan","Jake McNeil","David Medina","Aalok Mehta","Jacob Menick","Luke Metz","Andrey Mishchenko","Pamela Mishkin","Vinnie Monaco","Evan Morikawa","Daniel Mossing","Tong Mu","Mira Murati","Oleg Murk","David Mély","Ashvin Nair","Reiichiro Nakano","Rajeev Nayak","Arvind Neelakantan","Richard Ngo","Hyeonwoo Noh","Long Ouyang","Cullen O'Keefe","Jakub Pachocki","Alex Paino","Joe Palermo","Ashley Pantuliano","Giambattista Parascandolo","Joel Parish","Emy Parparita","Alex Passos","Mikhail Pavlov","Andrew Peng","Adam Perelman","Filipe de Avila Belbute Peres","Michael Petrov","Henrique Ponde de Oliveira Pinto"," Michael"," Pokorny","Michelle Pokrass","Vitchyr Pong","Tolly Powell","Alethea Power","Boris Power","Elizabeth Proehl","Raul Puri","Alec Radford","Jack Rae","Aditya Ramesh","Cameron Raymond","Francis Real","Kendra Rimbach","Carl Ross","Bob Rotsted","Henri Roussez","Nick Ryder","Mario Saltarelli","Ted Sanders","Shibani Santurkar","Girish Sastry","Heather Schmidt","David Schnurr","John Schulman","Daniel Selsam","Kyla Sheppard","Toki Sherbakov","Jessica Shieh","Sarah Shoker","Pranav Shyam","Szymon Sidor","Eric Sigler","Maddie Simens","Jordan Sitkin","Katarina Slama","Ian Sohl","Benjamin Sokolowsky","Yang Song","Natalie Staudacher","Felipe Petroski Such","Natalie Summers","Ilya Sutskever","Jie Tang","Nikolas Tezak","Madeleine Thompson","Phil Tillet","Amin Tootoonchian","Elizabeth Tseng","Preston Tuggle","Nick Turley","Jerry Tworek","Juan Felipe Cerón Uribe","Andrea Vallone","Arun Vijayvergiya","Chelsea Voss","Carroll Wainwright","Justin Jay Wang","Alvin Wang","Ben Wang","Jonathan Ward","Jason Wei","CJ Weinmann","Akila Welihinda","Peter Welinder","Jiayi Weng","Lilian Weng","Matt Wiethoff","Dave Willner","Clemens Winter","Samuel Wolrich","Hannah Wong","Lauren Workman","Sherwin Wu","Jeff Wu","Michael Wu","Kai Xiao","Tao Xu","Sarah Yoo","Kevin Yu","Qiming Yuan","Wojciech Zaremba","Rowan Zellers","Chong Zhang","Marvin Zhang","Shengjia Zhao","Tianhao Zheng","Juntang Zhuang","William Zhuk","Barret Zoph"],"pdf_url":"https://arxiv.org/pdf/2303.08774v4.pdf","comment":"100 pages; updated authors list"},{"id":"http://arxiv.org/abs/2305.15685v2","updated":"2023-12-19T23:57:01Z","published":"2023-05-25T03:26:26Z","title":"RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting","summary":" Large Language Models (LLMs) have demonstrated impressive capabilities in\ncreative tasks such as storytelling and E-mail generation. However, as LLMs are\nprimarily trained on final text results rather than intermediate revisions, it\nmight be challenging for them to perform text rewriting tasks. Most studies in\nthe rewriting tasks focus on a particular transformation type within the\nboundaries of single sentences. In this work, we develop new strategies for\ninstruction tuning and reinforcement learning to better align LLMs for\ncross-sentence rewriting tasks using diverse wording and structures expressed\nthrough natural languages including 1) generating rewriting instruction data\nfrom Wiki edits and public corpus through instruction generation and\nchain-of-thought prompting; 2) collecting comparison data for reward model\ntraining through a new ranking function. To facilitate this research, we\nintroduce OpenRewriteEval, a novel benchmark covers a wide variety of rewriting\ntypes expressed through natural language instructions. Our results show\nsignificant improvements over a variety of baselines. The public repository is\navailable on GitHub under Google Research\n(https://github.com/google-research/google-research/tree/master/rewritelm).\n","authors":["Lei Shu","Liangchen Luo","Jayakumar Hoskere","Yun Zhu","Yinxiao Liu","Simon Tong","Jindong Chen","Lei Meng"],"pdf_url":"https://arxiv.org/pdf/2305.15685v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12660v1","updated":"2023-12-19T23:21:19Z","published":"2023-12-19T23:21:19Z","title":"Is post-editing really faster than human translation?","summary":" Time efficiency is paramount for the localisation industry, which demands\never-faster turnaround times. However, translation speed is largely\nunderresearched, and there is a lack of clarity about how language service\nproviders (LSPs) can evaluate the performance of their post-editing (PE) and\nhuman translation (HT) services. This study constitutes the first large-scale\ninvestigation of translation and revision speed in HT and in the PE of neural\nmachine translation, based on real-world data from an LSP. It uses an\nexploratory data analysis approach to investigate data for 90 million words\ntranslated by 879 linguists across 11 language pairs, over 2.5 years. The\nresults of this research indicate that (a) PE is usually but not always faster\nthan HT; (b) average speed values may be misleading; (c) translation speed is\nhighly variable; and (d) edit distance cannot be used as a proxy for\npost-editing productivity, because it does not correlate strongly with speed.\n","authors":["Silvia Terribile"],"pdf_url":"https://arxiv.org/pdf/2312.12660v1.pdf","comment":"30 pages, 11 tables, 7 figures. This article has been published in\n Translation Spaces. This is the author accepted manuscript. Please find the\n published version at: https://doi.org/10.1075/ts.22044.ter"},{"id":"http://arxiv.org/abs/2311.08206v2","updated":"2023-12-19T23:03:56Z","published":"2023-11-14T14:42:28Z","title":"Human-Centric Autonomous Systems With LLMs for User Command Reasoning","summary":" The evolution of autonomous driving has made remarkable advancements in\nrecent years, evolving into a tangible reality. However, a human-centric\nlarge-scale adoption hinges on meeting a variety of multifaceted requirements.\nTo ensure that the autonomous system meets the user's intent, it is essential\nto accurately discern and interpret user commands, especially in complex or\nemergency situations. To this end, we propose to leverage the reasoning\ncapabilities of Large Language Models (LLMs) to infer system requirements from\nin-cabin users' commands. Through a series of experiments that include\ndifferent LLM models and prompt designs, we explore the few-shot multivariate\nbinary classification accuracy of system requirements from natural language\ntextual commands. We confirm the general ability of LLMs to understand and\nreason about prompts but underline that their effectiveness is conditioned on\nthe quality of both the LLM model and the design of appropriate sequential\nprompts. Code and models are public with the link\n\\url{https://github.com/KTH-RPL/DriveCmd_LLM}.\n","authors":["Yi Yang","Qingwen Zhang","Ci Li","Daniel Simões Marta","Nazre Batool","John Folkesson"],"pdf_url":"https://arxiv.org/pdf/2311.08206v2.pdf","comment":"In Proceedings of the IEEE/CVF Winter Conference on Applications of\n Computer Vision (WACV) Workshops, 2024"},{"id":"http://arxiv.org/abs/2312.12655v1","updated":"2023-12-19T22:57:13Z","published":"2023-12-19T22:57:13Z","title":"Can Transformers Learn Sequential Function Classes In Context?","summary":" In-context learning (ICL) has revolutionized the capabilities of transformer\nmodels in NLP. In our project, we extend the understanding of the mechanisms\nunderpinning ICL by exploring whether transformers can learn from sequential,\nnon-textual function class data distributions. We introduce a novel sliding\nwindow sequential function class and employ toy-sized transformers with a GPT-2\narchitecture to conduct our experiments. Our analysis indicates that these\nmodels can indeed leverage ICL when trained on non-textual sequential function\nclasses. Additionally, our experiments with randomized y-label sequences\nhighlights that transformers retain some ICL capabilities even when the label\nassociations are obfuscated. We provide evidence that transformers can reason\nwith and understand sequentiality encoded within function classes, as reflected\nby the effective learning of our proposed tasks. Our results also show that the\nperformance deteriorated with increasing randomness in the labels, though not\nto the extent one might expect, implying a potential robustness of learned\nsequentiality against label noise. Future research may want to look into how\nprevious explanations of transformers, such as induction heads and task\nvectors, relate to sequentiality in ICL in these toy examples. Our\ninvestigation lays the groundwork for further research into how transformers\nprocess and perceive sequential data.\n","authors":["Ryan Campbell","Emma Guo","Evan Hu","Reya Vir","Ethan Hsiao"],"pdf_url":"https://arxiv.org/pdf/2312.12655v1.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.12634v1","updated":"2023-12-19T22:33:17Z","published":"2023-12-19T22:33:17Z","title":"MotionScript: Natural Language Descriptions for Expressive 3D Human\n Motions","summary":" This paper proposes MotionScript, a motion-to-text conversion algorithm and\nnatural language representation for human body motions. MotionScript aims to\ndescribe movements in greater detail and with more accuracy than previous\nnatural language approaches. Many motion datasets describe relatively objective\nand simple actions with little variation on the way they are expressed (e.g.\nsitting, walking, dribbling a ball). But for expressive actions that contain a\ndiversity of movements in the class (e.g. being sad, dancing), or for actions\noutside the domain of standard motion capture datasets (e.g. stylistic walking,\nsign-language), more specific and granular natural language descriptions are\nneeded. Our proposed MotionScript descriptions differ from existing natural\nlanguage representations in that it provides direct descriptions in natural\nlanguage instead of simple action labels or high-level human captions. To the\nbest of our knowledge, this is the first attempt at translating 3D motions to\nnatural language descriptions without requiring training data. Our experiments\nshow that when MotionScript representations are used in a text-to-motion neural\ntask, body movements are more accurately reconstructed, and large language\nmodels can be used to generate unseen complex motions.\n","authors":["Payam Jome Yazdian","Eric Liu","Li Cheng","Angelica Lim"],"pdf_url":"https://arxiv.org/pdf/2312.12634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.09156v2","updated":"2023-12-19T22:03:48Z","published":"2023-08-17T19:08:42Z","title":"Characterizing Information Seeking Events in Health-Related Social\n Discourse","summary":" Social media sites have become a popular platform for individuals to seek and\nshare health information. Despite the progress in natural language processing\nfor social media mining, a gap remains in analyzing health-related texts on\nsocial discourse in the context of events. Event-driven analysis can offer\ninsights into different facets of healthcare at an individual and collective\nlevel, including treatment options, misconceptions, knowledge gaps, etc. This\npaper presents a paradigm to characterize health-related information-seeking in\nsocial discourse through the lens of events. Events here are board categories\ndefined with domain experts that capture the trajectory of the\ntreatment/medication. To illustrate the value of this approach, we analyze\nReddit posts regarding medications for Opioid Use Disorder (OUD), a critical\nglobal health concern. To the best of our knowledge, this is the first attempt\nto define event categories for characterizing information-seeking in OUD social\ndiscourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel\ntreatment information-seeking event dataset to analyze online discourse on an\nevent-based framework. This dataset contains Reddit posts on\ninformation-seeking events related to recovery from OUD, where each post is\nannotated based on the type of events. We also establish a strong performance\nbenchmark (77.4% F1 score) for the task by employing several machine learning\nand deep learning classifiers. Finally, we thoroughly investigate the\nperformance and errors of ChatGPT on this task, providing valuable insights\ninto the LLM's capabilities and ongoing characterization efforts.\n","authors":["Omar Sharif","Madhusudan Basak","Tanzia Parvin","Ava Scharfstein","Alphonso Bradham","Jacob T. Borodovsky","Sarah E. Lord","Sarah M. Preum"],"pdf_url":"https://arxiv.org/pdf/2308.09156v2.pdf","comment":"Accepted at AAAI-2024. 9 pages, 6 tables, 2 figures"},{"id":"http://arxiv.org/abs/2303.15413v5","updated":"2023-12-19T22:03:12Z","published":"2023-03-27T17:31:13Z","title":"Debiasing Scores and Prompts of 2D Diffusion for View-consistent\n Text-to-3D Generation","summary":" Existing score-distilling text-to-3D generation techniques, despite their\nconsiderable promise, often encounter the view inconsistency problem. One of\nthe most notable issues is the Janus problem, where the most canonical view of\nan object (\\textit{e.g}., face or head) appears in other views. In this work,\nwe explore existing frameworks for score-distilling text-to-3D generation and\nidentify the main causes of the view inconsistency problem -- the embedded bias\nof 2D diffusion models. Based on these findings, we propose two approaches to\ndebias the score-distillation frameworks for view-consistent text-to-3D\ngeneration. Our first approach, called score debiasing, involves cutting off\nthe score estimated by 2D diffusion models and gradually increasing the\ntruncation value throughout the optimization process. Our second approach,\ncalled prompt debiasing, identifies conflicting words between user prompts and\nview prompts using a language model, and adjusts the discrepancy between view\nprompts and the viewing direction of an object. Our experimental results show\nthat our methods improve the realism of the generated 3D objects by\nsignificantly reducing artifacts and achieve a good trade-off between\nfaithfulness to the 2D diffusion models and 3D consistency with little\noverhead. Our project page is available\nat~\\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.\n","authors":["Susung Hong","Donghoon Ahn","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2303.15413v5.pdf","comment":"Accepted to NeurIPS 2023. Project Page:\n https://susunghong.github.io/Debiased-Score-Distillation-Sampling/"},{"id":"http://arxiv.org/abs/2312.12624v1","updated":"2023-12-19T22:01:01Z","published":"2023-12-19T22:01:01Z","title":"Building a Llama2-finetuned LLM for Odia Language Utilizing Domain\n Knowledge Instruction Set","summary":" Building LLMs for languages other than English is in great demand due to the\nunavailability and performance of multilingual LLMs, such as understanding the\nlocal context. The problem is critical for low-resource languages due to the\nneed for instruction sets. In a multilingual country like India, there is a\nneed for LLMs supporting Indic languages to provide generative AI and LLM-based\ntechnologies and services to its citizens.\n This paper presents our approach of i) generating a large Odia instruction\nset, including domain knowledge data suitable for LLM fine-tuning, and ii)\nbuilding a Llama2-finetuned model tailored for enhanced performance in the Odia\ndomain. The proposed work will help researchers build an instruction set and\nLLM, particularly for Indic languages. We will release the model and\ninstruction set for the public for research and noncommercial purposes.\n","authors":["Guneet Singh Kohli","Shantipriya Parida","Sambit Sekhar","Samirit Saha","Nipun B Nair","Parul Agarwal","Sonal Khosla","Kusumlata Patiyal","Debasish Dhal"],"pdf_url":"https://arxiv.org/pdf/2312.12624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12588v1","updated":"2023-12-19T20:35:08Z","published":"2023-12-19T20:35:08Z","title":"An Empirical study of Unsupervised Neural Machine Translation: analyzing\n NMT output, model's behavior and sentences' contribution","summary":" Unsupervised Neural Machine Translation (UNMT) focuses on improving NMT\nresults under the assumption there is no human translated parallel data, yet\nlittle work has been done so far in highlighting its advantages compared to\nsupervised methods and analyzing its output in aspects other than translation\naccuracy. We focus on three very diverse languages, French, Gujarati, and\nKazakh, and train bilingual NMT models, to and from English, with various\nlevels of supervision, in high- and low- resource setups, measure quality of\nthe NMT output and compare the generated sequences' word order and semantic\nsimilarity to source and reference sentences. We also use Layer-wise Relevance\nPropagation to evaluate the source and target sentences' contribution to the\nresult, expanding the findings of previous works to the UNMT paradigm.\n","authors":["Isidora Chara Tourni","Derry Wijaya"],"pdf_url":"https://arxiv.org/pdf/2312.12588v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11698v3","updated":"2023-12-19T19:38:39Z","published":"2023-06-20T17:24:23Z","title":"DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT\n Models","summary":" Generative Pre-trained Transformer (GPT) models have exhibited exciting\nprogress in their capabilities, capturing the interest of practitioners and the\npublic alike. Yet, while the literature on the trustworthiness of GPT models\nremains limited, practitioners have proposed employing capable GPT models for\nsensitive applications such as healthcare and finance -- where mistakes can be\ncostly. To this end, this work proposes a comprehensive trustworthiness\nevaluation for large language models with a focus on GPT-4 and GPT-3.5,\nconsidering diverse perspectives -- including toxicity, stereotype bias,\nadversarial robustness, out-of-distribution robustness, robustness on\nadversarial demonstrations, privacy, machine ethics, and fairness. Based on our\nevaluations, we discover previously unpublished vulnerabilities to\ntrustworthiness threats. For instance, we find that GPT models can be easily\nmisled to generate toxic and biased outputs and leak private information in\nboth training data and conversation history. We also find that although GPT-4\nis usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more\nvulnerable given jailbreaking system or user prompts, potentially because GPT-4\nfollows (misleading) instructions more precisely. Our work illustrates a\ncomprehensive trustworthiness evaluation of GPT models and sheds light on the\ntrustworthiness gaps. Our benchmark is publicly available at\nhttps://decodingtrust.github.io/; our dataset can be previewed at\nhttps://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of\nthis work is at https://openreview.net/pdf?id=kaHpo8OZw2.\n","authors":["Boxin Wang","Weixin Chen","Hengzhi Pei","Chulin Xie","Mintong Kang","Chenhui Zhang","Chejian Xu","Zidi Xiong","Ritik Dutta","Rylan Schaeffer","Sang T. Truong","Simran Arora","Mantas Mazeika","Dan Hendrycks","Zinan Lin","Yu Cheng","Sanmi Koyejo","Dawn Song","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2306.11698v3.pdf","comment":"NeurIPS 2023 Outstanding Paper (Datasets and Benchmarks Track)"},{"id":"http://arxiv.org/abs/2303.06854v2","updated":"2023-12-19T19:12:53Z","published":"2023-03-13T04:49:46Z","title":"Robust Contrastive Language-Image Pre-training against Data Poisoning\n and Backdoor Attacks","summary":" Contrastive vision-language representation learning has achieved\nstate-of-the-art performance for zero-shot classification, by learning from\nmillions of image-caption pairs crawled from the internet. However, the massive\ndata that powers large multimodal models such as CLIP, makes them extremely\nvulnerable to various types of targeted data poisoning and backdoor attacks.\nDespite this vulnerability, robust contrastive vision-language pre-training\nagainst such attacks has remained unaddressed. In this work, we propose ROCLIP,\nthe first effective method for robust pre-training multimodal vision-language\nmodels against targeted data poisoning and backdoor attacks. ROCLIP effectively\nbreaks the association between poisoned image-caption pairs by considering a\nrelatively large and varying pool of random captions, and matching every image\nwith the text that is most similar to it in the pool instead of its own\ncaption, every few epochs.It also leverages image and text augmentations to\nfurther strengthen the defense and improve the performance of the model. Our\nextensive experiments show that ROCLIP renders state-of-the-art targeted data\npoisoning and backdoor attacks ineffective during pre-training CLIP models. In\nparticular, ROCLIP decreases the success rate for targeted data poisoning\nattacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while\nimproving the model's linear probe performance by 10% and maintains a similar\nzero shot performance compared to CLIP. By increasing the frequency of\nmatching, ROCLIP is able to defend strong attacks, which add up to 1% poisoned\nexamples to the data, and successfully maintain a low attack success rate of\n12.5%, while trading off the performance on some tasks.\n","authors":["Wenhan Yang","Jingdong Gao","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2303.06854v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12466v1","updated":"2023-12-19T03:18:12Z","published":"2023-12-19T03:18:12Z","title":"Users Approach on Providing Feedback for Smart Home Devices","summary":" Smart Home technology has accomplished extraordinary interest in making\nindividuals' lives more straightforward and more relaxing as of late.\nTechnology as of late brought about delivering numerous savvy and refined\nframeworks which advanced clever living innovation. In this paper, we will be\ninvestigating the behavioural intention of user's approach on providing\nfeedback for smart home devices. We will be conducting an online survey for\nsample of three to five students selected by simple random sampling to study\nthe user's motto for giving feedback on smart home devices and their\nexpectations. We have observed that most users are ready to share their\nfeedback on smart home devices actively to improvise the service and quality of\nthe product to fulfill the user needs and make their lives easier.\n","authors":["Santhosh Pogaku"],"pdf_url":"https://arxiv.org/pdf/2312.12466v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2312.11817"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.12437v1","updated":"2023-12-19T18:59:53Z","published":"2023-12-19T18:59:53Z","title":"Weakly Supervised Open-Vocabulary Object Detection","summary":" Despite weakly supervised object detection (WSOD) being a promising step\ntoward evading strong instance-level annotations, its capability is confined to\nclosed-set categories within a single training dataset. In this paper, we\npropose a novel weakly supervised open-vocabulary object detection framework,\nnamely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize\ndiverse datasets with only image-level annotations. To achieve this, we explore\nthree vital strategies, including dataset-level feature adaptation, image-level\nsalient object localization, and region-level vision-language alignment. First,\nwe perform data-aware feature extraction to produce an input-conditional\ncoefficient, which is leveraged into dataset attribute prototypes to identify\ndataset bias and help achieve cross-dataset generalization. Second, a\ncustomized location-oriented weakly supervised region proposal network is\nproposed to utilize high-level semantic layouts from the category-agnostic\nsegment anything model to distinguish object boundaries. Lastly, we introduce a\nproposal-concept synchronized multiple-instance network, i.e., object mining\nand refinement with visual-semantic alignment, to discover objects matched to\nthe text embeddings of concepts. Extensive experiments on Pascal VOC and MS\nCOCO demonstrate that the proposed WSOVOD achieves new state-of-the-art\ncompared with previous WSOD methods in both close-set object localization and\ndetection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary\nlearning to achieve on-par or even better performance than well-established\nfully-supervised open-vocabulary object detection (FSOVOD).\n","authors":["Jianghang Lin","Yunhang Shen","Bingquan Wang","Shaohui Lin","Ke Li","Liujuan Cao"],"pdf_url":"https://arxiv.org/pdf/2312.12437v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12436v1","updated":"2023-12-19T18:59:22Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Haojia Lin","Zihan Wang","Timin Gao","Yongdong Luo","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Xiawu Zheng","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Xing Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2312.12436v1.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2312.12433v1","updated":"2023-12-19T18:58:40Z","published":"2023-12-19T18:58:40Z","title":"Tracking Any Object Amodally","summary":" Amodal perception, the ability to comprehend complete object structures from\npartial visibility, is a fundamental skill, even for infants. Its significance\nextends to applications like autonomous driving, where a clear understanding of\nheavily occluded objects is essential. However, modern detection and tracking\nalgorithms often overlook this critical capability, perhaps due to the\nprevalence of modal annotations in most datasets. To address the scarcity of\namodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse\ncategories in thousands of video sequences. Our dataset includes amodal and\nmodal bounding boxes for visible and occluded objects, including objects that\nare partially out-of-frame. To enhance amodal tracking with object permanence,\nwe leverage a lightweight plug-in module, the amodal expander, to transform\nstandard, modal trackers into amodal ones through fine-tuning on a few hundred\nvideo sequences with data augmentation. We achieve a 3.3\\% and 1.6\\%\nimprovement on the detection and tracking of occluded objects on TAO-Amodal.\nWhen evaluated on people, our method produces dramatic improvements of 2x\ncompared to state-of-the-art modal baselines.\n","authors":["Cheng-Yen Hsieh","Tarasha Khurana","Achal Dave","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2312.12433v1.pdf","comment":"Project Page: https://tao-amodal.github.io"},{"id":"http://arxiv.org/abs/2312.12431v1","updated":"2023-12-19T18:57:34Z","published":"2023-12-19T18:57:34Z","title":"On Inference Stability for Diffusion Models","summary":" Denoising Probabilistic Models (DPMs) represent an emerging domain of\ngenerative models that excel in generating diverse and high-quality images.\nHowever, most current training methods for DPMs often neglect the correlation\nbetween timesteps, limiting the model's performance in generating images\neffectively. Notably, we theoretically point out that this issue can be caused\nby the cumulative estimation gap between the predicted and the actual\ntrajectory. To minimize that gap, we propose a novel \\textit{sequence-aware}\nloss that aims to reduce the estimation gap to enhance the sampling quality.\nFurthermore, we theoretically show that our proposed loss function is a tighter\nupper bound of the estimation loss in comparison with the conventional loss in\nDPMs. Experimental results on several benchmark datasets including CIFAR10,\nCelebA, and CelebA-HQ consistently show a remarkable improvement of our\nproposed method regarding the image generalization quality measured by FID and\nInception Score compared to several DPM baselines. Our code and pre-trained\ncheckpoints are available at \\url{https://github.com/viettmab/SA-DPM}.\n","authors":["Viet Nguyen","Giang Vu","Tung Nguyen Thanh","Khoat Than","Toan Tran"],"pdf_url":"https://arxiv.org/pdf/2312.12431v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12429v1","updated":"2023-12-19T18:56:44Z","published":"2023-12-19T18:56:44Z","title":"The Endoscapes Dataset for Surgical Scene Segmentation, Object\n Detection, and Critical View of Safety Assessment: Official Splits and\n Benchmark","summary":" This technical report provides a detailed overview of Endoscapes, a dataset\nof laparoscopic cholecystectomy (LC) videos with highly intricate annotations\ntargeted at automated assessment of the Critical View of Safety (CVS).\nEndoscapes comprises 201 LC videos with frames annotated sparsely but regularly\nwith segmentation masks, bounding boxes, and CVS assessment by three different\nclinical experts. Altogether, there are 11090 frames annotated with CVS and\n1933 frames annotated with tool and anatomy bounding boxes from the 201 videos,\nas well as an additional 422 frames from 50 of the 201 videos annotated with\ntool and anatomy segmentation masks. In this report, we provide detailed\ndataset statistics (size, class distribution, dataset splits, etc.) and a\ncomprehensive performance benchmark for instance segmentation, object\ndetection, and CVS prediction. The dataset and model checkpoints are publically\navailable at https://github.com/CAMMA-public/Endoscapes.\n","authors":["Aditya Murali","Deepak Alapatt","Pietro Mascagni","Armine Vardazaryan","Alain Garcia","Nariaki Okamoto","Guido Costamagna","Didier Mutter","Jacques Marescaux","Bernard Dallemagne","Nicolas Padoy"],"pdf_url":"https://arxiv.org/pdf/2312.12429v1.pdf","comment":"7 pages; 3 figures"},{"id":"http://arxiv.org/abs/2312.12425v1","updated":"2023-12-19T18:53:47Z","published":"2023-12-19T18:53:47Z","title":"SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete\n Diffusion Process","summary":" In this paper, we explore a principal way to enhance the quality of object\nmasks produced by different segmentation models. We propose a model-agnostic\nsolution called SegRefiner, which offers a novel perspective on this problem by\ninterpreting segmentation refinement as a data generation process. As a result,\nthe refinement process can be smoothly implemented through a series of\ndenoising diffusion steps. Specifically, SegRefiner takes coarse masks as\ninputs and refines them using a discrete diffusion process. By predicting the\nlabel and corresponding states-transition probabilities for each pixel,\nSegRefiner progressively refines the noisy masks in a conditional denoising\nmanner. To assess the effectiveness of SegRefiner, we conduct comprehensive\nexperiments on various segmentation tasks, including semantic segmentation,\ninstance segmentation, and dichotomous image segmentation. The results\ndemonstrate the superiority of our SegRefiner from multiple aspects. Firstly,\nit consistently improves both the segmentation metrics and boundary metrics\nacross different types of coarse masks. Secondly, it outperforms previous\nmodel-agnostic refinement methods by a significant margin. Lastly, it exhibits\na strong capability to capture extremely fine details when refining\nhigh-resolution images. The source code and trained models are available at\nhttps://github.com/MengyuWang826/SegRefiner.\n","authors":["Mengyu Wang","Henghui Ding","Jun Hao Liew","Jiajun Liu","Yao Zhao","Yunchao Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12425v1.pdf","comment":"NeurIPS 2023, Code: https://github.com/MengyuWang826/SegRefiner"},{"id":"http://arxiv.org/abs/2312.12423v1","updated":"2023-12-19T18:53:01Z","published":"2023-12-19T18:53:01Z","title":"Jack of All Tasks, Master of Many: Designing General-purpose\n Coarse-to-Fine Vision-Language Model","summary":" The ability of large language models (LLMs) to process visual inputs has\ngiven rise to general-purpose vision systems, unifying various vision-language\n(VL) tasks by instruction tuning. However, due to the enormous diversity in\ninput-output formats in the vision domain, existing general-purpose models fail\nto successfully integrate segmentation and multi-image inputs with coarse-level\ntasks into a single framework. In this work, we introduce VistaLLM, a powerful\nvisual system that addresses coarse- and fine-grained VL tasks over single and\nmultiple input images using a unified framework. VistaLLM utilizes an\ninstruction-guided image tokenizer that filters global embeddings using task\ndescriptions to extract compressed and refined features from numerous images.\nMoreover, VistaLLM employs a gradient-aware adaptive sampling technique to\nrepresent binary segmentation masks as sequences, significantly improving over\npreviously used uniform sampling. To bolster the desired capability of\nVistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning\ndataset with 6.8M samples. We also address the lack of multi-image grounding\ndatasets by introducing a novel task, AttCoSeg (Attribute-level\nCo-Segmentation), which boosts the model's reasoning and grounding capability\nover multiple input images. Extensive experiments on a wide range of V- and VL\ntasks demonstrate the effectiveness of VistaLLM by achieving consistent\nstate-of-the-art performance over strong baselines across all downstream tasks.\nOur project page can be found at https://shramanpramanick.github.io/VistaLLM/.\n","authors":["Shraman Pramanick","Guangxing Han","Rui Hou","Sayan Nag","Ser-Nam Lim","Nicolas Ballas","Qifan Wang","Rama Chellappa","Amjad Almahairi"],"pdf_url":"https://arxiv.org/pdf/2312.12423v1.pdf","comment":"24 pages including references and supplementary"},{"id":"http://arxiv.org/abs/2312.12419v1","updated":"2023-12-19T18:50:33Z","published":"2023-12-19T18:50:33Z","title":"Scene-Conditional 3D Object Stylization and Composition","summary":" Recently, 3D generative models have made impressive progress, enabling the\ngeneration of almost arbitrary 3D assets from text or image inputs. However,\nthese approaches generate objects in isolation without any consideration for\nthe scene where they will eventually be placed. In this paper, we propose a\nframework that allows for the stylization of an existing 3D asset to fit into a\ngiven 2D scene, and additionally produce a photorealistic composition as if the\nasset was placed within the environment. This not only opens up a new level of\ncontrol for object stylization, for example, the same assets can be stylized to\nreflect changes in the environment, such as summer to winter or fantasy versus\nfuturistic settings-but also makes the object-scene composition more\ncontrollable. We achieve this by combining modeling and optimizing the object's\ntexture and environmental lighting through differentiable ray tracing with\nimage priors from pre-trained text-to-image diffusion models. We demonstrate\nthat our method is applicable to a wide variety of indoor and outdoor scenes\nand arbitrary objects.\n","authors":["Jinghao Zhou","Tomas Jakab","Philip Torr","Christian Rupprecht"],"pdf_url":"https://arxiv.org/pdf/2312.12419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12418v1","updated":"2023-12-19T18:50:10Z","published":"2023-12-19T18:50:10Z","title":"LASA: Instance Reconstruction from Real Scans using A Large-scale\n Aligned Shape Annotation Dataset","summary":" Instance shape reconstruction from a 3D scene involves recovering the full\ngeometries of multiple objects at the semantic instance level. Many methods\nleverage data-driven learning due to the intricacies of scene complexity and\nsignificant indoor occlusions. Training these methods often requires a\nlarge-scale, high-quality dataset with aligned and paired shape annotations\nwith real-world scans. Existing datasets are either synthetic or misaligned,\nrestricting the performance of data-driven methods on real data. To this end,\nwe introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising\n10,412 high-quality CAD annotations aligned with 920 real-world scene scans\nfrom ArkitScenes, created manually by professional artists. On this top, we\npropose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo)\nmethod. It is empowered by a hybrid feature aggregation design to fuse\nmulti-modal inputs and recover high-fidelity object geometries. Besides, we\npresent an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate\nthat our shape annotations provide scene occupancy clues that can further\nimprove 3D object detection. Supported by LASA, extensive experiments show that\nour methods achieve state-of-the-art performance in both instance-level scene\nreconstruction and 3D object detection tasks.\n","authors":["Haolin Liu","Chongjie Ye","Yinyu Nie","Yingfan He","Xiaoguang Han"],"pdf_url":"https://arxiv.org/pdf/2312.12418v1.pdf","comment":"homepage: https://gap-lab-cuhk-sz.github.io/LASA/"},{"id":"http://arxiv.org/abs/2312.12416v1","updated":"2023-12-19T18:47:30Z","published":"2023-12-19T18:47:30Z","title":"Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image\n Diffusion Models","summary":" The quality of the prompts provided to text-to-image diffusion models\ndetermines how faithful the generated content is to the user's intent, often\nrequiring `prompt engineering'. To harness visual concepts from target images\nwithout prompt engineering, current approaches largely rely on embedding\ninversion by optimizing and then mapping them to pseudo-tokens. However,\nworking with such high-dimensional vector representations is challenging\nbecause they lack semantics and interpretability, and only allow simple vector\noperations when using them. Instead, this work focuses on inverting the\ndiffusion model to obtain interpretable language prompts directly. The\nchallenge of doing this lies in the fact that the resulting optimization\nproblem is fundamentally discrete and the space of prompts is exponentially\nlarge; this makes using standard optimization techniques, such as stochastic\ngradient descent, difficult. To this end, we utilize a delayed projection\nscheme to optimize for prompts representative of the vocabulary space in the\nmodel. Further, we leverage the findings that different timesteps of the\ndiffusion process cater to different levels of detail in an image. The later,\nnoisy, timesteps of the forward diffusion process correspond to the semantic\ninformation, and therefore, prompt inversion in this range provides tokens\nrepresentative of the image semantics. We show that our approach can identify\nsemantically interpretable and meaningful prompts for a target image which can\nbe used to synthesize diverse images with similar content. We further\nillustrate the application of the optimized prompts in evolutionary image\ngeneration and concept removal.\n","authors":["Shweta Mahajan","Tanzila Rahman","Kwang Moo Yi","Leonid Sigal"],"pdf_url":"https://arxiv.org/pdf/2312.12416v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13304v2","updated":"2023-12-19T18:45:10Z","published":"2023-08-25T11:04:35Z","title":"Rapid Artefact Removal and H&E-Stained Tissue Segmentation","summary":" We present an innovative method for rapidly segmenting hematoxylin and eosin\n(H&E)-stained tissue in whole-slide images (WSIs) that eliminates a wide range\nof undesirable artefacts such as pen marks and scanning artefacts. Our method\ninvolves taking a single-channel representation of a lowmagnification RGB\noverview of the WSI in which the pixel values are bimodally distributed such\nthat H&E-stained tissue is easily distinguished from both background and a wide\nvariety of artefacts. We demonstrate our method on 30 WSIs prepared from a wide\nrange of institutions and WSI digital scanners, each containing substantial\nartefacts, and compare it to segmentations provided by Otsu thresholding and\nHistolab tissue segmentation and pen filtering tools. We found that our method\nsegmented the tissue and fully removed all artefacts in 29 out of 30 WSIs,\nwhereas Otsu thresholding failed to remove any artefacts, and the Histolab pen\nfiltering tools only partially removed the pen marks. The beauty of our\napproach lies in its simplicity: manipulating RGB colour space and using Otsu\nthresholding allows for the segmentation of H&E-stained tissue and the rapid\nremoval of artefacts without the need for machine learning or parameter tuning.\n","authors":["B. A. Schreiber","J. Denholm","F. Jaeckle","M. J. Arends","K. M. Branson","C. -B. Schönlieb","E. J. Soilleux"],"pdf_url":"https://arxiv.org/pdf/2308.13304v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.12379v1","updated":"2023-12-19T18:11:19Z","published":"2023-12-19T18:11:19Z","title":"Mixture of Cluster-conditional LoRA Experts for Vision-language\n Instruction Tuning","summary":" Instruction tuning of the Large Vision-language Models (LVLMs) has\nrevolutionized the development of versatile models with zero-shot\ngeneralization across a wide range of downstream vision-language tasks.\nHowever, diversity of training tasks of different sources and formats would\nlead to inevitable task conflicts, where different tasks conflicts for the same\nset of model parameters, resulting in sub-optimal instruction-following\nabilities. To address that, we propose the Mixture of Cluster-conditional LoRA\nExperts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to\nactivate the task-customized model parameters based on the instruction\nclusters. A separate universal expert is further incorporated to improve the\ngeneralization capabilities of MoCLE for novel instructions. Extensive\nexperiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.\n","authors":["Yunhao Gou","Zhili Liu","Kai Chen","Lanqing Hong","Hang Xu","Aoxue Li","Dit-Yan Yeung","James T. Kwok","Yu Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12379v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.15677v4","updated":"2023-12-19T17:42:05Z","published":"2022-05-31T10:35:55Z","title":"Augmentation-Aware Self-Supervision for Data-Efficient GAN Training","summary":" Training generative adversarial networks (GANs) with limited data is\nchallenging because the discriminator is prone to overfitting. Previously\nproposed differentiable augmentation demonstrates improved data efficiency of\ntraining GANs. However, the augmentation implicitly introduces undesired\ninvariance to augmentation for the discriminator since it ignores the change of\nsemantics in the label space caused by data transformation, which may limit the\nrepresentation learning ability of the discriminator and ultimately affect the\ngenerative modeling performance of the generator. To mitigate the negative\nimpact of invariance while inheriting the benefits of data augmentation, we\npropose a novel augmentation-aware self-supervised discriminator that predicts\nthe augmentation parameter of the augmented data. Particularly, the prediction\ntargets of real data and generated data are required to be distinguished since\nthey are different during training. We further encourage the generator to\nadversarially learn from the self-supervised discriminator by generating\naugmentation-predictable real and not fake data. This formulation connects the\nlearning objective of the generator and the arithmetic $-$ harmonic mean\ndivergence under certain assumptions. We compare our method with\nstate-of-the-art (SOTA) methods using the class-conditional BigGAN and\nunconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100,\nFFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate\nsignificant improvements of our method over SOTA methods in training\ndata-efficient GANs.\n","authors":["Liang Hou","Qi Cao","Yige Yuan","Songtao Zhao","Chongyang Ma","Siyuan Pan","Pengfei Wan","Zhongyuan Wang","Huawei Shen","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2205.15677v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.12359v1","updated":"2023-12-19T17:40:27Z","published":"2023-12-19T17:40:27Z","title":"CLIP-DINOiser: Teaching CLIP a few DINO tricks","summary":" The popular CLIP model displays impressive zero-shot capabilities thanks to\nits seamless interaction with arbitrary text prompts. However, its lack of\nspatial awareness makes it unsuitable for dense computer vision tasks, e.g.,\nsemantic segmentation, without an additional fine-tuning step that often uses\nannotations and can potentially suppress its original open-vocabulary\nproperties. Meanwhile, self-supervised representation methods have demonstrated\ngood localization properties without human-made annotations nor explicit\nsupervision. In this work, we take the best of both worlds and propose a\nzero-shot open-vocabulary semantic segmentation method, which does not require\nany annotations. We propose to locally improve dense MaskCLIP features,\ncomputed with a simple modification of CLIP's last pooling layer, by\nintegrating localization priors extracted from self-supervised features. By\ndoing so, we greatly improve the performance of MaskCLIP and produce smooth\noutputs. Moreover, we show that the used self-supervised feature properties can\ndirectly be learnt from CLIP features therefore allowing us to obtain the best\nresults with a single pass through CLIP model. Our method CLIP-DINOiser needs\nonly a single forward pass of CLIP and two light convolutional layers at\ninference, no extra supervision nor extra memory and reaches state-of-the-art\nresults on challenging and fine-grained benchmarks such as COCO, Pascal\nContext, Cityscapes and ADE20k. The code to reproduce our results is available\nat https://github.com/wysoczanska/clip_dinoiser.\n","authors":["Monika Wysoczańska","Oriane Siméoni","Michaël Ramamonjisoa","Andrei Bursuc","Tomasz Trzciński","Patrick Pérez"],"pdf_url":"https://arxiv.org/pdf/2312.12359v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12347v1","updated":"2023-12-19T17:26:44Z","published":"2023-12-19T17:26:44Z","title":"SMC-NCA: Semantic-guided Multi-level Contrast for Semi-supervised Action\n Segmentation","summary":" Semi-supervised action segmentation aims to perform frame-wise classification\nin long untrimmed videos, where only a fraction of videos in the training set\nhave labels. Recent studies have shown the potential of contrastive learning in\nunsupervised representation learning using unlabelled data. However, learning\nthe representation of each frame by unsupervised contrastive learning for\naction segmentation remains an open and challenging problem. In this paper, we\npropose a novel Semantic-guided Multi-level Contrast scheme with a\nNeighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise\nrepresentations for semi-supervised action segmentation. Specifically, for\nrepresentation learning, SMC is firstly used to explore intra- and\ninter-information variations in a unified and contrastive way, based on dynamic\nclustering process of the original input, encoded semantic and temporal\nfeatures. Then, the NCA module, which is responsible for enforcing spatial\nconsistency between neighbourhoods centered at different frames to alleviate\nover-segmentation issues, works alongside SMC for semi-supervised learning. Our\nSMC outperforms the other state-of-the-art methods on three benchmarks,\noffering improvements of up to 17.8% and 12.6% in terms of edit distance and\naccuracy, respectively. Additionally, the NCA unit results in significant\nbetter segmentation performance against the others in the presence of only 5%\nlabelled videos. We also demonstrate the effectiveness of the proposed method\non our Parkinson's Disease Mouse Behaviour (PDMB) dataset. The code and\ndatasets will be made publicly available.\n","authors":["Feixiang Zhou","Zheheng Jiang","Huiyu Zhou","Xuelong Li"],"pdf_url":"https://arxiv.org/pdf/2312.12347v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12340v1","updated":"2023-12-19T17:13:51Z","published":"2023-12-19T17:13:51Z","title":"Scalable Geometric Fracture Assembly via Co-creation Space among\n Assemblers","summary":" Geometric fracture assembly presents a challenging practical task in\narchaeology and 3D computer vision. Previous methods have focused solely on\nassembling fragments based on semantic information, which has limited the\nquantity of objects that can be effectively assembled. Therefore, there is a\nneed to develop a scalable framework for geometric fracture assembly without\nrelying on semantic information. To improve the effectiveness of assembling\ngeometric fractures without semantic information, we propose a co-creation\nspace comprising several assemblers capable of gradually and unambiguously\nassembling fractures. Additionally, we introduce a novel loss function, i.e.,\nthe geometric-based collision loss, to address collision issues during the\nfracture assembly process and enhance the results. Our framework exhibits\nbetter performance on both PartNet and Breaking Bad datasets compared to\nexisting state-of-the-art frameworks. Extensive experiments and quantitative\ncomparisons demonstrate the effectiveness of our proposed framework, which\nfeatures linear computational complexity, enhanced abstraction, and improved\ngeneralization. Our code is publicly available at\nhttps://github.com/Ruiyuan-Zhang/CCS.\n","authors":["Ruiyuan Zhang","Jiaxiang Liu","Zexi Li","Hao Dong","Jie Fu","Chao Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11145v2","updated":"2023-12-19T17:09:04Z","published":"2023-01-26T14:52:30Z","title":"Learning from Mistakes: Self-Regularizing Hierarchical Representations\n in Point Cloud Semantic Segmentation","summary":" Recent advances in autonomous robotic technologies have highlighted the\ngrowing need for precise environmental analysis. LiDAR semantic segmentation\nhas gained attention to accomplish fine-grained scene understanding by acting\ndirectly on raw content provided by sensors. Recent solutions showed how\ndifferent learning techniques can be used to improve the performance of the\nmodel, without any architectural or dataset change. Following this trend, we\npresent a coarse-to-fine setup that LEArns from classification mistaKes (LEAK)\nderived from a standard model. First, classes are clustered into macro groups\naccording to mutual prediction errors; then, the learning process is\nregularized by: (1) aligning class-conditional prototypical feature\nrepresentation for both fine and coarse classes, (2) weighting instances with a\nper-class fairness index. Our LEAK approach is very general and can be\nseamlessly applied on top of any segmentation architecture; indeed,\nexperimental results showed that it enables state-of-the-art performances on\ndifferent architectures, datasets and tasks, while ensuring more balanced\nclass-wise results and faster convergence.\n","authors":["Elena Camuffo","Umberto Michieli","Simone Milani"],"pdf_url":"https://arxiv.org/pdf/2301.11145v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12337v1","updated":"2023-12-19T17:03:50Z","published":"2023-12-19T17:03:50Z","title":"pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable\n Generalizable 3D Reconstruction","summary":" We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D\nradiance fields parameterized by 3D Gaussian primitives from pairs of images.\nOur model features real-time and memory-efficient rendering for scalable\ntraining as well as fast 3D reconstruction at inference time. To overcome local\nminima inherent to sparse and locally supported representations, we predict a\ndense probability distribution over 3D and sample Gaussian means from that\nprobability distribution. We make this sampling operation differentiable via a\nreparameterization trick, allowing us to back-propagate gradients through the\nGaussian splatting representation. We benchmark our method on wide-baseline\nnovel view synthesis on the real-world RealEstate10k and ACID datasets, where\nwe outperform state-of-the-art light field transformers and accelerate\nrendering by 2.5 orders of magnitude while reconstructing an interpretable and\neditable 3D radiance field.\n","authors":["David Charatan","Sizhe Li","Andrea Tagliasacchi","Vincent Sitzmann"],"pdf_url":"https://arxiv.org/pdf/2312.12337v1.pdf","comment":"Project page: https://pixelsplat.github.io/"},{"id":"http://arxiv.org/abs/2310.10198v3","updated":"2023-12-19T16:44:46Z","published":"2023-10-16T09:09:02Z","title":"MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete\n Representations","summary":" In this work, we present MoConVQ, a novel unified framework for physics-based\nmotion control leveraging scalable discrete representations. Building upon\nvector quantized variational autoencoders (VQ-VAE) and model-based\nreinforcement learning, our approach effectively learns motion embeddings from\na large, unstructured dataset spanning tens of hours of motion examples. The\nresultant motion representation not only captures diverse motion skills but\nalso offers a robust and intuitive interface for various applications. We\ndemonstrate the versatility of MoConVQ through several applications: universal\ntracking control from various motion sources, interactive character control\nwith latent motion representations using supervised learning, physics-based\nmotion generation from natural language descriptions using the GPT framework,\nand, most interestingly, seamless integration with large language models (LLMs)\nwith in-context learning to tackle complex and abstract tasks.\n","authors":["Heyuan Yao","Zhenhua Song","Yuyang Zhou","Tenglong Ao","Baoquan Chen","Libin Liu"],"pdf_url":"https://arxiv.org/pdf/2310.10198v3.pdf","comment":"Project page: MoConVQ.github.io"},{"id":"http://arxiv.org/abs/2312.12314v1","updated":"2023-12-19T16:39:02Z","published":"2023-12-19T16:39:02Z","title":"First qualitative observations on deep learning vision model YOLO and\n DETR for automated driving in Austria","summary":" This study investigates the application of single and two-stage 2D-object\ndetection algorithms like You Only Look Once (YOLO), Real-Time DEtection\nTRansformer (RT-DETR) algorithm for automated object detection to enhance road\nsafety for autonomous driving on Austrian roads. The YOLO algorithm is a\nstate-of-the-art real-time object detection system known for its efficiency and\naccuracy. In the context of driving, its potential to rapidly identify and\ntrack objects is crucial for advanced driver assistance systems (ADAS) and\nautonomous vehicles. The research focuses on the unique challenges posed by the\nroad conditions and traffic scenarios in Austria. The country's diverse\nlandscape, varying weather conditions, and specific traffic regulations\nnecessitate a tailored approach for reliable object detection. The study\nutilizes a selective dataset comprising images and videos captured on Austrian\nroads, encompassing urban, rural, and alpine environments.\n","authors":["Stefan Schoder"],"pdf_url":"https://arxiv.org/pdf/2312.12314v1.pdf","comment":"draft"},{"id":"http://arxiv.org/abs/2312.12274v1","updated":"2023-12-19T15:56:19Z","published":"2023-12-19T15:56:19Z","title":"Intrinsic Image Diffusion for Single-view Material Estimation","summary":" We present Intrinsic Image Diffusion, a generative model for appearance\ndecomposition of indoor scenes. Given a single input view, we sample multiple\npossible material explanations represented as albedo, roughness, and metallic\nmaps. Appearance decomposition poses a considerable challenge in computer\nvision due to the inherent ambiguity between lighting and material properties\nand the lack of real datasets. To address this issue, we advocate for a\nprobabilistic formulation, where instead of attempting to directly predict the\ntrue material properties, we employ a conditional generative model to sample\nfrom the solution space. Furthermore, we show that utilizing the strong learned\nprior of recent diffusion models trained on large-scale real-world images can\nbe adapted to material estimation and highly improves the generalization to\nreal images. Our method produces significantly sharper, more consistent, and\nmore detailed materials, outperforming state-of-the-art methods by $1.5dB$ on\nPSNR and by $45\\%$ better FID score on albedo prediction. We demonstrate the\neffectiveness of our approach through experiments on both synthetic and\nreal-world datasets.\n","authors":["Peter Kocsis","Vincent Sitzmann","Matthias Nießner"],"pdf_url":"https://arxiv.org/pdf/2312.12274v1.pdf","comment":"Project page: https://peter-kocsis.github.io/IntrinsicImageDiffusion/\n Video: https://youtu.be/lz0meJlj5cA"},{"id":"http://arxiv.org/abs/2312.12273v1","updated":"2023-12-19T15:56:08Z","published":"2023-12-19T15:56:08Z","title":"VQA4CIR: Boosting Composed Image Retrieval with Visual Question\n Answering","summary":" Albeit progress has been made in Composed Image Retrieval (CIR), we\nempirically find that a certain percentage of failure retrieval results are not\nconsistent with their relative captions. To address this issue, this work\nprovides a Visual Question Answering (VQA) perspective to boost the performance\nof CIR. The resulting VQA4CIR is a post-processing approach and can be directly\nplugged into existing CIR methods. Given the top-C retrieved images by a CIR\nmethod, VQA4CIR aims to decrease the adverse effect of the failure retrieval\nresults being inconsistent with the relative caption. To find the retrieved\nimages inconsistent with the relative caption, we resort to the \"QA generation\nto VQA\" self-verification pipeline. For QA generation, we suggest fine-tuning\nLLM (e.g., LLaMA) to generate several pairs of questions and answers from each\nrelative caption. We then fine-tune LVLM (e.g., LLaVA) to obtain the VQA model.\nBy feeding the retrieved image and question to the VQA model, one can find the\nimages inconsistent with relative caption when the answer by VQA is\ninconsistent with the answer in the QA pair. Consequently, the CIR performance\ncan be boosted by modifying the ranks of inconsistently retrieved images.\nExperimental results show that our proposed method outperforms state-of-the-art\nCIR methods on the CIRR and Fashion-IQ datasets.\n","authors":["Chun-Mei Feng","Yang Bai","Tao Luo","Zhen Li","Salman Khan","Wangmeng Zuo","Xinxing Xu","Rick Siow Mong Goh","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2312.12273v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12263v1","updated":"2023-12-19T15:46:47Z","published":"2023-12-19T15:46:47Z","title":"FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy\n Labels","summary":" Federated learning with noisy labels (F-LNL) aims at seeking an optimal\nserver model via collaborative distributed learning by aggregating multiple\nclient models trained with local noisy or clean samples. On the basis of a\nfederated learning framework, recent advances primarily adopt label noise\nfiltering to separate clean samples from noisy ones on each client, thereby\nmitigating the negative impact of label noise. However, these prior methods do\nnot learn noise filters by exploiting knowledge across all clients, leading to\nsub-optimal and inferior noise filtering performance and thus damaging training\nstability. In this paper, we present FedDiv to tackle the challenges of F-LNL.\nSpecifically, we propose a global noise filter called Federated Noise Filter\nfor effectively identifying samples with noisy labels on every client, thereby\nraising stability during local training sessions. Without sacrificing data\nprivacy, this is achieved by modeling the global distribution of label noise\nacross all clients. Then, in an effort to make the global model achieve higher\nperformance, we introduce a Predictive Consistency based Sampler to identify\nmore credible local data for local model training, thus preventing noise\nmemorization and further boosting the training stability. Extensive experiments\non CIFAR-10, CIFAR-100, and Clothing1M demonstrate that \\texttt{FedDiv}\nachieves superior performance over state-of-the-art F-LNL methods under\ndifferent label noise settings for both IID and non-IID data partitions. Source\ncode is publicly available at https://github.com/lijichang/FLNL-FedDiv.\n","authors":["Jichang Li","Guanbin Li","Hui Cheng","Zicheng Liao","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2312.12263v1.pdf","comment":"To appear in AAAI-2024"},{"id":"http://arxiv.org/abs/2312.10237v2","updated":"2023-12-19T15:44:40Z","published":"2023-12-15T22:09:04Z","title":"Vertical Federated Alzheimer's Detection on Multimodal Data","summary":" In the era of rapidly advancing medical technologies, the segmentation of\nmedical data has become inevitable, necessitating the development of privacy\npreserving machine learning algorithms that can train on distributed data.\nConsolidating sensitive medical data is not always an option particularly due\nto the stringent privacy regulations imposed by the Health Insurance\nPortability and Accountability Act (HIPAA). In this paper, we introduce a HIPAA\ncompliant framework that can train from distributed data. We then propose a\nmultimodal vertical federated model for Alzheimer's Disease (AD) detection, a\nserious neurodegenerative condition that can cause dementia, severely impairing\nbrain function and hindering simple tasks, especially without preventative\ncare. This vertical federated model offers a distributed architecture that\nenables collaborative learning across diverse sources of medical data while\nrespecting privacy constraints imposed by HIPAA. It is also able to leverage\nmultiple modalities of data, enhancing the robustness and accuracy of AD\ndetection. Our proposed model not only contributes to the advancement of\nfederated learning techniques but also holds promise for overcoming the hurdles\nposed by data segmentation in medical research. By using vertical federated\nlearning, this research strives to provide a framework that enables healthcare\ninstitutions to harness the collective intelligence embedded in their\ndistributed datasets without compromising patient privacy.\n","authors":["Paul K. Mandal"],"pdf_url":"https://arxiv.org/pdf/2312.10237v2.pdf","comment":"14 pages, 7 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.09754v2","updated":"2023-12-19T15:44:15Z","published":"2023-12-15T12:49:08Z","title":"PPFM: Image denoising in photon-counting CT using single-step posterior\n sampling Poisson flow generative models","summary":" Diffusion and Poisson flow models have shown impressive performance in a wide\nrange of generative tasks, including low-dose CT image denoising. However, one\nlimitation in general, and for clinical applications in particular, is slow\nsampling. Due to their iterative nature, the number of function evaluations\n(NFE) required is usually on the order of $10-10^3$, both for conditional and\nunconditional generation. In this paper, we present posterior sampling Poisson\nflow generative models (PPFM), a novel image denoising technique for low-dose\nand photon-counting CT that produces excellent image quality whilst keeping\nNFE=1. Updating the training and sampling processes of Poisson flow generative\nmodels (PFGM)++, we learn a conditional generator which defines a trajectory\nbetween the prior noise distribution and the posterior distribution of\ninterest. We additionally hijack and regularize the sampling process to achieve\nNFE=1. Our results shed light on the benefits of the PFGM++ framework compared\nto diffusion models. In addition, PPFM is shown to perform favorably compared\nto current state-of-the-art diffusion-style models with NFE=1, consistency\nmodels, as well as popular deep learning and non-deep learning-based image\ndenoising techniques, on clinical low-dose CT images and clinical images from a\nprototype photon-counting CT system.\n","authors":["Dennis Hein","Staffan Holmin","Timothy Szczykutowicz","Jonathan S Maltz","Mats Danielsson","Ge Wang","Mats Persson"],"pdf_url":"https://arxiv.org/pdf/2312.09754v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12250v1","updated":"2023-12-19T15:33:57Z","published":"2023-12-19T15:33:57Z","title":"ST(OR)2: Spatio-Temporal Object Level Reasoning for Activity Recognition\n in the Operating Room","summary":" Surgical robotics holds much promise for improving patient safety and\nclinician experience in the Operating Room (OR). However, it also comes with\nnew challenges, requiring strong team coordination and effective OR management.\nAutomatic detection of surgical activities is a key requirement for developing\nAI-based intelligent tools to tackle these challenges. The current\nstate-of-the-art surgical activity recognition methods however operate on\nimage-based representations and depend on large-scale labeled datasets whose\ncollection is time-consuming and resource-expensive. This work proposes a new\nsample-efficient and object-based approach for surgical activity recognition in\nthe OR. Our method focuses on the geometric arrangements between clinicians and\nsurgical devices, thus utilizing the significant object interaction dynamics in\nthe OR. We conduct experiments in a low-data regime study for long video\nactivity recognition. We also benchmark our method againstother object-centric\napproaches on clip-level action classification and show superior performance.\n","authors":["Idris Hamoud","Muhammad Abdullah Jamal","Vinkle Srivastav","Didier Mutter","Nicolas Padoy","Omid Mohareri"],"pdf_url":"https://arxiv.org/pdf/2312.12250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12246v1","updated":"2023-12-19T15:30:10Z","published":"2023-12-19T15:30:10Z","title":"MDD-UNet: Domain Adaptation for Medical Image Segmentation with\n Theoretical Guarantees, a Proof of Concept","summary":" The current state-of-the art techniques for image segmentation are often\nbased on U-Net architectures, a U-shaped encoder-decoder networks with skip\nconnections. Despite the powerful performance, the architecture often does not\nperform well when used on data which has different characteristics than the\ndata it was trained on. Many techniques for improving performance in the\npresence of domain shift have been developed, however typically only have loose\nconnections to the theory of domain adaption. In this work, we propose an\nunsupervised domain adaptation framework for U-Nets with theoretical guarantees\nbased on the Margin Disparity Discrepancy [1] called the MDD-UNet. We evaluate\nthe proposed technique on the task of hippocampus segmentation, and find that\nthe MDD-UNet is able to learn features which are domain-invariant with no\nknowledge about the labels in the target domain. The MDD-UNet improves\nperformance over the standard U-Net on 11 out of 12 combinations of datasets.\nThis work serves as a proof of concept by demonstrating an improvement on the\nU-Net in it's standard form without modern enhancements, which opens up a new\navenue of studying domain adaptation for models with very large hypothesis\nspaces from both methodological and practical perspectives. Code is available\nat https://github.com/asbjrnmunk/mdd-unet.\n","authors":["Asbjørn Munk","Ao Ma","Mads Nielsen"],"pdf_url":"https://arxiv.org/pdf/2312.12246v1.pdf","comment":"Published at NLDL 2024"},{"id":"http://arxiv.org/abs/2312.12241v1","updated":"2023-12-19T15:25:39Z","published":"2023-12-19T15:25:39Z","title":"GeomVerse: A Systematic Evaluation of Large Models for Geometric\n Reasoning","summary":" Large language models have shown impressive results for multi-hop\nmathematical reasoning when the input question is only textual. Many\nmathematical reasoning problems, however, contain both text and image. With the\never-increasing adoption of vision language models (VLMs), understanding their\nreasoning abilities for such problems is crucial. In this paper, we evaluate\nthe reasoning capabilities of VLMs along various axes through the lens of\ngeometry problems. We procedurally create a synthetic dataset of geometry\nquestions with controllable difficulty levels along multiple axes, thus\nenabling a systematic evaluation. The empirical results obtained using our\nbenchmark for state-of-the-art VLMs indicate that these models are not as\ncapable in subjects like geometry (and, by generalization, other topics\nrequiring similar reasoning) as suggested by previous benchmarks. This is made\nespecially clear by the construction of our benchmark at various depth levels,\nsince solving higher-depth problems requires long chains of reasoning rather\nthan additional memorized knowledge. We release the dataset for further\nresearch in this area.\n","authors":["Mehran Kazemi","Hamidreza Alvari","Ankit Anand","Jialin Wu","Xi Chen","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2312.12241v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12232v1","updated":"2023-12-19T15:18:40Z","published":"2023-12-19T15:18:40Z","title":"Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model","summary":" Recently, diffusion-based image generation methods are credited for their\nremarkable text-to-image generation capabilities, while still facing challenges\nin accurately generating multilingual scene text images. To tackle this\nproblem, we propose Diff-Text, which is a training-free scene text generation\nframework for any language. Our model outputs a photo-realistic image given a\ntext of any language along with a textual description of a scene. The model\nleverages rendered sketch images as priors, thus arousing the potential\nmultilingual-generation ability of the pre-trained Stable Diffusion. Based on\nthe observation from the influence of the cross-attention map on object\nplacement in generated images, we propose a localized attention constraint into\nthe cross-attention layer to address the unreasonable positioning problem of\nscene text. Additionally, we introduce contrastive image-level prompts to\nfurther refine the position of the textual region and achieve more accurate\nscene text generation. Experiments demonstrate that our method outperforms the\nexisting method in both the accuracy of text recognition and the naturalness of\nforeground-background blending.\n","authors":["Lingjun Zhang","Xinyuan Chen","Yaohui Wang","Yue Lu","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2312.12232v1.pdf","comment":"Accepted to AAAI 2024. Code:\n https://github.com/ecnuljzhang/brush-your-text"},{"id":"http://arxiv.org/abs/2303.14027v3","updated":"2023-12-19T15:15:50Z","published":"2023-03-24T14:37:07Z","title":"Poincaré ResNet","summary":" This paper introduces an end-to-end residual network that operates entirely\non the Poincar\\'e ball model of hyperbolic space. Hyperbolic learning has\nrecently shown great potential for visual understanding, but is currently only\nperformed in the penultimate layer(s) of deep networks. All visual\nrepresentations are still learned through standard Euclidean networks. In this\npaper we investigate how to learn hyperbolic representations of visual data\ndirectly from the pixel-level. We propose Poincar\\'e ResNet, a hyperbolic\ncounterpart of the celebrated residual network, starting from Poincar\\'e 2D\nconvolutions up to Poincar\\'e residual connections. We identify three\nroadblocks for training convolutional networks entirely in hyperbolic space and\npropose a solution for each: (i) Current hyperbolic network initializations\ncollapse to the origin, limiting their applicability in deeper networks. We\nprovide an identity-based initialization that preserves norms over many layers.\n(ii) Residual networks rely heavily on batch normalization, which comes with\nexpensive Fr\\'echet mean calculations in hyperbolic space. We introduce\nPoincar\\'e midpoint batch normalization as a faster and equally effective\nalternative. (iii) Due to the many intermediate operations in Poincar\\'e\nlayers, we lastly find that the computation graphs of deep learning libraries\nblow up, limiting our ability to train on deep hyperbolic networks. We provide\nmanual backward derivations of core hyperbolic operations to maintain\nmanageable computation graphs.\n","authors":["Max van Spengler","Erwin Berkhout","Pascal Mettes"],"pdf_url":"https://arxiv.org/pdf/2303.14027v3.pdf","comment":"International Conference on Computer Vision 2023"},{"id":"http://arxiv.org/abs/2312.12227v1","updated":"2023-12-19T15:13:08Z","published":"2023-12-19T15:13:08Z","title":"HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models\n with Minimal Feedback","summary":" We introduce HuTuMotion, an innovative approach for generating natural human\nmotions that navigates latent motion diffusion models by leveraging few-shot\nhuman feedback. Unlike existing approaches that sample latent variables from a\nstandard normal prior distribution, our method adapts the prior distribution to\nbetter suit the characteristics of the data, as indicated by human feedback,\nthus enhancing the quality of motion generation. Furthermore, our findings\nreveal that utilizing few-shot feedback can yield performance levels on par\nwith those attained through extensive human feedback. This discovery emphasizes\nthe potential and efficiency of incorporating few-shot human-guided\noptimization within latent diffusion models for personalized and style-aware\nhuman motion generation applications. The experimental results show the\nsignificantly superior performance of our method over existing state-of-the-art\napproaches.\n","authors":["Gaoge Han","Shaoli Huang","Mingming Gong","Jinglei Tang"],"pdf_url":"https://arxiv.org/pdf/2312.12227v1.pdf","comment":"Accepted by AAAI 2024 Main Track"},{"id":"http://arxiv.org/abs/2312.12223v1","updated":"2023-12-19T15:11:46Z","published":"2023-12-19T15:11:46Z","title":"Self-Supervised Detection of Perfect and Partial Input-Dependent\n Symmetries","summary":" Group equivariance ensures consistent responses to group transformations of\nthe input, leading to more robust models and enhanced generalization\ncapabilities. However, this property can lead to overly constrained models if\nthe symmetries considered in the group differ from those observed in data.\nWhile common methods address this by determining the appropriate level of\nsymmetry at the dataset level, they are limited to supervised settings and\nignore scenarios in which multiple levels of symmetry co-exist in the same\ndataset. For instance, pictures of cars and planes exhibit different levels of\nrotation, yet both are included in the CIFAR-10 dataset. In this paper, we\npropose a method able to detect the level of symmetry of each input without the\nneed for labels. To this end, we derive a sufficient and necessary condition to\nlearn the distribution of symmetries in the data. Using the learned\ndistribution, we generate pseudo-labels that allow us to learn the levels of\nsymmetry of each input in a self-supervised manner. We validate the\neffectiveness of our approach on synthetic datasets with different per-class\nlevels of symmetries e.g. MNISTMultiple, in which digits are uniformly rotated\nwithin a class-dependent interval. We demonstrate that our method can be used\nfor practical applications such as the generation of standardized datasets in\nwhich the symmetries are not present, as well as the detection of\nout-of-distribution symmetries during inference. By doing so, both the\ngeneralization and robustness of non-equivariant models can be improved. Our\ncode is publicly available at https://github.com/aurban0/ssl-sym.\n","authors":["Alonso Urbano","David W. Romero"],"pdf_url":"https://arxiv.org/pdf/2312.12223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12222v1","updated":"2023-12-19T15:11:32Z","published":"2023-12-19T15:11:32Z","title":"EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote\n Sensing Visual Question Answering","summary":" Earth vision research typically focuses on extracting geospatial object\nlocations and categories but neglects the exploration of relations between\nobjects and comprehensive reasoning. Based on city planning needs, we develop a\nmulti-modal multi-task VQA dataset (EarthVQA) to advance relational\nreasoning-based judging, counting, and comprehensive analysis. The EarthVQA\ndataset contains 6000 images, corresponding semantic masks, and 208,593 QA\npairs with urban and rural governance requirements embedded. As objects are the\nbasis for complex relational reasoning, we propose a Semantic OBject Awareness\nframework (SOBA) to advance VQA in an object-centric way. To preserve refined\nspatial locations and semantics, SOBA leverages a segmentation network for\nobject semantics generation. The object-guided attention aggregates object\ninterior features via pseudo masks, and bidirectional cross-attention further\nmodels object external relations hierarchically. To optimize object counting,\nwe propose a numerical difference loss that dynamically adds difference\npenalties, unifying the classification and regression tasks. Experimental\nresults show that SOBA outperforms both advanced general and remote sensing\nmethods. We believe this dataset and framework provide a strong benchmark for\nEarth vision's complex analysis. The project page is at\nhttps://Junjue-Wang.github.io/homepage/EarthVQA.\n","authors":["Junjue Wang","Zhuo Zheng","Zihang Chen","Ailong Ma","Yanfei Zhong"],"pdf_url":"https://arxiv.org/pdf/2312.12222v1.pdf","comment":"Accepted By AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12198v1","updated":"2023-12-19T14:34:36Z","published":"2023-12-19T14:34:36Z","title":"Mask Grounding for Referring Image Segmentation","summary":" Referring Image Segmentation (RIS) is a challenging task that requires an\nalgorithm to segment objects referred by free-form language expressions.\nDespite significant progress in recent years, most state-of-the-art (SOTA)\nmethods still suffer from considerable language-image modality gap at the pixel\nand word level. These methods generally 1) rely on sentence-level language\nfeatures for language-image alignment and 2) lack explicit training supervision\nfor fine-grained visual grounding. Consequently, they exhibit weak object-level\ncorrespondence between visual and language features. Without well-grounded\nfeatures, prior methods struggle to understand complex expressions that require\nstrong reasoning over relationships among multiple objects, especially when\ndealing with rarely used or ambiguous clauses. To tackle this challenge, we\nintroduce a novel Mask Grounding auxiliary task that significantly improves\nvisual grounding within language features, by explicitly teaching the model to\nlearn fine-grained correspondence between masked textual tokens and their\nmatching visual objects. Mask Grounding can be directly used on prior RIS\nmethods and consistently bring improvements. Furthermore, to holistically\naddress the modality gap, we also design a cross-modal alignment loss and an\naccompanying alignment module. These additions work synergistically with Mask\nGrounding. With all these techniques, our comprehensive approach culminates in\nMagNet Mask-grounded Network), an architecture that significantly outperforms\nprior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating\nour method's effectiveness in addressing current limitations of RIS algorithms.\nOur code and pre-trained weights will be released.\n","authors":["Yong Xien Chng","Henry Zheng","Yizeng Han","Xuchong Qiu","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2312.12198v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.06962v2","updated":"2023-12-19T14:29:58Z","published":"2023-08-14T06:32:54Z","title":"Color-NeuS: Reconstructing Neural Implicit Surfaces with Color","summary":" The reconstruction of object surfaces from multi-view images or monocular\nvideo is a fundamental issue in computer vision. However, much of the recent\nresearch concentrates on reconstructing geometry through implicit or explicit\nmethods. In this paper, we shift our focus towards reconstructing mesh in\nconjunction with color. We remove the view-dependent color from neural volume\nrendering while retaining volume rendering performance through a relighting\nnetwork. Mesh is extracted from the signed distance function (SDF) network for\nthe surface, and color for each surface vertex is drawn from the global color\nnetwork. To evaluate our approach, we conceived a in hand object scanning task\nfeaturing numerous occlusions and dramatic shifts in lighting conditions. We've\ngathered several videos for this task, and the results surpass those of any\nexisting methods capable of reconstructing mesh alongside color. Additionally,\nour method's performance was assessed using public datasets, including DTU,\nBlendedMVS, and OmniObject3D. The results indicated that our method performs\nwell across all these datasets. Project page:\nhttps://colmar-zlicheng.github.io/color_neus.\n","authors":["Licheng Zhong","Lixin Yang","Kailin Li","Haoyu Zhen","Mei Han","Cewu Lu"],"pdf_url":"https://arxiv.org/pdf/2308.06962v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12189v1","updated":"2023-12-19T14:23:47Z","published":"2023-12-19T14:23:47Z","title":"Teeth Localization and Lesion Segmentation in CBCT Images using\n SpatialConfiguration-Net and U-Net","summary":" The localization of teeth and segmentation of periapical lesions in cone-beam\ncomputed tomography (CBCT) images are crucial tasks for clinical diagnosis and\ntreatment planning, which are often time-consuming and require a high level of\nexpertise. However, automating these tasks is challenging due to variations in\nshape, size, and orientation of lesions, as well as similar topologies among\nteeth. Moreover, the small volumes occupied by lesions in CBCT images pose a\nclass imbalance problem that needs to be addressed. In this study, we propose a\ndeep learning-based method utilizing two convolutional neural networks: the\nSpatialConfiguration-Net (SCN) and a modified version of the U-Net. The SCN\naccurately predicts the coordinates of all teeth present in an image, enabling\nprecise cropping of teeth volumes that are then fed into the U-Net which\ndetects lesions via segmentation. To address class imbalance, we compare the\nperformance of three reweighting loss functions. After evaluation on 144 CBCT\nimages, our method achieves a 97.3% accuracy for teeth localization, along with\na promising sensitivity and specificity of 0.97 and 0.88, respectively, for\nsubsequent lesion detection.\n","authors":["Arnela Hadzic","Barbara Kirnbauer","Darko Stern","Martin Urschler"],"pdf_url":"https://arxiv.org/pdf/2312.12189v1.pdf","comment":"Accepted for VISIGRAPP 2024 (Track: VISAPP), 8 pages"},{"id":"http://arxiv.org/abs/2312.10447v2","updated":"2023-12-19T14:22:43Z","published":"2023-12-16T13:39:52Z","title":"Finger Biometric Recognition With Feature Selection","summary":" Biometrics is indispensable in this modern digital era for secure automated\nhuman authentication in various fields of machine learning and pattern\nrecognition. Hand geometry is a promising physiological biometric trait with\nample deployed application areas for identity verification. Due to the\nintricate anatomic foundation of the thumb and substantial inter-finger posture\nvariation, satisfactory performances cannot be achieved while the thumb is\nincluded in the contact-free environment. To overcome the hindrances associated\nwith the thumb, four finger-based (excluding the thumb) biometric approaches\nhave been devised. In this chapter, a four-finger based biometric method has\nbeen presented. Again, selection of salient features is essential to reduce the\nfeature dimensionality by eliminating the insignificant features. Weights are\nassigned according to the discriminative efficiency of the features to\nemphasize on the essential features. Two different strategies namely, the\nglobal and local feature selection methods are adopted based on the adaptive\nforward-selection and backward-elimination (FoBa) algorithm. The identification\nperformances are evaluated using the weighted k-nearest neighbor (wk-NN) and\nrandom forest (RF) classifiers. The experiments are conducted using the\nselected feature subsets over the 300 subjects of the Bosphorus hand database.\nThe best identification accuracy of 98.67%, and equal error rate (EER) of 4.6%\nhave been achieved using the subset of 25 features which are selected by the\nrank-based local FoBa algorithm.\n","authors":["Asish Bera","Debotosh Bhattacharjee","Mita Nasipuri"],"pdf_url":"https://arxiv.org/pdf/2312.10447v2.pdf","comment":"34 pages. The Biometric Computing: Recognition and Registration, 2019"},{"id":"http://arxiv.org/abs/2305.18072v2","updated":"2023-12-19T14:17:57Z","published":"2023-05-29T13:18:59Z","title":"Image Captioning with Multi-Context Synthetic Data","summary":" Image captioning requires numerous annotated image-text pairs, resulting in\nsubstantial annotation costs. Recently, large models (e.g. diffusion models and\nlarge language models) have excelled in producing high-quality images and text.\nThis potential can be harnessed to create synthetic image-text pairs for\ntraining captioning models. Synthetic data can improve cost and time efficiency\nin data collection, allow for customization to specific domains, bootstrap\ngeneralization capability for zero-shot performance, and circumvent privacy\nconcerns associated with real-world data. However, existing methods struggle to\nattain satisfactory performance solely through synthetic data. We identify the\nissue as generated images from simple descriptions mostly capture a solitary\nperspective with limited context, failing to align with the intricate scenes\nprevalent in real-world imagery. To tackle this, we present an innovative\npipeline that introduces multi-context data generation. Beginning with an\ninitial text corpus, our approach employs a large language model to extract\nmultiple sentences portraying the same scene from diverse viewpoints. These\nsentences are then condensed into a single sentence with multiple contexts.\nSubsequently, we generate intricate images using the condensed captions through\ndiffusion models. Our model is exclusively trained on synthetic image-text\npairs crafted through this process. The effectiveness of our pipeline is\nvalidated through experimental results in both the in-domain and cross-domain\nsettings, where it achieves state-of-the-art performance on well-known datasets\nsuch as MSCOCO, Flickr30k, and NoCaps.\n","authors":["Feipeng Ma","Yizhou Zhou","Fengyun Rao","Yueyi Zhang","Xiaoyan Sun"],"pdf_url":"https://arxiv.org/pdf/2305.18072v2.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2309.15289v4","updated":"2023-12-19T14:14:22Z","published":"2023-09-26T21:56:03Z","title":"SEPT: Towards Efficient Scene Representation Learning for Motion\n Prediction","summary":" Motion prediction is crucial for autonomous vehicles to operate safely in\ncomplex traffic environments. Extracting effective spatiotemporal relationships\namong traffic elements is key to accurate forecasting. Inspired by the\nsuccessful practice of pretrained large language models, this paper presents\nSEPT, a modeling framework that leverages self-supervised learning to develop\npowerful spatiotemporal understanding for complex traffic scenes. Specifically,\nour approach involves three masking-reconstruction modeling tasks on scene\ninputs including agents' trajectories and road network, pretraining the scene\nencoder to capture kinematics within trajectory, spatial structure of road\nnetwork, and interactions among roads and agents. The pretrained encoder is\nthen finetuned on the downstream forecasting task. Extensive experiments\ndemonstrate that SEPT, without elaborate architectural design or manual feature\nengineering, achieves state-of-the-art performance on the Argoverse 1 and\nArgoverse 2 motion forecasting benchmarks, outperforming previous methods on\nall main metrics by a large margin.\n","authors":["Zhiqian Lan","Yuxuan Jiang","Yao Mu","Chen Chen","Shengbo Eben Li"],"pdf_url":"https://arxiv.org/pdf/2309.15289v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12176v1","updated":"2023-12-19T14:09:12Z","published":"2023-12-19T14:09:12Z","title":"All for One, and One for All: UrbanSyn Dataset, the third Musketeer of\n Synthetic Driving Scenes","summary":" We introduce UrbanSyn, a photorealistic dataset acquired through\nsemi-procedurally generated synthetic urban driving scenarios. Developed using\nhigh-quality geometry and materials, UrbanSyn provides pixel-level ground\ntruth, including depth, semantic segmentation, and instance segmentation with\nobject bounding boxes and occlusion degree. It complements GTAV and Synscapes\ndatasets to form what we coin as the 'Three Musketeers'. We demonstrate the\nvalue of the Three Musketeers in unsupervised domain adaptation for image\nsemantic segmentation. Results on real-world datasets, Cityscapes, Mapillary\nVistas, and BDD100K, establish new benchmarks, largely attributed to UrbanSyn.\nWe make UrbanSyn openly and freely accessible (www.urbansyn.org).\n","authors":["Jose L. Gómez","Manuel Silva","Antonio Seoane","Agnès Borrás","Mario Noriega","Germán Ros","Jose A. Iglesias-Guitian","Antonio M. López"],"pdf_url":"https://arxiv.org/pdf/2312.12176v1.pdf","comment":"The UrbanSyn Dataset is available in http://urbansyn.org/"},{"id":"http://arxiv.org/abs/2311.17280v3","updated":"2023-12-19T14:04:33Z","published":"2023-11-28T23:40:13Z","title":"Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?","summary":" Data augmentation via back-translation is common when pretraining\nVision-and-Language Navigation (VLN) models, even though the generated\ninstructions are noisy. But: does that noise matter? We find that nonsensical\nor irrelevant language instructions during pretraining can have little effect\non downstream performance for both HAMT and VLN-BERT on R2R, and is still\nbetter than only using clean, human data. To underscore these results, we\nconcoct an efficient augmentation method, Unigram + Object, which generates\nnonsensical instructions that nonetheless improve downstream performance. Our\nfindings suggest that what matters for VLN R2R pretraining is the quantity of\nvisual trajectories, not the quality of instructions.\n","authors":["Wang Zhu","Ishika Singh","Yuan Huang","Robin Jia","Jesse Thomason"],"pdf_url":"https://arxiv.org/pdf/2311.17280v3.pdf","comment":"Accepted by O-DRUM @ CVPR 2023"},{"id":"http://arxiv.org/abs/2312.10656v2","updated":"2023-12-19T13:54:15Z","published":"2023-12-17T09:05:56Z","title":"VidToMe: Video Token Merging for Zero-Shot Video Editing","summary":" Diffusion models have made significant advances in generating high-quality\nimages, but their application to video generation has remained challenging due\nto the complexity of temporal motion. Zero-shot video editing offers a solution\nby utilizing pre-trained image diffusion models to translate source videos into\nnew ones. Nevertheless, existing methods struggle to maintain strict temporal\nconsistency and efficient memory consumption. In this work, we propose a novel\napproach to enhance temporal consistency in generated videos by merging\nself-attention tokens across frames. By aligning and compressing temporally\nredundant tokens across frames, our method improves temporal coherence and\nreduces memory consumption in self-attention computations. The merging strategy\nmatches and aligns tokens according to the temporal correspondence between\nframes, facilitating natural temporal consistency in generated video frames. To\nmanage the complexity of video processing, we divide videos into chunks and\ndevelop intra-chunk local token merging and inter-chunk global token merging,\nensuring both short-term video continuity and long-term content consistency.\nOur video editing approach seamlessly extends the advancements in image editing\nto video editing, rendering favorable results in temporal consistency over\nstate-of-the-art methods.\n","authors":["Xirui Li","Chao Ma","Xiaokang Yang","Ming-Hsuan Yang"],"pdf_url":"https://arxiv.org/pdf/2312.10656v2.pdf","comment":"Project page: https://vidtome-diffusion.github.io"},{"id":"http://arxiv.org/abs/2312.12155v1","updated":"2023-12-19T13:38:48Z","published":"2023-12-19T13:38:48Z","title":"Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video\n Moment Retrieval","summary":" Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed\nvideos corresponding to a given language query by constructing cross-modal\nalignment strategies. However, these existing strategies are often sub-optimal\nsince they ignore the modality imbalance problem, \\textit{i.e.}, the semantic\nrichness inherent in videos far exceeds that of a given limited-length\nsentence. Therefore, in pursuit of better alignment, a natural idea is\nenhancing the video modality to filter out query-irrelevant semantics, and\nenhancing the text modality to capture more segment-relevant knowledge. In this\npaper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework\nfor more balanced alignment through enhancing features at two levels. First, we\nenhance the video modality at the frame-word level through word reconstruction.\nThis strategy emphasizes the portions associated with query words in\nframe-level features while suppressing irrelevant parts. Therefore, the\nenhanced video contains less redundant semantics and is more balanced with the\ntextual modality. Second, we enhance the textual modality at the\nsegment-sentence level by learning complementary knowledge from context\nsentences and ground-truth segments. With the knowledge added to the query, the\ntextual modality thus maintains more meaningful semantics and is more balanced\nwith the video modality. By implementing two levels of MESM, the semantic\ninformation from both modalities is more balanced to align, thereby bridging\nthe modality gap. Experiments on three widely used benchmarks, including the\nout-of-distribution settings, show that the proposed framework achieves a new\nstart-of-the-art performance with notable generalization ability (e.g., 4.42%\nand 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code\nwill be available at https://github.com/lntzm/MESM.\n","authors":["Zhihang Liu","Jun Li","Hongtao Xie","Pandeng Li","Jiannan Ge","Sun-Ao Liu","Guoqing Jin"],"pdf_url":"https://arxiv.org/pdf/2312.12155v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12151v1","updated":"2023-12-19T13:33:59Z","published":"2023-12-19T13:33:59Z","title":"SoftCTM: Cell detection by soft instance segmentation and consideration\n of cell-tissue interaction","summary":" Detecting and classifying cells in histopathology H\\&E stained whole-slide\nimages is a core task in computational pathology, as it provides valuable\ninsight into the tumor microenvironment. In this work we investigate the impact\nof ground truth formats on the models performance. Additionally, cell-tissue\ninteractions are considered by providing tissue segmentation predictions as\ninput to the cell detection model. We find that a \"soft\", probability-map\ninstance segmentation ground truth leads to best model performance. Combined\nwith cell-tissue interaction and test-time augmentation our Soft\nCell-Tissue-Model (SoftCTM) achieves 0.7172 mean F1-Score on the Overlapped\nCell On Tissue (OCELOT) test set, achieving the third best overall score in the\nOCELOT 2023 Challenge. The source code for our approach is made publicly\navailable at https://github.com/lely475/ocelot23algo.\n","authors":["Lydia A. Schoenpflug","Viktor H. Koelzer"],"pdf_url":"https://arxiv.org/pdf/2312.12151v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12144v1","updated":"2023-12-19T13:25:45Z","published":"2023-12-19T13:25:45Z","title":"M-BEV: Masked BEV Perception for Robust Autonomous Driving","summary":" 3D perception is a critical problem in autonomous driving. Recently, the\nBird-Eye-View (BEV) approach has attracted extensive attention, due to low-cost\ndeployment and desirable vision detection capacity. However, the existing\nmodels ignore a realistic scenario during the driving procedure, i.e., one or\nmore view cameras may be failed, which largely deteriorates the performance. To\ntackle this problem, we propose a generic Masked BEV (M-BEV) perception\nframework, which can effectively improve robustness to this challenging\nscenario, by random masking and reconstructing camera views in the end-to-end\ntraining. More specifically, we develop a novel Masked View Reconstruction\n(MVR) module for M-BEV. It mimics various missing cases by randomly masking\nfeatures of different camera views, then leverages the original features of\nthese views as self-supervision, and reconstructs the masked ones with the\ndistinct spatio-temporal context across views. Via such a plug-and-play MVR,\nour M-BEV is capable of learning the missing views from the resting ones, and\nthus well generalized for robust view recovery and accurate perception in the\ntesting. We perform extensive experiments on the popular NuScenes benchmark,\nwhere our framework can significantly boost 3D perception performance of the\nstate-of-the-art models on various missing view cases, e.g., for the absence of\nback view, our M-BEV promotes the PETRv2 model with 10.3% mAP gain.\n","authors":["Siran Chen","Yue Ma","Yu Qiao","Yali Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12144v1.pdf","comment":"Github repository: https://github.com/Sranc3/M-BEV"},{"id":"http://arxiv.org/abs/2312.12143v1","updated":"2023-12-19T13:23:49Z","published":"2023-12-19T13:23:49Z","title":"Integrating Human Vision Perception in Vision Transformers for\n Classifying Waste Items","summary":" In this paper, we propose an novel methodology aimed at simulating the\nlearning phenomenon of nystagmus through the application of differential\nblurring on datasets. Nystagmus is a biological phenomenon that influences\nhuman vision throughout life, notably by diminishing head shake from infancy to\nadulthood. Leveraging this concept, we address the issue of waste\nclassification, a pressing global concern. The proposed framework comprises two\nmodules, with the second module closely resembling the original Vision\nTransformer, a state of the art model model in classification tasks. The\nprimary motivation behind our approach is to enhance the model's precision and\nadaptability, mirroring the real world conditions that the human visual system\nundergoes. This novel methodology surpasses the standard Vision Transformer\nmodel in waste classification tasks, exhibiting an improvement with a margin of\n2%. This improvement underscores the potential of our methodology in improving\nmodel precision by drawing inspiration from human vision perception. Further\nresearch in the proposed methodology could yield greater performance results,\nand can extrapolated to other global tasks.\n","authors":["Akshat Kishore Shrivastava","Tapan Kumar Gandhi"],"pdf_url":"https://arxiv.org/pdf/2312.12143v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.12142v1","updated":"2023-12-19T13:23:20Z","published":"2023-12-19T13:23:20Z","title":"FontDiffuser: One-Shot Font Generation via Denoising Diffusion with\n Multi-Scale Content Aggregation and Style Contrastive Learning","summary":" Automatic font generation is an imitation task, which aims to create a font\nlibrary that mimics the style of reference images while preserving the content\nfrom source images. Although existing font generation methods have achieved\nsatisfactory performance, they still struggle with complex characters and large\nstyle variations. To address these issues, we propose FontDiffuser, a\ndiffusion-based image-to-image one-shot font generation method, which\ninnovatively models the font imitation task as a noise-to-denoise paradigm. In\nour method, we introduce a Multi-scale Content Aggregation (MCA) block, which\neffectively combines global and local content cues across different scales,\nleading to enhanced preservation of intricate strokes of complex characters.\nMoreover, to better manage the large variations in style transfer, we propose a\nStyle Contrastive Refinement (SCR) module, which is a novel structure for style\nrepresentation learning. It utilizes a style extractor to disentangle styles\nfrom images, subsequently supervising the diffusion model via a meticulously\ndesigned style contrastive loss. Extensive experiments demonstrate\nFontDiffuser's state-of-the-art performance in generating diverse characters\nand styles. It consistently excels on complex characters and large style\nchanges compared to previous methods. The code is available at\nhttps://github.com/yeungchenwa/FontDiffuser.\n","authors":["Zhenhua Yang","Dezhi Peng","Yuxin Kong","Yuyi Zhang","Cong Yao","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2312.12142v1.pdf","comment":"Accepted to AAAI 2024; Github Page:\n https://github.com/yeungchenwa/FontDiffuser"},{"id":"http://arxiv.org/abs/2203.03005v5","updated":"2023-12-19T13:15:00Z","published":"2022-03-06T16:52:28Z","title":"Self-Supervised Face Image Restoration with a One-Shot Reference","summary":" For image restoration, methods leveraging priors from generative models have\nbeen proposed and demonstrated a promising capacity to robustly restore\nphotorealistic and high-quality results. However, these methods are susceptible\nto semantic ambiguity, particularly with images that have obviously correct\nsemantics such as facial images. In this paper, we propose a semantic-aware\nlatent space exploration method for image restoration (SAIR). By explicitly\nmodeling semantics information from a given reference image, SAIR is able to\nreliably restore severely degraded images not only to high-resolution and\nhighly realistic looks but also to correct semantics. Quantitative and\nqualitative experiments collectively demonstrate the superior performance of\nthe proposed SAIR. Our code is available at https://github.com/Liamkuo/SAIR.\n","authors":["Yanhui Guo","Fangzhou Luo","Shaoyuan Xu"],"pdf_url":"https://arxiv.org/pdf/2203.03005v5.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.12135v1","updated":"2023-12-19T13:14:52Z","published":"2023-12-19T13:14:52Z","title":"Object Detection for Automated Coronary Artery Using Deep Learning","summary":" In the era of digital medicine, medical imaging serves as a widespread\ntechnique for early disease detection, with a substantial volume of images\nbeing generated and stored daily in electronic patient records. X-ray\nangiography imaging is a standard and one of the most common methods for\nrapidly diagnosing coronary artery diseases. The notable achievements of recent\ndeep learning algorithms align with the increased use of electronic health\nrecords and diagnostic imaging. Deep neural networks, leveraging abundant data,\nadvanced algorithms, and powerful computational capabilities, prove highly\neffective in the analysis and interpretation of images. In this context, Object\ndetection methods have become a promising approach, particularly through\nconvolutional neural networks (CNN), streamlining medical image analysis by\neliminating manual feature extraction. This allows for direct feature\nextraction from images, ensuring high accuracy in results. Therefore, in our\npaper, we utilized the object detection method on X-ray angiography images to\nprecisely identify the location of coronary artery stenosis. As a result, this\nmodel enables automatic and real-time detection of stenosis locations,\nassisting in the crucial and sensitive decision-making process for healthcare\nprofessionals.\n","authors":["Hadis Keshavarz","Hossein Sadr"],"pdf_url":"https://arxiv.org/pdf/2312.12135v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12133v1","updated":"2023-12-19T13:11:35Z","published":"2023-12-19T13:11:35Z","title":"Object-Aware Domain Generalization for Object Detection","summary":" Single-domain generalization (S-DG) aims to generalize a model to unseen\nenvironments with a single-source domain. However, most S-DG approaches have\nbeen conducted in the field of classification. When these approaches are\napplied to object detection, the semantic features of some objects can be\ndamaged, which can lead to imprecise object localization and misclassification.\nTo address these problems, we propose an object-aware domain generalization\n(OA-DG) method for single-domain generalization in object detection. Our method\nconsists of data augmentation and training strategy, which are called OA-Mix\nand OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level\ntransformation and object-aware mixing strategy. OA-Loss enables models to\nlearn domain-invariant representations for objects and backgrounds from the\noriginal and OA-Mixed images. Our proposed method outperforms state-of-the-art\nworks on standard benchmarks. Our code is available at\nhttps://github.com/WoojuLee24/OA-DG.\n","authors":["Wooju Lee","Dasol Hong","Hyungtae Lim","Hyun Myung"],"pdf_url":"https://arxiv.org/pdf/2312.12133v1.pdf","comment":"Accepted by AAAI-24. The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.00757v2","updated":"2023-12-19T13:10:23Z","published":"2023-10-01T18:27:59Z","title":"Mind the Gap: Federated Learning Broadens Domain Generalization in\n Diagnostic AI Models","summary":" Developing robust artificial intelligence (AI) models that generalize well to\nunseen datasets is challenging and usually requires large and variable\ndatasets, preferably from multiple institutions. In federated learning (FL), a\nmodel is trained collaboratively at numerous sites that hold local datasets\nwithout exchanging them. So far, the impact of training strategy, i.e., local\nversus collaborative, on the diagnostic on-domain and off-domain performance of\nAI models interpreting chest radiographs has not been assessed. Consequently,\nusing 610,000 chest radiographs from five institutions across the globe, we\nassessed diagnostic performance as a function of training strategy (i.e., local\nvs. collaborative), network architecture (i.e., convolutional vs.\ntransformer-based), generalization performance (i.e., on-domain vs.\noff-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia,\natelectasis, consolidation, pneumothorax, and no abnormality), dataset size\n(i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large\ndatasets not only showed minimal performance gains with FL but, in some\ninstances, even exhibited decreases. In contrast, smaller datasets revealed\nmarked improvements. Thus, on-domain performance was mainly driven by training\ndata size. However, off-domain performance leaned more on training diversity.\nWhen trained collaboratively across diverse external institutions, AI models\nconsistently surpassed models trained locally for off-domain tasks, emphasizing\nFL's potential in leveraging data diversity. In conclusion, FL can bolster\ndiagnostic privacy, reproducibility, and off-domain reliability of AI models\nand, potentially, optimize healthcare outcomes.\n","authors":["Soroosh Tayebi Arasteh","Christiane Kuhl","Marwin-Jonathan Saehn","Peter Isfort","Daniel Truhn","Sven Nebelung"],"pdf_url":"https://arxiv.org/pdf/2310.00757v2.pdf","comment":"Published in Nature Scientific Reports"},{"id":"http://arxiv.org/abs/2302.04977v3","updated":"2023-12-19T13:05:06Z","published":"2023-02-09T23:34:17Z","title":"Mithridates: Auditing and Boosting Backdoor Resistance of Machine\n Learning Pipelines","summary":" Machine learning (ML) models trained on data from potentially untrusted\nsources are vulnerable to poisoning. A small, maliciously crafted subset of the\ntraining inputs can cause the model to learn a \"backdoor\" task (e.g.,\nmisclassify inputs with a certain feature) in addition to its main task. Recent\nresearch proposed many hypothetical backdoor attacks whose efficacy heavily\ndepends on the configuration and training hyperparameters of the target model.\n Given the variety of potential backdoor attacks, ML engineers who are not\nsecurity experts have no way to measure how vulnerable their current training\npipelines are, nor do they have a practical way to compare training\nconfigurations so as to pick the more resistant ones. Deploying a defense\nrequires evaluating and choosing from among dozens of research papers and\nre-engineering the training pipeline.\n In this paper, we aim to provide ML engineers with pragmatic tools to audit\nthe backdoor resistance of their training pipelines and to compare different\ntraining configurations, to help choose one that best balances accuracy and\nsecurity.\n First, we propose a universal, attack-agnostic resistance metric based on the\nminimum number of training inputs that must be compromised before the model\nlearns any backdoor.\n Second, we design, implement, and evaluate Mithridates a multi-stage approach\nthat integrates backdoor resistance into the training-configuration search. ML\ndevelopers already rely on hyperparameter search to find configurations that\nmaximize the model's accuracy. Mithridates extends this standard tool to\nbalance accuracy and resistance without disruptive changes to the training\npipeline. We show that hyperparameters found by Mithridates increase resistance\nto multiple types of backdoor attacks by 3-5x with only a slight impact on\naccuracy. We also discuss extensions to AutoML and federated learning.\n","authors":["Eugene Bagdasaryan","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2302.04977v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.16557v2","updated":"2023-12-19T12:58:02Z","published":"2022-03-30T18:00:07Z","title":"COSMOS: Cross-Modality Unsupervised Domain Adaptation for 3D Medical\n Image Segmentation based on Target-aware Domain Translation and Iterative\n Self-Training","summary":" Recent advances in deep learning-based medical image segmentation studies\nachieve nearly human-level performance when in fully supervised condition.\nHowever, acquiring pixel-level expert annotations is extremely expensive and\nlaborious in medical imaging fields. Unsupervised domain adaptation can\nalleviate this problem, which makes it possible to use annotated data in one\nimaging modality to train a network that can successfully perform segmentation\non target imaging modality with no labels. In this work, we propose a\nself-training based unsupervised domain adaptation framework for 3D medical\nimage segmentation named COSMOS and validate it with automatic segmentation of\nVestibular Schwannoma (VS) and cochlea on high-resolution T2 Magnetic Resonance\nImages (MRI). Our target-aware contrast conversion network translates source\ndomain annotated T1 MRI to pseudo T2 MRI to enable segmentation training on\ntarget domain, while preserving important anatomical features of interest in\nthe converted images. Iterative self-training is followed to incorporate\nunlabeled data to training and incrementally improve the quality of\npseudo-labels, thereby leading to improved performance of segmentation. COSMOS\nwon the 1\\textsuperscript{st} place in the Cross-Modality Domain Adaptation\n(crossMoDA) challenge held in conjunction with the 24th International\nConference on Medical Image Computing and Computer Assisted Intervention\n(MICCAI 2021). It achieves mean Dice score and Average Symmetric Surface\nDistance of 0.871(0.063) and 0.437(0.270) for VS, and 0.842(0.020) and\n0.152(0.030) for cochlea.\n","authors":["Hyungseob Shin","Hyeongyu Kim","Sewon Kim","Yohan Jun","Taejoon Eo","Dosik Hwang"],"pdf_url":"https://arxiv.org/pdf/2203.16557v2.pdf","comment":"10 pages, 6 figures, MICCAI 2021 Cross-Modality Domain Adaptation\n (crossMoDA) Challenge"},{"id":"http://arxiv.org/abs/2312.12122v1","updated":"2023-12-19T12:54:54Z","published":"2023-12-19T12:54:54Z","title":"ZS-SRT: An Efficient Zero-Shot Super-Resolution Training Method for\n Neural Radiance Fields","summary":" Neural Radiance Fields (NeRF) have achieved great success in the task of\nsynthesizing novel views that preserve the same resolution as the training\nviews. However, it is challenging for NeRF to synthesize high-quality\nhigh-resolution novel views with low-resolution training data. To solve this\nproblem, we propose a zero-shot super-resolution training framework for NeRF.\nThis framework aims to guide the NeRF model to synthesize high-resolution novel\nviews via single-scene internal learning rather than requiring any external\nhigh-resolution training data. Our approach consists of two stages. First, we\nlearn a scene-specific degradation mapping by performing internal learning on a\npretrained low-resolution coarse NeRF. Second, we optimize a super-resolution\nfine NeRF by conducting inverse rendering with our mapping function so as to\nbackpropagate the gradients from low-resolution 2D space into the\nsuper-resolution 3D sampling space. Then, we further introduce a temporal\nensemble strategy in the inference phase to compensate for the scene estimation\nerrors. Our method is featured on two points: (1) it does not consume\nhigh-resolution views or additional scene data to train super-resolution NeRF;\n(2) it can speed up the training process by adopting a coarse-to-fine strategy.\nBy conducting extensive experiments on public datasets, we have qualitatively\nand quantitatively demonstrated the effectiveness of our method.\n","authors":["Xiang Feng","Yongbo He","Yubo Wang","Chengkai Wang","Zhenzhong Kuang","Jiajun Ding","Feiwei Qin","Jun Yu","Jianping Fan"],"pdf_url":"https://arxiv.org/pdf/2312.12122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06171v2","updated":"2023-12-19T12:36:47Z","published":"2023-12-11T07:20:42Z","title":"Jointly Explicit and Implicit Cross-Modal Interaction Network for\n Anterior Chamber Inflammation Diagnosis","summary":" Uveitis demands the precise diagnosis of anterior chamber inflammation (ACI)\nfor optimal treatment. However, current diagnostic methods only rely on a\nlimited single-modal disease perspective, which leads to poor performance. In\nthis paper, we investigate a promising yet challenging way to fuse multimodal\ndata for ACI diagnosis. Notably, existing fusion paradigms focus on empowering\nimplicit modality interactions (i.e., self-attention and its variants), but\nneglect to inject explicit modality interactions, especially from clinical\nknowledge and imaging property. To this end, we propose a jointly Explicit and\nimplicit Cross-Modal Interaction Network (EiCI-Net) for Anterior Chamber\nInflammation Diagnosis that uses anterior segment optical coherence tomography\n(AS-OCT) images, slit-lamp images, and clinical data jointly. Specifically, we\nfirst develop CNN-Based Encoders and Tabular Processing Module (TPM) to extract\nefficient feature representations in different modalities. Then, we devise an\nExplicit Cross-Modal Interaction Module (ECIM) to generate attention maps as a\nkind of explicit clinical knowledge based on the tabular feature maps, then\nintegrated them into the slit-lamp feature maps, allowing the CNN-Based Encoder\nto focus on more effective informativeness of the slit-lamp images. After that,\nthe Implicit Cross-Modal Interaction Module (ICIM), a transformer-based\nnetwork, further implicitly enhances modality interactions. Finally, we\nconstruct a considerable real-world dataset from our collaborative hospital and\nconduct sufficient experiments to demonstrate the superior performance of our\nproposed EiCI-Net compared with the state-of-the-art classification methods in\nvarious metrics.\n","authors":["Qian Shao","Ye Dai","Haochao Ying","Kan Xu","Jinhong Wang","Wei Chi","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2312.06171v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.06252v5","updated":"2023-12-19T12:31:45Z","published":"2023-05-10T15:33:15Z","title":"Embedded Feature Similarity Optimization with Specific Parameter\n Initialization for 2D/3D Medical Image Registration","summary":" We present a novel deep learning-based framework: Embedded Feature Similarity\nOptimization with Specific Parameter Initialization (SOPI) for 2D/3D medical\nimage registration which is a most challenging problem due to the difficulty\nsuch as dimensional mismatch, heavy computation load and lack of golden\nevaluation standard. The framework we design includes a parameter specification\nmodule to efficiently choose initialization pose parameter and a\nfine-registration module to align images. The proposed framework takes\nextracting multi-scale features into consideration using a novel composite\nconnection encoder with special training techniques. We compare the method with\nboth learning-based methods and optimization-based methods on a in-house\nCT/X-ray dataset as well as simulated data to further evaluate performance. Our\nexperiments demonstrate that the method in this paper has improved the\nregistration performance, and thereby outperforms the existing methods in terms\nof accuracy and running time. We also show the potential of the proposed method\nas an initial pose estimator. The code is available at\nhttps://github.com/m1nhengChen/SOPI\n","authors":["Minheng Chen","Zhirun Zhang","Shuheng Gu","Youyong Kong"],"pdf_url":"https://arxiv.org/pdf/2305.06252v5.pdf","comment":"14 pages, 5 figures, accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.12102v1","updated":"2023-12-19T12:26:57Z","published":"2023-12-19T12:26:57Z","title":"I-CEE: Tailoring Explanations of Image Classifications Models to User\n Expertise","summary":" Effectively explaining decisions of black-box machine learning models is\ncritical to responsible deployment of AI systems that rely on them. Recognizing\ntheir importance, the field of explainable AI (XAI) provides several techniques\nto generate these explanations. Yet, there is relatively little emphasis on the\nuser (the explainee) in this growing body of work and most XAI techniques\ngenerate \"one-size-fits-all\" explanations. To bridge this gap and achieve a\nstep closer towards human-centered XAI, we present I-CEE, a framework that\nprovides Image Classification Explanations tailored to User Expertise. Informed\nby existing work, I-CEE explains the decisions of image classification models\nby providing the user with an informative subset of training data (i.e.,\nexample images), corresponding local explanations, and model decisions.\nHowever, unlike prior work, I-CEE models the informativeness of the example\nimages to depend on user expertise, resulting in different examples for\ndifferent users. We posit that by tailoring the example set to user expertise,\nI-CEE can better facilitate users' understanding and simulatability of the\nmodel. To evaluate our approach, we conduct detailed experiments in both\nsimulation and with human participants (N = 100) on multiple datasets.\nExperiments with simulated users show that I-CEE improves users' ability to\naccurately predict the model's decisions (simulatability) compared to\nbaselines, providing promising preliminary results. Experiments with human\nparticipants demonstrate that our method significantly improves user\nsimulatability accuracy, highlighting the importance of human-centered XAI\n","authors":["Yao Rong","Peizhu Qian","Vaibhav Unhelkar","Enkelejda Kasneci"],"pdf_url":"https://arxiv.org/pdf/2312.12102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10798v2","updated":"2023-12-19T12:22:45Z","published":"2023-12-17T19:22:39Z","title":"Land use/land cover classification of fused Sentinel-1 and Sentinel-2\n imageries using ensembles of Random Forests","summary":" The study explores the synergistic combination of Synthetic Aperture Radar\n(SAR) and Visible-Near Infrared-Short Wave Infrared (VNIR-SWIR) imageries for\nland use/land cover (LULC) classification. Image fusion, employing Bayesian\nfusion, merges SAR texture bands with VNIR-SWIR imageries. The research aims to\ninvestigate the impact of this fusion on LULC classification. Despite the\npopularity of random forests for supervised classification, their limitations,\nsuch as suboptimal performance with fewer features and accuracy stagnation, are\naddressed. To overcome these issues, ensembles of random forests (RFE) are\ncreated, introducing random rotations using the Forest-RC algorithm. Three\nrotation approaches: principal component analysis (PCA), sparse random rotation\n(SRP) matrix, and complete random rotation (CRP) matrix are employed.\nSentinel-1 SAR data and Sentinel-2 VNIR-SWIR data from the IIT-Kanpur region\nconstitute the training datasets, including SAR, SAR with texture, VNIR-SWIR,\nVNIR-SWIR with texture, and fused VNIR-SWIR with texture. The study evaluates\nclassifier efficacy, explores the impact of SAR and VNIR-SWIR fusion on\nclassification, and significantly enhances the execution speed of Bayesian\nfusion code. The SRP-based RFE outperforms other ensembles for the first two\ndatasets, yielding average overall kappa values of 61.80% and 68.18%, while the\nCRP-based RFE excels for the last three datasets with average overall kappa\nvalues of 95.99%, 96.93%, and 96.30%. The fourth dataset achieves the highest\noverall kappa of 96.93%. Furthermore, incorporating texture with SAR bands\nresults in a maximum overall kappa increment of 10.00%, while adding texture to\nVNIR-SWIR bands yields a maximum increment of approximately 3.45%.\n","authors":["Shivam Pande"],"pdf_url":"https://arxiv.org/pdf/2312.10798v2.pdf","comment":"Thesis for Master of Technology. Created: July 2018. Total pages 124"},{"id":"http://arxiv.org/abs/2312.12098v1","updated":"2023-12-19T12:21:09Z","published":"2023-12-19T12:21:09Z","title":"Domain Generalization in LiDAR Semantic Segmentation Leveraged by\n Density Discriminative Feature Embedding","summary":" While significant progress has been achieved in LiDAR-based perception,\ndomain generalization continues to present challenges, often resulting in\nreduced performance when encountering unfamiliar datasets due to domain\ndiscrepancies. One of the primary hurdles stems from the variability of LiDAR\nsensors, leading to inconsistencies in point cloud density distribution. Such\ninconsistencies can undermine the effectiveness of perception models. We\naddress this challenge by introducing a new approach that acknowledges a\nfundamental characteristic of LiDAR: the variation in point density due to the\ndistance from the LiDAR to the scene, and the number of beams relative to the\nfield of view. Understanding this, we view each LiDAR's point cloud at various\ndistances as having distinct density distributions, which can be consistent\nacross different LiDAR models. With this insight, we propose the Density\nDiscriminative Feature Embedding (DDFE) module, crafted to specifically extract\nfeatures related to density while ensuring domain invariance across different\nLiDAR sensors. In addition, we introduce a straightforward but effective\ndensity augmentation technique, designed to broaden the density spectrum and\nenhance the capabilities of the DDFE. The proposed DDFE stands out as a\nversatile and lightweight domain generalization module. It can be seamlessly\nintegrated into various 3D backbone networks, consistently outperforming\nexisting state-of-the-art domain generalization approaches. We commit to\nreleasing the source code publicly to foster community collaboration and\nadvancement.\n","authors":["Jaeyeul Kim","Jungwan Woo","Jeonghoon Kim","Sunghoon Im"],"pdf_url":"https://arxiv.org/pdf/2312.12098v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2312.05281v2","updated":"2023-12-19T12:20:22Z","published":"2023-12-08T10:27:47Z","title":"X2-Softmax: Margin Adaptive Loss Function for Face Recognition","summary":" Learning the discriminative features of different faces is an important task\nin face recognition. By extracting face features in neural networks, it becomes\neasy to measure the similarity of different face images, which makes face\nrecognition possible. To enhance the neural network's face feature\nseparability, incorporating an angular margin during training is common\npractice. State-of-the-art loss functions CosFace and ArcFace apply fixed\nmargins between weights of classes to enhance the inter-class separation of\nface features. Since the distribution of samples in the training set is\nimbalanced, similarities between different identities are unequal. Therefore,\nusing an inappropriately fixed angular margin may lead to the problem that the\nmodel is difficult to converge or the face features are not discriminative\nenough. It is more in line with our intuition that the margins are angular\nadaptive, which could increase with the angles between classes growing. In this\npaper, we propose a new angular margin loss named X2-Softmax. X2-Softmax loss\nhas adaptive angular margins, which provide the margin that increases with the\nangle between different classes growing. The angular adaptive margin ensures\nmodel flexibility and effectively improves the effect of face recognition. We\nhave trained the neural network with X2-Softmax loss on the MS1Mv3 dataset and\ntested it on several evaluation benchmarks to demonstrate the effectiveness and\nsuperiority of our loss function.\n","authors":["Jiamu Xu","Xiaoxiang Liu","Xinyuan Zhang","Yain-Whar Si","Xiaofan Li","Zheng Shi","Ke Wang","Xueyuan Gong"],"pdf_url":"https://arxiv.org/pdf/2312.05281v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12096v1","updated":"2023-12-19T12:19:20Z","published":"2023-12-19T12:19:20Z","title":"DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular\n Videos","summary":" Reconstructing a dynamic human with loose clothing is an important but\ndifficult task. To address this challenge, we propose a method named DLCA-Recon\nto create human avatars from monocular videos. The distance from loose clothing\nto the underlying body rapidly changes in every frame when the human freely\nmoves and acts. Previous methods lack effective geometric initialization and\nconstraints for guiding the optimization of deformation to explain this\ndramatic change, resulting in the discontinuous and incomplete reconstruction\nsurface. To model the deformation more accurately, we propose to initialize an\nestimated 3D clothed human in the canonical space, as it is easier for\ndeformation fields to learn from the clothed human than from SMPL. With both\nrepresentations of explicit mesh and implicit SDF, we utilize the physical\nconnection information between consecutive frames and propose a dynamic\ndeformation field (DDF) to optimize deformation fields. DDF accounts for\ncontributive forces on loose clothing to enhance the interpretability of\ndeformations and effectively capture the free movement of loose clothing.\nMoreover, we propagate SMPL skinning weights to each individual and refine pose\nand skinning weights during the optimization to improve skinning\ntransformation. Based on more reasonable initialization and DDF, we can\nsimulate real-world physics more accurately. Extensive experiments on public\nand our own datasets validate that our method can produce superior results for\nhumans with loose clothing compared to the SOTA methods.\n","authors":["Chunjie Luo","Fei Luo","Yusen Wang","Enxu Zhao","Chunxia Xiao"],"pdf_url":"https://arxiv.org/pdf/2312.12096v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12090v1","updated":"2023-12-19T12:10:12Z","published":"2023-12-19T12:10:12Z","title":"GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion\n Prediction","summary":" Human motion prediction is important for virtual reality (VR) applications,\ne.g., for realistic avatar animation. Existing methods have synthesised body\nmotion only from observed past motion, despite the fact that human gaze is\nknown to correlate strongly with body movements and is readily available in\nrecent VR headsets. We present GazeMoDiff -- a novel gaze-guided denoising\ndiffusion model to generate stochastic human motions. Our method first uses a\ngraph attention network to learn the spatio-temporal correlations between eye\ngaze and human movements and to fuse them into cross-modal gaze-motion\nfeatures. These cross-modal features are injected into a noise prediction\nnetwork via a cross-attention mechanism and progressively denoised to generate\nrealistic human full-body motions. Experimental results on the MoGaze and GIMO\ndatasets demonstrate that our method outperforms the state-of-the-art methods\nby a large margin in terms of average displacement error (15.03% on MoGaze and\n9.20% on GIMO). We further conducted an online user study to compare our method\nwith state-of-the-art methods and the responses from 23 participants validate\nthat the motions generated by our method are more realistic than those from\nother methods. Taken together, our work makes a first important step towards\ngaze-guided stochastic human motion prediction and guides future work on this\nimportant topic in VR research.\n","authors":["Haodong Yan","Zhiming Hu","Syn Schmitt","Andreas Bulling"],"pdf_url":"https://arxiv.org/pdf/2312.12090v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12080v1","updated":"2023-12-19T11:57:54Z","published":"2023-12-19T11:57:54Z","title":"Learning Subject-Aware Cropping by Outpainting Professional Photos","summary":" How to frame (or crop) a photo often depends on the image subject and its\ncontext; e.g., a human portrait. Recent works have defined the subject-aware\nimage cropping task as a nuanced and practical version of image cropping. We\npropose a weakly-supervised approach (GenCrop) to learn what makes a\nhigh-quality, subject-aware crop from professional stock images. Unlike\nsupervised prior work, GenCrop requires no new manual annotations beyond the\nexisting stock image collection. The key challenge in learning from this data,\nhowever, is that the images are already cropped and we do not know what regions\nwere removed. Our insight is combine a library of stock images with a modern,\npre-trained text-to-image diffusion model. The stock image collection provides\ndiversity and its images serve as pseudo-labels for a good crop, while the\ntext-image diffusion model is used to out-paint (i.e., outward inpainting)\nrealistic uncropped images. Using this procedure, we are able to automatically\ngenerate a large dataset of cropped-uncropped training pairs to train a\ncropping model. Despite being weakly-supervised, GenCrop is competitive with\nstate-of-the-art supervised methods and significantly better than comparable\nweakly-supervised baselines on quantitative and qualitative evaluation metrics.\n","authors":["James Hong","Lu Yuan","Michaël Gharbi","Matthew Fisher","Kayvon Fatahalian"],"pdf_url":"https://arxiv.org/pdf/2312.12080v1.pdf","comment":"AAAI 24. Extended version with supplemental materials"},{"id":"http://arxiv.org/abs/2312.10439v2","updated":"2023-12-19T11:43:07Z","published":"2023-12-16T13:06:15Z","title":"Simple Image-level Classification Improves Open-vocabulary Object\n Detection","summary":" Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a\ngiven set of base categories on which the detection model is trained. Recent\nOVOD methods focus on adapting the image-level pre-trained vision-language\nmodels (VLMs), such as CLIP, to a region-level object detection task via, eg.,\nregion-level knowledge distillation, regional prompt learning, or region-text\npre-training, to expand the detection vocabulary. These methods have\ndemonstrated remarkable performance in recognizing regional visual concepts,\nbut they are weak in exploiting the VLMs' powerful global scene understanding\nability learned from the billion-scale image-level text descriptions. This\nlimits their capability in detecting hard objects of small, blurred, or\noccluded appearance from novel/base categories, whose detection heavily relies\non contextual information. To address this, we propose a novel approach, namely\nSimple Image-level Classification for Context-Aware Detection Scoring\n(SIC-CADS), to leverage the superior global knowledge yielded from CLIP for\ncomplementing the current OVOD models from a global perspective. The core of\nSIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the\nobject co-occurrence-based contextual information from CLIP to recognize all\npossible object categories in the scene. These image-level MLR scores can then\nbe utilized to refine the instance-level detection scores of the current OVOD\nmodels in detecting those hard objects. This is verified by extensive empirical\nresults on two popular benchmarks, OV-LVIS and OV-COCO, which show that\nSIC-CADS achieves significant and consistent improvement when combined with\ndifferent types of OVOD models. Further, SIC-CADS also improves the\ncross-dataset generalization ability on Objects365 and OpenImages. The code is\navailable at https://github.com/mala-lab/SIC-CADS.\n","authors":["Ruohuan Fang","Guansong Pang","Xiao Bai"],"pdf_url":"https://arxiv.org/pdf/2312.10439v2.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2306.07915v4","updated":"2023-12-19T11:40:46Z","published":"2023-06-13T17:18:01Z","title":"Image Captioners Are Scalable Vision Learners Too","summary":" Contrastive pretraining on image-text pairs from the web is one of the most\npopular large-scale pretraining strategies for vision backbones, especially in\nthe context of large multimodal models. At the same time, image captioning on\nthis type of data is commonly considered an inferior pretraining strategy. In\nthis paper, we perform a fair comparison of these two pretraining strategies,\ncarefully matching training data, compute, and model capacity. Using a standard\nencoder-decoder transformer, we find that captioning alone is surprisingly\neffective: on classification tasks, captioning produces vision encoders\ncompetitive with contrastively pretrained encoders, while surpassing them on\nvision & language tasks. We further analyze the effect of the model\narchitecture and scale, as well as the pretraining data on the representation\nquality, and find that captioning exhibits the same or better scaling behavior\nalong these axes. Overall our results show that plain image captioning is a\nmore powerful pretraining strategy than was previously believed.\n","authors":["Michael Tschannen","Manoj Kumar","Andreas Steiner","Xiaohua Zhai","Neil Houlsby","Lucas Beyer"],"pdf_url":"https://arxiv.org/pdf/2306.07915v4.pdf","comment":"Accepted at NeurIPS 2023. v2 adds SugarCrepe results and more\n ablations, v3 has minor fixes. v4 adds a code link (\n https://github.com/google-research/big_vision )"},{"id":"http://arxiv.org/abs/2312.12068v1","updated":"2023-12-19T11:36:03Z","published":"2023-12-19T11:36:03Z","title":"PICNN: A Pathway towards Interpretable Convolutional Neural Networks","summary":" Convolutional Neural Networks (CNNs) have exhibited great performance in\ndiscriminative feature learning for complex visual tasks. Besides\ndiscrimination power, interpretability is another important yet under-explored\nproperty for CNNs. One difficulty in the CNN interpretability is that filters\nand image classes are entangled. In this paper, we introduce a novel pathway to\nalleviate the entanglement between filters and image classes. The proposed\npathway groups the filters in a late conv-layer of CNN into class-specific\nclusters. Clusters and classes are in a one-to-one relationship. Specifically,\nwe use the Bernoulli sampling to generate the filter-cluster assignment matrix\nfrom a learnable filter-class correspondence matrix. To enable end-to-end\noptimization, we develop a novel reparameterization trick for handling the\nnon-differentiable Bernoulli sampling. We evaluate the effectiveness of our\nmethod on ten widely used network architectures (including nine CNNs and a ViT)\nand five benchmark datasets. Experimental results have demonstrated that our\nmethod PICNN (the combination of standard CNNs with our proposed pathway)\nexhibits greater interpretability than standard CNNs while achieving higher or\ncomparable discrimination power.\n","authors":["Wengang Guo","Jiayi Yang","Huilin Yin","Qijun Chen","Wei Ye"],"pdf_url":"https://arxiv.org/pdf/2312.12068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12064v1","updated":"2023-12-19T11:32:02Z","published":"2023-12-19T11:32:02Z","title":"MPI Planar Correction of Pulse Based ToF Cameras","summary":" Time-of-Flight (ToF) cameras are becoming popular in a wide span of areas\nranging from consumer-grade electronic devices to safety-critical industrial\nrobots. This is mainly due to their high frame rate, relative good precision\nand the lowered costs. Although ToF cameras are in continuous development,\nespecially pulse-based variants, they still face different problems, including\nspurious noise over the points or multipath inference (MPI). The latter can\ncause deformed surfaces to manifest themselves on curved surfaces instead of\nplanar ones, making standard spatial data preprocessing, such as plane\nextraction, difficult. In this paper, we focus on the MPI reduction problem\nusing Feature Pyramid Networks (FPN) which allow the mitigation of this type of\nartifact for pulse-based ToF cameras. With our end-to-end network, we managed\nto attenuate the MPI effect on planar surfaces using a learning-based method on\nreal ToF data. Both the custom dataset used for our model training as well as\nthe code is available on the author's Github homepage.\n","authors":["Marian-Leontin Pop","Levente Tamas"],"pdf_url":"https://arxiv.org/pdf/2312.12064v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12042v1","updated":"2023-12-19T10:55:46Z","published":"2023-12-19T10:55:46Z","title":"Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body\n Poses using an Eye-body Coordination Model","summary":" While generating realistic body movements, e.g., for avatars in virtual\nreality, is widely studied in computer vision and graphics, the generation of\neye movements that exhibit realistic coordination with the body remains\nunder-explored. We first report a comprehensive analysis of the coordination of\nhuman eye and full-body movements during everyday activities based on data from\nthe MoGaze and GIMO datasets. We show that eye gaze has strong correlations\nwith head directions and also full-body motions and there exists a noticeable\ntime delay between body and eye movements. Inspired by the analyses, we then\npresent Pose2Gaze -- a novel eye-body coordination model that first uses a\nconvolutional neural network and a spatio-temporal graph convolutional neural\nnetwork to extract features from head directions and full-body poses\nrespectively and then applies a convolutional neural network to generate\nrealistic eye movements. We compare our method with state-of-the-art methods\nthat predict eye gaze only from head movements for three different generation\ntasks and demonstrate that Pose2Gaze significantly outperforms these baselines\non both datasets with an average improvement of 26.4% and 21.6% in mean angular\nerror, respectively. Our findings underline the significant potential of\ncross-modal human gaze behaviour analysis and modelling.\n","authors":["Zhiming Hu","Jiahui Xu","Syn Schmitt","Andreas Bulling"],"pdf_url":"https://arxiv.org/pdf/2312.12042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09783v3","updated":"2023-12-19T10:36:41Z","published":"2023-12-15T13:36:54Z","title":"Keep the Faith: Faithful Explanations in Convolutional Neural Networks\n for Case-Based Reasoning","summary":" Explaining predictions of black-box neural networks is crucial when applied\nto decision-critical tasks. Thus, attribution maps are commonly used to\nidentify important image regions, despite prior work showing that humans prefer\nexplanations based on similar examples. To this end, ProtoPNet learns a set of\nclass-representative feature vectors (prototypes) for case-based reasoning.\nDuring inference, similarities of latent features to prototypes are linearly\nclassified to form predictions and attribution maps are provided to explain the\nsimilarity. In this work, we evaluate whether architectures for case-based\nreasoning fulfill established axioms required for faithful explanations using\nthe example of ProtoPNet. We show that such architectures allow the extraction\nof faithful explanations. However, we prove that the attribution maps used to\nexplain the similarities violate the axioms. We propose a new procedure to\nextract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually,\nthese explanations are Shapley values, calculated on the similarity scores of\neach prototype. They allow to faithfully answer which prototypes are present in\nan unseen image and quantify each pixel's contribution to that presence,\nthereby complying with all axioms. The theoretical violations of ProtoPNet\nmanifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs,\nRSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50,\nResNeXt50). Our experiments show a qualitative difference between the\nexplanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the\nexplanations with the Area Over the Perturbation Curve, on which ProtoPFaith\noutperforms ProtoPNet on all experiments by a factor $>10^3$.\n","authors":["Tom Nuno Wolf","Fabian Bongratz","Anne-Marie Rickmann","Sebastian Pölsterl","Christian Wachinger"],"pdf_url":"https://arxiv.org/pdf/2312.09783v3.pdf","comment":"To be published in proceedings of AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2312.11315v2","updated":"2023-12-19T10:31:08Z","published":"2023-12-18T16:10:18Z","title":"CaRe-CNN: Cascading Refinement CNN for Myocardial Infarct Segmentation\n with Microvascular Obstructions","summary":" Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely\nestablished to assess the viability of myocardial tissue of patients after\nacute myocardial infarction (MI). We propose the Cascading Refinement CNN\n(CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that\nexploits the hierarchical structure of such labeled cardiac data. Throughout\nthe three stages of the cascade, the label definition changes and CaRe-CNN\nlearns to gradually refine its intermediate predictions accordingly.\nFurthermore, to obtain more consistent qualitative predictions, we propose a\nseries of post-processing steps that take anatomical constraints into account.\nOur CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked\nsecond out of 18 participating teams. CaRe-CNN showed great improvements most\nnotably when segmenting the difficult but clinically most relevant myocardial\ninfarct tissue (MIT) as well as microvascular obstructions (MVO). When\ncomputing the average scores over all labels, our method obtained the best\nscore in eight out of ten metrics. Thus, accurate cardiac segmentation after\nacute MI via our CaRe-CNN allows generating patient-specific models of the\nheart serving as an important step towards personalized medicine.\n","authors":["Franz Thaler","Matthias A. F. Gsell","Gernot Plank","Martin Urschler"],"pdf_url":"https://arxiv.org/pdf/2312.11315v2.pdf","comment":"Accepted at VISIGRAPP 2024, 12 pages"},{"id":"http://arxiv.org/abs/2312.12030v1","updated":"2023-12-19T10:30:31Z","published":"2023-12-19T10:30:31Z","title":"Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint\n Method","summary":" Training-free guided sampling in diffusion models leverages off-the-shelf\npre-trained networks, such as an aesthetic evaluation model, to guide the\ngeneration process. Current training-free guided sampling algorithms obtain the\nguidance energy function based on a one-step estimate of the clean image.\nHowever, since the off-the-shelf pre-trained networks are trained on clean\nimages, the one-step estimation procedure of the clean image may be inaccurate,\nespecially in the early stages of the generation process in diffusion models.\nThis causes the guidance in the early time steps to be inaccurate. To overcome\nthis problem, we propose Symplectic Adjoint Guidance (SAG), which calculates\nthe gradient guidance in two inner stages. Firstly, SAG estimates the clean\nimage via $n$ function calls, where $n$ serves as a flexible hyperparameter\nthat can be tailored to meet specific image quality requirements. Secondly, SAG\nuses the symplectic adjoint method to obtain the gradients accurately and\nefficiently in terms of the memory requirements. Extensive experiments\ndemonstrate that SAG generates images with higher qualities compared to the\nbaselines in both guided image and video generation tasks.\n","authors":["Jiachun Pan","Hanshu Yan","Jun Hao Liew","Jiashi Feng","Vincent Y. F. Tan"],"pdf_url":"https://arxiv.org/pdf/2312.12030v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12028v1","updated":"2023-12-19T10:29:29Z","published":"2023-12-19T10:29:29Z","title":"EyePreserve: Identity-Preserving Iris Synthesis","summary":" Synthesis of same-identity biometric iris images, both for existing and\nnon-existing identities while preserving the identity across a wide range of\npupil sizes, is complex due to intricate iris muscle constriction mechanism,\nrequiring a precise model of iris non-linear texture deformations to be\nembedded into the synthesis pipeline. This paper presents the first method of\nfully data-driven, identity-preserving, pupil size-varying s ynthesis of iris\nimages. This approach is capable of synthesizing images of irises with\ndifferent pupil sizes representing non-existing identities as well as\nnon-linearly deforming the texture of iris images of existing subjects given\nthe segmentation mask of the target iris image. Iris recognition experiments\nsuggest that the proposed deformation model not only preserves the identity\nwhen changing the pupil size but offers better similarity between same-identity\niris samples with significant differences in pupil size, compared to\nstate-of-the-art linear and non-linear (bio-mechanical-based) iris deformation\nmodels. Two immediate applications of the proposed approach are: (a) synthesis\nof, or enhancement of the existing biometric datasets for iris recognition,\nmimicking those acquired with iris sensors, and (b) helping forensic human\nexperts in examining iris image pairs with significant differences in pupil\ndilation. Source codes and weights of the models are made available with the\npaper.\n","authors":["Siamul Karim Khan","Patrick Tinsley","Mahsa Mitcheff","Patrick Flynn","Kevin W. Bowyer","Adam Czajka"],"pdf_url":"https://arxiv.org/pdf/2312.12028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12023v1","updated":"2023-12-19T10:19:44Z","published":"2023-12-19T10:19:44Z","title":"Progressive Frequency-Aware Network for Laparoscopic Image Desmoking","summary":" Laparoscopic surgery offers minimally invasive procedures with better patient\noutcomes, but smoke presence challenges visibility and safety. Existing\nlearning-based methods demand large datasets and high computational resources.\nWe propose the Progressive Frequency-Aware Network (PFAN), a lightweight GAN\nframework for laparoscopic image desmoking, combining the strengths of CNN and\nTransformer for progressive information extraction in the frequency domain.\nPFAN features CNN-based Multi-scale Bottleneck-Inverting (MBI) Blocks for\ncapturing local high-frequency information and Locally-Enhanced Axial Attention\nTransformers (LAT) for efficiently handling global low-frequency information.\nPFAN efficiently desmokes laparoscopic images even with limited training data.\nOur method outperforms state-of-the-art approaches in PSNR, SSIM, CIEDE2000,\nand visual quality on the Cholec80 dataset and retains only 629K parameters.\nOur code and models are made publicly available at:\nhttps://github.com/jlzcode/PFAN.\n","authors":["Jiale Zhang","Wenfeng Huang","Xiangyun Liao","Qiong Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.17338v2","updated":"2023-12-19T10:06:08Z","published":"2023-03-30T12:45:46Z","title":"Local region-learning modules for point cloud classification","summary":" Data organization via forming local regions is an integral part of deep\nlearning networks that process 3D point clouds in a hierarchical manner. At\neach level, the point cloud is sampled to extract representative points and\nthese points are used to be centers of local regions. The organization of local\nregions is of considerable importance since it determines the location and size\nof the receptive field at a particular layer of feature aggregation. In this\npaper, we present two local region-learning modules: Center Shift Module to\ninfer the appropriate shift for each center point, and Radius Update Module to\nalter the radius of each local region. The parameters of the modules are\nlearned through optimizing the loss associated with the particular task within\nan end-to-end network. We present alternatives for these modules through\nvarious ways of modeling the interactions of the features and locations of 3D\npoints in the point cloud. We integrated both modules independently and\ntogether to the PointNet++ and PointCNN object classification architectures,\nand demonstrated that the modules contributed to a significant increase in\nclassification accuracy for the ScanObjectNN data set consisting of scans of\nreal-world objects. Our further experiments on ShapeNet data set showed that\nthe modules are also effective on 3D CAD models.\n","authors":["Kaya Turgut","Helin Dutagaci"],"pdf_url":"https://arxiv.org/pdf/2303.17338v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12000v1","updated":"2023-12-19T09:47:18Z","published":"2023-12-19T09:47:18Z","title":"Diffusing More Objects for Semi-Supervised Domain Adaptation with Less\n Labeling","summary":" For object detection, it is possible to view the prediction of bounding boxes\nas a reverse diffusion process. Using a diffusion model, the random bounding\nboxes are iteratively refined in a denoising step, conditioned on the image. We\npropose a stochastic accumulator function that starts each run with random\nbounding boxes and combines the slightly different predictions. We empirically\nverify that this improves detection performance. The improved detections are\nleveraged on unlabelled images as weighted pseudo-labels for semi-supervised\nlearning. We evaluate the method on a challenging out-of-domain test set. Our\nmethod brings significant improvements and is on par with human-selected\npseudo-labels, while not requiring any human involvement.\n","authors":["Leander van den Heuvel","Gertjan Burghouts","David W. Zhang","Gwenn Englebienne","Sabina B. van Rooij"],"pdf_url":"https://arxiv.org/pdf/2312.12000v1.pdf","comment":"4 pages, Workshop on DiffusionModels, NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.07937v2","updated":"2023-12-19T09:39:58Z","published":"2023-12-13T07:30:19Z","title":"BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics","summary":" The recently emerging text-to-motion advances have spired numerous attempts\nfor convenient and interactive human motion generation. Yet, existing methods\nare largely limited to generating body motions only without considering the\nrich two-hand motions, let alone handling various conditions like body dynamics\nor texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal\ndataset for two-hand motion generation. Our dataset includes accurate motion\ntracking for the human body and hands and provides pair-wised finger-level hand\nannotations and body descriptions. We further provide a strong baseline method,\nBOTH2Hands, for the novel task: generating vivid two-hand motions from both\nimplicit body dynamics and explicit text prompts. We first warm up two parallel\nbody-to-hand and text-to-hand diffusion models and then utilize the\ncross-attention transformer for motion blending. Extensive experiments and\ncross-validations demonstrate the effectiveness of our approach and dataset for\ngenerating convincing two-hand motions from the hybrid body-and-textual\nconditions. Our dataset and code will be disseminated to the community for\nfuture research.\n","authors":["Wenqian Zhang","Molin Huang","Yuxuan Zhou","Juze Zhang","Jingyi Yu","Jingya Wang","Lan Xu"],"pdf_url":"https://arxiv.org/pdf/2312.07937v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11994v1","updated":"2023-12-19T09:37:25Z","published":"2023-12-19T09:37:25Z","title":"Optimizing Diffusion Noise Can Serve As Universal Motion Priors","summary":" We propose Diffusion Noise Optimization (DNO), a new method that effectively\nleverages existing motion diffusion models as motion priors for a wide range of\nmotion-related tasks. Instead of training a task-specific diffusion model for\neach new task, DNO operates by optimizing the diffusion latent noise of an\nexisting pre-trained text-to-motion model. Given the corresponding latent noise\nof a human motion, it propagates the gradient from the target criteria defined\non the motion space through the whole denoising process to update the diffusion\nlatent noise. As a result, DNO supports any use cases where criteria can be\ndefined as a function of motion. In particular, we show that, for motion\nediting and control, DNO outperforms existing methods in both achieving the\nobjective and preserving the motion content. DNO accommodates a diverse range\nof editing modes, including changing trajectory, pose, joint locations, or\navoiding newly added obstacles. In addition, DNO is effective in motion\ndenoising and completion, producing smooth and realistic motion from noisy and\npartial inputs. DNO achieves these results at inference time without the need\nfor model retraining, offering great versatility for any defined reward or loss\nfunction on the motion representation.\n","authors":["Korrawe Karunratanakul","Konpat Preechakul","Emre Aksan","Thabo Beeler","Supasorn Suwajanakorn","Siyu Tang"],"pdf_url":"https://arxiv.org/pdf/2312.11994v1.pdf","comment":"Project page: https://korrawe.github.io/dno-project/"},{"id":"http://arxiv.org/abs/2303.08906v2","updated":"2023-12-19T09:16:31Z","published":"2023-03-15T20:02:54Z","title":"VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression","summary":" In content-based video retrieval (CBVR), dealing with large-scale\ncollections, efficiency is as important as accuracy; thus, several video-level\nfeature-based studies have actively been conducted. Nevertheless, owing to the\nsevere difficulty of embedding a lengthy and untrimmed video into a single\nfeature, these studies have been insufficient for accurate retrieval compared\nto frame-level feature-based studies. In this paper, we show that appropriate\nsuppression of irrelevant frames can provide insight into the current obstacles\nof the video-level approaches. Furthermore, we propose a Video-to-Video\nSuppression network (VVS) as a solution. VVS is an end-to-end framework that\nconsists of an easy distractor elimination stage to identify which frames to\nremove and a suppression weight generation stage to determine the extent to\nsuppress the remaining frames. This structure is intended to effectively\ndescribe an untrimmed video with varying content and meaningless information.\nIts efficacy is proved via extensive experiments, and we show that our approach\nis not only state-of-the-art in video-level approaches but also has a fast\ninference time despite possessing retrieval capabilities close to those of\nframe-level approaches. Code is available at https://github.com/sejong-rcv/VVS\n","authors":["Won Jo","Geuntaek Lim","Gwangjin Lee","Hyunwoo Kim","Byungsoo Ko","Yukyung Choi"],"pdf_url":"https://arxiv.org/pdf/2303.08906v2.pdf","comment":"AAAI-24"},{"id":"http://arxiv.org/abs/2203.16284v3","updated":"2023-12-19T09:12:35Z","published":"2022-03-30T13:24:04Z","title":"FIRe: Fast Inverse Rendering using Directional and Signed Distance\n Functions","summary":" Neural 3D implicit representations learn priors that are useful for diverse\napplications, such as single- or multiple-view 3D reconstruction. A major\ndownside of existing approaches while rendering an image is that they require\nevaluating the network multiple times per camera ray so that the high\ncomputational time forms a bottleneck for downstream applications. We address\nthis problem by introducing a novel neural scene representation that we call\nthe directional distance function (DDF). To this end, we learn a signed\ndistance function (SDF) along with our DDF model to represent a class of\nshapes. Specifically, our DDF is defined on the unit sphere and predicts the\ndistance to the surface along any given direction. Therefore, our DDF allows\nrendering images with just a single network evaluation per camera ray. Based on\nour DDF, we present a novel fast algorithm (FIRe) to reconstruct 3D shapes\ngiven a posed depth map. We evaluate our proposed method on 3D reconstruction\nfrom single-view depth images, where we empirically show that our algorithm\nreconstructs 3D shapes more accurately and it is more than 15 times faster (per\niteration) than competing methods.\n","authors":["Tarun Yenamandra","Ayush Tewari","Nan Yang","Florian Bernard","Christian Theobalt","Daniel Cremers"],"pdf_url":"https://arxiv.org/pdf/2203.16284v3.pdf","comment":"News: Accepted to WACV'24. Project page:\n https://vision.in.tum.de/research/geometry/fire"},{"id":"http://arxiv.org/abs/2312.11973v1","updated":"2023-12-19T09:11:49Z","published":"2023-12-19T09:11:49Z","title":"Continual Learning: Forget-free Winning Subnetworks for Video\n Representations","summary":" Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which\nhighlights the presence of competitive subnetworks within dense networks for\ncontinual learning tasks, we introduce Winning Subnetworks (WSN). This approach\nutilizes reused weights in dense networks to enhance learning in Task\nIncremental Learning (TIL) scenarios. To mitigate overfitting in Few-Shot Class\nIncremental Learning (FSCIL), we have developed WSN variants referred to as the\nSoft subnetwork (SoftNet). Furthermore, addressing WSN's limitation of sparse\nreused weights in Video Incremental Learning (VIL), we propose the Fourier\nSubneural Operator (FSO). The FSO, operating in Fourier space, adaptively and\ncompactly encodes videos, discovering reusable subnetworks with diverse\nbandwidths. We have applied FSO's Fourier representations to various continual\nlearning contexts, including VIL, TIL, and FSCIL. Our extensive experiments\nacross these scenarios demonstrate FSO's remarkable efficacy in continual\nlearning, significantly enhancing task performance at various convolutional\nrepresentational levels: it boosts performance in the higher layers for TIL and\nFSCIL and the lower layers for VIL.\n","authors":["Haeyong Kang","Jaehong Yoon","Sung Ju Hwang","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2312.11973v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2303.14962,\n arXiv:2306.11305"},{"id":"http://arxiv.org/abs/2312.11972v1","updated":"2023-12-19T09:09:46Z","published":"2023-12-19T09:09:46Z","title":"Expressive Forecasting of 3D Whole-body Human Motions","summary":" Human motion forecasting, with the goal of estimating future human behavior\nover a period of time, is a fundamental task in many real-world applications.\nHowever, existing works typically concentrate on predicting the major joints of\nthe human body without considering the delicate movements of the human hands.\nIn practical applications, hand gesture plays an important role in human\ncommunication with the real world, and expresses the primary intention of human\nbeings. In this work, we are the first to formulate a whole-body human pose\nforecasting task, which jointly predicts the future body and hand activities.\nCorrespondingly, we propose a novel Encoding-Alignment-Interaction (EAI)\nframework that aims to predict both coarse (body joints) and fine-grained\n(gestures) activities collaboratively, enabling expressive and\ncross-facilitated forecasting of 3D whole-body human motions. Specifically, our\nmodel involves two key constituents: cross-context alignment (XCA) and\ncross-context interaction (XCI). Considering the heterogeneous information\nwithin the whole-body, XCA aims to align the latent features of various human\ncomponents, while XCI focuses on effectively capturing the context interaction\namong the human components. We conduct extensive experiments on a\nnewly-introduced large-scale benchmark and achieve state-of-the-art\nperformance. The code is public for research purposes at\nhttps://github.com/Dingpx/EAI.\n","authors":["Pengxiang Ding","Qiongjie Cui","Min Zhang","Mengyuan Liu","Haofan Wang","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11967v1","updated":"2023-12-19T09:03:53Z","published":"2023-12-19T09:03:53Z","title":"Context Disentangling and Prototype Inheriting for Robust Visual\n Grounding","summary":" Visual grounding (VG) aims to locate a specific target in an image based on a\ngiven language query. The discriminative information from context is important\nfor distinguishing the target from other objects, particularly for the targets\nthat have the same category as others. However, most previous methods\nunderestimate such information. Moreover, they are usually designed for the\nstandard scene (without any novel object), which limits their generalization to\nthe open-vocabulary scene. In this paper, we propose a novel framework with\ncontext disentangling and prototype inheriting for robust visual grounding to\nhandle both scenes. Specifically, the context disentangling disentangles the\nreferent and context features, which achieves better discrimination between\nthem. The prototype inheriting inherits the prototypes discovered from the\ndisentangled visual features by a prototype bank to fully utilize the seen\ndata, especially for the open-vocabulary scene. The fused features, obtained by\nleveraging Hadamard product on disentangled linguistic and visual features of\nprototypes to avoid sharp adjusting the importance between the two types of\nfeatures, are then attached with a special token and feed to a vision\nTransformer encoder for bounding box regression. Extensive experiments are\nconducted on both standard and open-vocabulary scenes. The performance\ncomparisons indicate that our method outperforms the state-of-the-art methods\nin both scenarios. {The code is available at\nhttps://github.com/WayneTomas/TransCP.\n","authors":["Wei Tang","Liang Li","Xuejing Liu","Lu Jin","Jinhui Tang","Zechao Li"],"pdf_url":"https://arxiv.org/pdf/2312.11967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11954v1","updated":"2023-12-19T08:55:00Z","published":"2023-12-19T08:55:00Z","title":"Adversarial AutoMixup","summary":" Data mixing augmentation has been widely applied to improve the\ngeneralization ability of deep neural networks. Recently, offline data mixing\naugmentation, e.g. handcrafted and saliency information-based mixup, has been\ngradually replaced by automatic mixing approaches. Through minimizing two\nsub-tasks, namely, mixed sample generation and mixup classification in an\nend-to-end way, AutoMix significantly improves accuracy on image classification\ntasks. However, as the optimization objective is consistent for the two\nsub-tasks, this approach is prone to generating consistent instead of diverse\nmixed samples, which results in overfitting for target task training. In this\npaper, we propose AdAutomixup, an adversarial automatic mixup augmentation\napproach that generates challenging samples to train a robust classifier for\nimage classification, by alternatively optimizing the classifier and the mixup\nsample generator. AdAutomixup comprises two modules, a mixed example generator,\nand a target classifier. The mixed sample generator aims to produce hard mixed\nexamples to challenge the target classifier while the target classifier`s aim\nis to learn robust features from hard mixed examples to improve generalization.\nTo prevent the collapse of the inherent meanings of images, we further\nintroduce an exponential moving average (EMA) teacher and cosine similarity to\ntrain AdAutomixup in an end-to-end way. Extensive experiments on seven image\nbenchmarks consistently prove that our approach outperforms the state of the\nart in various classification scenarios.\n","authors":["Huafeng Qin","Xin Jin","Yun Jiang","Mounim A. El-Yacoubi","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2312.11954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11938v1","updated":"2023-12-19T08:31:30Z","published":"2023-12-19T08:31:30Z","title":"DMT: Comprehensive Distillation with Multiple Self-supervised Teachers","summary":" Numerous self-supervised learning paradigms, such as contrastive learning and\nmasked image modeling, have been proposed to acquire powerful and general\nrepresentations from unlabeled data. However, these models are commonly\npretrained within their specific framework alone, failing to consider the\ncomplementary nature of visual representations. To tackle this issue, we\nintroduce Comprehensive Distillation with Multiple Self-supervised Teachers\n(DMT) for pretrained model compression, which leverages the strengths of\nmultiple off-the-shelf self-supervised models. Our experimental results on\nprominent benchmark datasets exhibit that the proposed method significantly\nsurpasses state-of-the-art competitors while retaining favorable efficiency\nmetrics. On classification tasks, our DMT framework utilizing three different\nself-supervised ViT-Base teachers enhances the performance of both small/tiny\nmodels and the base model itself. For dense tasks, DMT elevates the AP/mIoU of\nstandard SSL models on MS-COCO and ADE20K datasets by 4.0%.\n","authors":["Yuang Liu","Jing Wang","Qiang Zhou","Fan Wang","Jun Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.11938v1.pdf","comment":"ICASSP 2024"},{"id":"http://arxiv.org/abs/2309.02318v2","updated":"2023-12-19T08:20:43Z","published":"2023-09-05T15:34:37Z","title":"TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA\n Reconstruction","summary":" Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical\nrole in the diagnosis of many medical diseases, such as Arteriovenous\nMalformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant\napplication value, the reconstruction of 4D DSA demands numerous views to\neffectively model the intricate vessels and radiocontrast flow, thereby\nimplying a significant radiation dose. To address this high radiation issue, we\npropose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA\nreconstruction, which paves the way for high-quality 4D imaging. Additionally,\n2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA\nimages. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation\nproperties from both spatial and temporal dimensions. It is optimized by\nminimizing discrepancies between the rendered images and sparse 2D DSA images.\nWithout any neural network involved, TiAVox enjoys specific physical\ninterpretability. The parameters of each learnable voxel represent the\nattenuation coefficients. We validated the TiAVox approach on both clinical and\nsimulated datasets, achieving a 31.23 Peak Signal-to-Noise Ratio (PSNR) for\nnovel view synthesis using only 30 views on the clinically sourced dataset,\nwhereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly,\nwith merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32\nfor novel view synthesis and 41.40 for 3D reconstruction. We also executed\nablation studies to corroborate the essential components of TiAVox. The code\nwill be publically available.\n","authors":["Zhenghong Zhou","Huangxuan Zhao","Jiemin Fang","Dongqiao Xiang","Lei Chen","Lingxia Wu","Feihong Wu","Wenyu Liu","Chuansheng Zheng","Xinggang Wang"],"pdf_url":"https://arxiv.org/pdf/2309.02318v2.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.07266v2","updated":"2023-12-19T08:18:47Z","published":"2023-12-12T13:45:56Z","title":"ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for\n Open-Vocabulary Object Detection","summary":" Open-vocabulary object detection (OVOD) aims to recognize novel objects whose\ncategories are not included in the training set. In order to classify these\nunseen classes during training, many OVOD frameworks leverage the zero-shot\ncapability of largely pretrained vision and language models, such as CLIP. To\nfurther improve generalization on the unseen novel classes, several approaches\nproposed to additionally train with pseudo region labeling on the external data\nsources that contain a substantial number of novel category labels beyond the\nexisting training data. Albeit its simplicity, these pseudo-labeling methods\nstill exhibit limited improvement with regard to the truly unseen novel classes\nthat were not pseudo-labeled. In this paper, we present a novel, yet simple\ntechnique that helps generalization on the overall distribution of novel\nclasses. Inspired by our observation that numerous novel classes reside within\nthe convex hull constructed by the base (seen) classes in the CLIP embedding\nspace, we propose to synthesize proxy-novel classes approximating novel classes\nvia linear mixup between a pair of base classes. By training our detector with\nthese synthetic proxy-novel classes, we effectively explore the embedding space\nof novel classes. The experimental results on various OVOD benchmarks such as\nLVIS and COCO demonstrate superior performance on novel classes compared to the\nother state-of-the-art methods. Code is available at\nhttps://github.com/clovaai/ProxyDet.\n","authors":["Joonhyun Jeong","Geondo Park","Jayeon Yoo","Hyungsik Jung","Heesu Kim"],"pdf_url":"https://arxiv.org/pdf/2312.07266v2.pdf","comment":"Accepted in AAAI24"},{"id":"http://arxiv.org/abs/2312.11929v1","updated":"2023-12-19T08:15:22Z","published":"2023-12-19T08:15:22Z","title":"Transformer Network for Multi-Person Tracking and Re-Identification in\n Unconstrained Environment","summary":" Multi-object tracking (MOT) has profound applications in a variety of fields,\nincluding surveillance, sports analytics, self-driving, and cooperative\nrobotics. Despite considerable advancements, existing MOT methodologies tend to\nfalter when faced with non-uniform movements, occlusions, and\nappearance-reappearance scenarios of the objects. Recognizing this inadequacy,\nwe put forward an integrated MOT method that not only marries object detection\nand identity linkage within a singular, end-to-end trainable framework but also\nequips the model with the ability to maintain object identity links over long\nperiods of time. Our proposed model, named STMMOT, is built around four key\nmodules: 1) candidate proposal generation, which generates object proposals via\na vision-transformer encoder-decoder architecture that detects the object from\neach frame in the video; 2) scale variant pyramid, a progressive pyramid\nstructure to learn the self-scale and cross-scale similarities in multi-scale\nfeature maps; 3) spatio-temporal memory encoder, extracting the essential\ninformation from the memory associated with each object under tracking; and 4)\nspatio-temporal memory decoder, simultaneously resolving the tasks of object\ndetection and identity association for MOT. Our system leverages a robust\nspatio-temporal memory module that retains extensive historical observations\nand effectively encodes them using an attention-based aggregator. The\nuniqueness of STMMOT lies in representing objects as dynamic query embeddings\nthat are updated continuously, which enables the prediction of object states\nwith attention mechanisms and eradicates the need for post-processing.\n","authors":["Hamza Mukhtar","Muhammad Usman Ghani Khan"],"pdf_url":"https://arxiv.org/pdf/2312.11929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11923v1","updated":"2023-12-19T08:03:19Z","published":"2023-12-19T08:03:19Z","title":"IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text\n Recognition","summary":" Nowadays, scene text recognition has attracted more and more attention due to\nits diverse applications. Most state-of-the-art methods adopt an\nencoder-decoder framework with the attention mechanism, autoregressively\ngenerating text from left to right. Despite the convincing performance, this\nsequential decoding strategy constrains inference speed. Conversely,\nnon-autoregressive models provide faster, simultaneous predictions but often\nsacrifice accuracy. Although utilizing an explicit language model can improve\nperformance, it burdens the computational load. Besides, separating linguistic\nknowledge from vision information may harm the final prediction. In this paper,\nwe propose an alternative solution, using a parallel and iterative decoder that\nadopts an easy-first decoding strategy. Furthermore, we regard text recognition\nas an image-based conditional text generation task and utilize the discrete\ndiffusion strategy, ensuring exhaustive exploration of bidirectional contextual\ninformation. Extensive experiments demonstrate that the proposed approach\nachieves superior results on the benchmark datasets, including both Chinese and\nEnglish text images.\n","authors":["Xiaomeng Yang","Zhi Qiao","Yu Zhou","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11923v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08288v2","updated":"2023-12-19T07:50:23Z","published":"2023-08-16T11:20:23Z","title":"Improving Audio-Visual Segmentation with Bidirectional Generation","summary":" The aim of audio-visual segmentation (AVS) is to precisely differentiate\naudible objects within videos down to the pixel level. Traditional approaches\noften tackle this challenge by combining information from various modalities,\nwhere the contribution of each modality is implicitly or explicitly modeled.\nNevertheless, the interconnections between different modalities tend to be\noverlooked in audio-visual modeling. In this paper, inspired by the human\nability to mentally simulate the sound of an object and its visual appearance,\nwe introduce a bidirectional generation framework. This framework establishes\nrobust correlations between an object's visual characteristics and its\nassociated sound, thereby enhancing the performance of AVS. To achieve this, we\nemploy a visual-to-audio projection component that reconstructs audio features\nfrom object segmentation masks and minimizes reconstruction errors. Moreover,\nrecognizing that many sounds are linked to object movements, we introduce an\nimplicit volumetric motion estimation module to handle temporal dynamics that\nmay be challenging to capture using conventional optical flow methods. To\nshowcase the effectiveness of our approach, we conduct comprehensive\nexperiments and analyses on the widely recognized AVSBench benchmark. As a\nresult, we establish a new state-of-the-art performance level in the AVS\nbenchmark, particularly excelling in the challenging MS3 subset which involves\nsegmenting multiple sound sources. To facilitate reproducibility, we plan to\nrelease both the source code and the pre-trained model.\n","authors":["Dawei Hao","Yuxin Mao","Bowen He","Xiaodong Han","Yuchao Dai","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2308.08288v2.pdf","comment":"AAAI Camera Ready. Dawei Hao and Yuxin Mao contribute equality to\n this paper. Yiran Zhong is the corresponding author. The code will be\n released at https://github.com/OpenNLPLab/AVS-bidirectional"},{"id":"http://arxiv.org/abs/2312.10686v2","updated":"2023-12-19T07:49:07Z","published":"2023-12-17T11:11:02Z","title":"Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated\n Outlier Class Learning","summary":" Existing out-of-distribution (OOD) methods have shown great success on\nbalanced datasets but become ineffective in long-tailed recognition (LTR)\nscenarios where 1) OOD samples are often wrongly classified into head classes\nand/or 2) tail-class samples are treated as OOD samples. To address these\nissues, current studies fit a prior distribution of auxiliary/pseudo OOD data\nto the long-tailed in-distribution (ID) data. However, it is difficult to\nobtain such an accurate prior distribution given the unknowingness of real OOD\nsamples and heavy class imbalance in LTR. A straightforward solution to avoid\nthe requirement of this prior is to learn an outlier class to encapsulate the\nOOD samples. The main challenge is then to tackle the aforementioned confusion\nbetween OOD samples and head/tail-class samples when learning the outlier\nclass. To this end, we introduce a novel calibrated outlier class learning\n(COCL) approach, in which 1) a debiased large margin learning method is\nintroduced in the outlier class learning to distinguish OOD samples from both\nhead and tail classes in the representation space and 2) an outlier-class-aware\nlogit calibration method is defined to enhance the long-tailed classification\nconfidence. Extensive empirical results on three popular benchmarks CIFAR10-LT,\nCIFAR100-LT, and ImageNet-LT demonstrate that COCL substantially outperforms\nstate-of-the-art OOD detection methods in LTR while being able to improve the\nclassification accuracy on ID data. Code is available at\nhttps://github.com/mala-lab/COCL.\n","authors":["Wenjun Miao","Guansong Pang","Tianqi Li","Xiao Bai","Jin Zheng"],"pdf_url":"https://arxiv.org/pdf/2312.10686v2.pdf","comment":"AAAI2024, with supplementary material"},{"id":"http://arxiv.org/abs/2303.10343v2","updated":"2023-12-19T07:44:31Z","published":"2023-03-18T06:13:30Z","title":"Supervision Interpolation via LossMix: Generalizing Mixup for Object\n Detection and Beyond","summary":" The success of data mixing augmentations in image classification tasks has\nbeen well-received. However, these techniques cannot be readily applied to\nobject detection due to challenges such as spatial misalignment,\nforeground/background distinction, and plurality of instances. To tackle these\nissues, we first introduce a novel conceptual framework called Supervision\nInterpolation (SI), which offers a fresh perspective on interpolation-based\naugmentations by relaxing and generalizing Mixup. Based on SI, we propose\nLossMix, a simple yet versatile and effective regularization that enhances the\nperformance and robustness of object detectors and more. Our key insight is\nthat we can effectively regularize the training on mixed data by interpolating\ntheir loss errors instead of ground truth labels. Empirical results on the\nPASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently\noutperform state-of-the-art methods widely adopted for detection. Furthermore,\nby jointly leveraging LossMix with unsupervised domain adaptation, we\nsuccessfully improve existing approaches and set a new state of the art for\ncross-domain object detection.\n","authors":["Thanh Vu","Baochen Sun","Bodi Yuan","Alex Ngai","Yueqi Li","Jan-Michael Frahm"],"pdf_url":"https://arxiv.org/pdf/2303.10343v2.pdf","comment":"AAAI-24 Camera Ready Version, with supplementary material, 15 pages"},{"id":"http://arxiv.org/abs/2312.11911v1","updated":"2023-12-19T07:39:45Z","published":"2023-12-19T07:39:45Z","title":"EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State\n Estimation and 3D Dense Mapping","summary":" Event cameras are bio-inspired, motion-activated sensors that demonstrate\nsubstantial potential in handling challenging situations, such as motion blur\nand high-dynamic range. In this paper, we proposed EVI-SAM to tackle the\nproblem of 6 DoF pose tracking and 3D reconstruction using monocular event\ncamera. A novel event-based hybrid tracking framework is designed to estimate\nthe pose, leveraging the robustness of feature matching and the precision of\ndirect alignment. Specifically, we develop an event-based 2D-2D alignment to\nconstruct the photometric constraint, and tightly integrate it with the\nevent-based reprojection constraint. The mapping module recovers the dense and\ncolorful depth of the scene through the image-guided event-based mapping\nmethod. Subsequently, the appearance, texture, and surface mesh of the 3D scene\ncan be reconstructed by fusing the dense depth map from multiple viewpoints\nusing truncated signed distance function (TSDF) fusion. To the best of our\nknowledge, this is the first non-learning work to realize event-based dense\nmapping. Numerical evaluations are performed on both publicly available and\nself-collected datasets, which qualitatively and quantitatively demonstrate the\nsuperior performance of our method. Our EVI-SAM effectively balances accuracy\nand robustness while maintaining computational efficiency, showcasing superior\npose tracking and dense mapping performance in challenging scenarios. Video\nDemo: https://youtu.be/Nn40U4e5Si8.\n","authors":["Weipeng Guan","Peiyu Chen","Huibin Zhao","Yu Wang","Peng Lu"],"pdf_url":"https://arxiv.org/pdf/2312.11911v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.06999v3","updated":"2023-12-19T07:30:25Z","published":"2023-03-13T10:54:52Z","title":"Identifying Label Errors in Object Detection Datasets by Loss Inspection","summary":" Labeling datasets for supervised object detection is a dull and\ntime-consuming task. Errors can be easily introduced during annotation and\noverlooked during review, yielding inaccurate benchmarks and performance\ndegradation of deep neural networks trained on noisy labels. In this work, we\nfor the first time introduce a benchmark for label error detection methods on\nobject detection datasets as well as a label error detection method and a\nnumber of baselines. We simulate four different types of randomly introduced\nlabel errors on train and test sets of well-labeled object detection datasets.\nFor our label error detection method we assume a two-stage object detector to\nbe given and consider the sum of both stages' classification and regression\nlosses. The losses are computed with respect to the predictions and the noisy\nlabels including simulated label errors, aiming at detecting the latter. We\ncompare our method to three baselines: a naive one without deep learning, the\nobject detector's score and the entropy of the classification softmax\ndistribution. We outperform all baselines and demonstrate that among the\nconsidered methods, ours is the only one that detects label errors of all four\ntypes efficiently. Furthermore, we detect real label errors a) on commonly used\ntest datasets in object detection and b) on a proprietary dataset. In both\ncases we achieve low false positives rates, i.e., we detect label errors with a\nprecision for a) of up to 71.5% and for b) with 97%.\n","authors":["Marius Schubert","Tobias Riedlinger","Karsten Kahl","Daniel Kröll","Sebastian Schoenen","Siniša Šegvić","Matthias Rottmann"],"pdf_url":"https://arxiv.org/pdf/2303.06999v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08206v2","updated":"2023-12-19T07:21:03Z","published":"2023-10-12T10:51:23Z","title":"Long-Tailed Classification Based on Coarse-Grained Leading Forest and\n Multi-Center Loss","summary":" Long-tailed (LT) classification is an unavoidable and challenging problem in\nthe real world. Most existing long-tailed classification methods focus only on\nsolving the class-wise imbalance while ignoring the attribute-wise imbalance.\nThe deviation of a classification model is caused by both class-wise and\nattribute-wise imbalance. Due to the fact that attributes are implicit in most\ndatasets and the combination of attributes is complex, attribute-wise imbalance\nis more difficult to handle. For this purpose, we proposed a novel long-tailed\nclassification framework, aiming to build a multi-granularity classification\nmodel by means of invariant feature learning. This method first unsupervisedly\nconstructs Coarse-Grained forest (CLF) to better characterize the distribution\nof attributes within a class. Depending on the distribution of attributes, one\ncan customize suitable sampling strategies to construct different imbalanced\ndatasets. We then introduce multi-center loss (MCL) that aims to gradually\neliminate confusing attributes during feature learning process. The proposed\nframework does not necessarily couple to a specific LT classification model\nstructure and can be integrated with any existing LT method as an independent\ncomponent. Extensive experiments show that our approach achieves\nstate-of-the-art performance on both existing benchmarks ImageNet-GLT and\nMSCOCO-GLT and can improve the performance of existing LT methods. Our codes\nare available on GitHub: \\url{https://github.com/jinyery/cognisance}\n","authors":["Jinye Yang","Ji Xu","Di Wu","Jianhang Tang","Shaobo Li","Guoyin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.08206v2.pdf","comment":"This is another research work to apply leading tree structure along\n with deep learning architecture, aiming to deal with attribute-wise long-tail\n distribution within class"},{"id":"http://arxiv.org/abs/2311.15570v2","updated":"2023-12-19T07:12:21Z","published":"2023-11-27T06:38:07Z","title":"UFDA: Universal Federated Domain Adaptation with Practical Assumptions","summary":" Conventional Federated Domain Adaptation (FDA) approaches usually demand an\nabundance of assumptions, which makes them significantly less feasible for\nreal-world situations and introduces security hazards. This paper relaxes the\nassumptions from previous FDAs and studies a more practical scenario named\nUniversal Federated Domain Adaptation (UFDA). It only requires the black-box\nmodel and the label set information of each source domain, while the label sets\nof different source domains could be inconsistent, and the target-domain label\nset is totally blind. Towards a more effective solution for our newly proposed\nUFDA scenario, we propose a corresponding methodology called Hot-Learning with\nContrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain\nshifts and category gaps problems by using one-hot outputs from the black-box\nmodels of various source domains. Moreover, to better distinguish the shared\nand unknown classes, we further present a cluster-level strategy named\nMutual-Voting Decision (MVD) to extract robust consensus knowledge across peer\nclasses from both source and target domains. Extensive experiments on three\nbenchmark datasets demonstrate that our method achieves comparable performance\nfor our UFDA scenario with much fewer assumptions, compared to previous\nmethodologies with comprehensive additional assumptions.\n","authors":["Xinhui Liu","Zhenghao Chen","Luping Zhou","Dong Xu","Wei Xi","Gairui Bai","Yihan Zhao","Jizhong Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.15570v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.04326v2","updated":"2023-12-19T06:50:21Z","published":"2023-12-07T14:37:01Z","title":"iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image\n Diffusion Model for Interior Design","summary":" With the open-sourcing of text-to-image models (T2I) such as stable diffusion\n(SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned\nin specific domains based on the open-source SD model, such as in anime,\ncharacter portraits, etc. However, there are few specialized models in certain\ndomains, such as interior design, which is attributed to the complex textual\ndescriptions and detailed visual elements inherent in design, alongside the\nnecessity for adaptable resolution. Therefore, text-to-image models for\ninterior design are required to have outstanding prompt-following capabilities,\nas well as iterative collaboration with design professionals to achieve the\ndesired outcome. In this paper, we collect and optimize text-image data in the\ndesign field and continue training in both English and Chinese on the basis of\nthe open-source CLIP model. We also proposed a fine-tuning strategy with\ncurriculum learning and reinforcement learning from CLIP feedback to enhance\nthe prompt-following capabilities of our approach so as to improve the quality\nof image generation. The experimental results on the collected dataset\ndemonstrate the effectiveness of the proposed approach, which achieves\nimpressive results and outperforms strong baselines.\n","authors":["Ruyi Gan","Xiaojun Wu","Junyu Lu","Yuanhe Tian","Dixiang Zhang","Ziwei Wu","Renliang Sun","Chang Liu","Jiaxing Zhang","Pingjian Zhang","Yan Song"],"pdf_url":"https://arxiv.org/pdf/2312.04326v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11897v1","updated":"2023-12-19T06:42:47Z","published":"2023-12-19T06:42:47Z","title":"Text-Conditioned Resampler For Long Form Video Understanding","summary":" Videos are highly redundant data source and it is often enough to identify a\nfew key moments to solve any given task. In this paper, we present a\ntext-conditioned video resampler (TCR) module that uses a pre-trained and\nfrozen visual encoder and large language model (LLM) to process long video\nsequences for a task. TCR localises relevant visual features from the video\ngiven a text condition and provides them to a LLM to generate a text response.\nDue to its lightweight design and use of cross-attention, TCR can process more\nthan 100 frames at a time allowing the model to use much longer chunks of video\nthan earlier works. We make the following contributions: (i) we design a\ntransformer-based sampling architecture that can process long videos\nconditioned on a task, together with a training method that enables it to\nbridge pre-trained visual and language models; (ii) we empirically validate its\nefficacy on a wide variety of evaluation tasks, and set a new state-of-the-art\non NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks\nwhich require longer video contexts and that can thus be used effectively for\nfurther evaluation of long-range video models.\n","authors":["Bruno Korbar","Yongqin Xian","Alessio Tonioni","Andrew Zisserman","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2312.11897v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11894v1","updated":"2023-12-19T06:38:18Z","published":"2023-12-19T06:38:18Z","title":"3D-LFM: Lifting Foundation Model","summary":" The lifting of 3D structure and camera from 2D landmarks is at the\ncornerstone of the entire discipline of computer vision. Traditional methods\nhave been confined to specific rigid objects, such as those in\nPerspective-n-Point (PnP) problems, but deep learning has expanded our\ncapability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL)\nwith resilience to noise, occlusions, and perspective distortions. All these\ntechniques, however, have been limited by the fundamental need to establish\ncorrespondences across the 3D training data -- significantly limiting their\nutility to applications where one has an abundance of \"in-correspondence\" 3D\ndata. Our approach harnesses the inherent permutation equivariance of\ntransformers to manage varying number of points per 3D data instance,\nwithstands occlusions, and generalizes to unseen categories. We demonstrate\nstate of the art performance across 2D-3D lifting task benchmarks. Since our\napproach can be trained across such a broad class of structures we refer to it\nsimply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.\n","authors":["Mosam Dabhi","Laszlo A. Jeni","Simon Lucey"],"pdf_url":"https://arxiv.org/pdf/2312.11894v1.pdf","comment":"Project page is available at https://3dlfm.github.io"},{"id":"http://arxiv.org/abs/2305.07490v4","updated":"2023-12-19T06:27:45Z","published":"2023-05-12T14:04:30Z","title":"ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models\n with Enhanced Adapter","summary":" In recent years, advancements in large language models have been remarkable,\nwith models such as ChatGPT demonstrating exceptional proficiency in diverse\nlinguistic tasks. The pre-training of large models with billions of parameters,\nposes a formidable challenge, primarily due to the scarcity of datasets of a\ncommensurate scale for effective training. Nevertheless, innovative strategies\nhave emerged, including methods to fine-tune these pre-trained models using\nfewer parameters set, as evidenced by models like MiniGPT-4 and LLaVA. Despite\ntheir potential in various domains, these models remain limited in their\nunderstanding of artistic imagery. They have yet to fully grasp the intricate\nnuances of art images or to provide an objective articulation of the emotions\nthey evoke, in a manner akin to human perception. This work introduces\nArtGPT-4, a pioneering large vision-language model tailored to address the\ndeficiencies of contemporary models in artistic comprehension. ArtGPT-4\nunderwent training on image-text pairs utilizing a Tesla A100 device in a mere\n2 hours, with a dataset comprising approximately 0.52M entries. Impressively,\nthe model can render images with an artistic-understanding and convey the\nemotions they inspire, mirroring human interpretation. Additionally, this work\npresents a unique dataset designed to evaluate the efficacy of vision-language\nmodels. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art\nperformance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the\nestablished benchmarks introduced in This study, lagging behind professional\nartists' descriptions by a negligible 0.15 points on a 6-point scale. The code\nand the pre-trained model are accessible in\nhttps://huggingface.co/Tyrannosaurus/ArtGPT-4.\n","authors":["Zhengqing Yuan","Xinyi Wang","Kun Wang","Lichao Sun","Yanfang Ye"],"pdf_url":"https://arxiv.org/pdf/2305.07490v4.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2312.10737v2","updated":"2023-12-19T06:24:13Z","published":"2023-12-17T14:52:31Z","title":"Traffic Incident Database with Multiple Labels Including Various\n Perspective Environmental Information","summary":" A large dataset of annotated traffic accidents is necessary to improve the\naccuracy of traffic accident recognition using deep learning models.\nConventional traffic accident datasets provide annotations on traffic accidents\nand other teacher labels, improving traffic accident recognition performance.\nHowever, the labels annotated in conventional datasets need to be more\ncomprehensive to describe traffic accidents in detail. Therefore, we propose\nV-TIDB, a large-scale traffic accident recognition dataset annotated with\nvarious environmental information as multi-labels. Our proposed dataset aims to\nimprove the performance of traffic accident recognition by annotating ten types\nof environmental information as teacher labels in addition to the presence or\nabsence of traffic accidents. V-TIDB is constructed by collecting many videos\nfrom the Internet and annotating them with appropriate environmental\ninformation. In our experiments, we compare the performance of traffic accident\nrecognition when only labels related to the presence or absence of traffic\naccidents are trained and when environmental information is added as a\nmulti-label. In the second experiment, we compare the performance of the\ntraining with only contact level, which represents the severity of the traffic\naccident, and the performance with environmental information added as a\nmulti-label. The results showed that 6 out of 10 environmental information\nlabels improved the performance of recognizing the presence or absence of\ntraffic accidents. In the experiment on the degree of recognition of traffic\naccidents, the performance of recognition of car wrecks and contacts was\nimproved for all environmental information. These experiments show that V-TIDB\ncan be used to learn traffic accident recognition models that take\nenvironmental information into account in detail and can be used for\nappropriate traffic accident analysis.\n","authors":["Shota Nishiyama","Takuma Saito","Ryo Nakamura","Go Ohtani","Hirokatsu Kataoka","Kensho Hara"],"pdf_url":"https://arxiv.org/pdf/2312.10737v2.pdf","comment":"Conference paper accepted to IEEE/RSJ International Conference on\n Intelligent Robots and Systems (IROS), 2023 Reason for revision: Corrected\n due to a missing space between sentences in the preview's abstract, which led\n to an unintended URL interpretation"},{"id":"http://arxiv.org/abs/2312.11880v1","updated":"2023-12-19T06:13:58Z","published":"2023-12-19T06:13:58Z","title":"Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case\n Study on Urban Areas","summary":" Urban environments are characterized by complex structures and diverse\nfeatures, making accurate segmentation of point cloud data a challenging task.\nThis paper presents a comprehensive study on the application of RandLA-Net, a\nstate-of-the-art neural network architecture, for the 3D segmentation of\nlarge-scale point cloud data in urban areas. The study focuses on three major\nChinese cities, namely Chengdu, Jiaoda, and Shenzhen, leveraging their unique\ncharacteristics to enhance segmentation performance.\n To address the limited availability of labeled data for these specific urban\nareas, we employed transfer learning techniques. We transferred the learned\nweights from the Sensat Urban and Toronto 3D datasets to initialize our\nRandLA-Net model. Additionally, we performed class remapping to adapt the model\nto the target urban areas, ensuring accurate segmentation results.\n The experimental results demonstrate the effectiveness of the proposed\napproach achieving over 80\\% F1 score for each areas in 3D point cloud\nsegmentation. The transfer learning strategy proves to be crucial in overcoming\ndata scarcity issues, providing a robust solution for urban point cloud\nanalysis. The findings contribute to the advancement of point cloud\nsegmentation methods, especially in the context of rapidly evolving Chinese\nurban areas.\n","authors":["Alperen Enes Bayar","Ufuk Uyan","Elif Toprak","Cao Yuheng","Tang Juncheng","Ahmet Alp Kindiroglu"],"pdf_url":"https://arxiv.org/pdf/2312.11880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.14197v4","updated":"2023-12-19T05:54:13Z","published":"2022-12-29T07:03:29Z","title":"PointVST: Self-Supervised Pre-training for 3D Point Clouds via\n View-Specific Point-to-Image Translation","summary":" The past few years have witnessed the great success and prevalence of\nself-supervised representation learning within the language and 2D vision\ncommunities. However, such advancements have not been fully migrated to the\nfield of 3D point cloud learning. Different from existing pre-training\nparadigms designed for deep point cloud feature extractors that fall into the\nscope of generative modeling or contrastive learning, this paper proposes a\ntranslative pre-training framework, namely PointVST, driven by a novel\nself-supervised pretext task of cross-modal translation from 3D point clouds to\ntheir corresponding diverse forms of 2D rendered images. More specifically, we\nbegin with deducing view-conditioned point-wise embeddings through the\ninsertion of the viewpoint indicator, and then adaptively aggregate a\nview-specific global codeword, which can be further fed into subsequent 2D\nconvolutional translation heads for image generation. Extensive experimental\nevaluations on various downstream task scenarios demonstrate that our PointVST\nshows consistent and prominent performance superiority over current\nstate-of-the-art approaches as well as satisfactory domain transfer capability.\nOur code will be publicly available at https://github.com/keeganhk/PointVST.\n","authors":["Qijian Zhang","Junhui Hou"],"pdf_url":"https://arxiv.org/pdf/2212.14197v4.pdf","comment":"Accepted in IEEE TVCG"},{"id":"http://arxiv.org/abs/2312.11872v1","updated":"2023-12-19T05:52:38Z","published":"2023-12-19T05:52:38Z","title":"Beyond Prototypes: Semantic Anchor Regularization for Better\n Representation Learning","summary":" One of the ultimate goals of representation learning is to achieve\ncompactness within a class and well-separability between classes. Many\noutstanding metric-based and prototype-based methods following the\nExpectation-Maximization paradigm, have been proposed for this objective.\nHowever, they inevitably introduce biases into the learning process,\nparticularly with long-tail distributed training data. In this paper, we reveal\nthat the class prototype is not necessarily to be derived from training\nfeatures and propose a novel perspective to use pre-defined class anchors\nserving as feature centroid to unidirectionally guide feature learning.\nHowever, the pre-defined anchors may have a large semantic distance from the\npixel features, which prevents them from being directly applied. To address\nthis issue and generate feature centroid independent from feature learning, a\nsimple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR\nensures the interclass separability of semantic anchors in the semantic space\nby employing a classifier-aware auxiliary cross-entropy loss during training\nvia disentanglement learning. By pulling the learned features to these semantic\nanchors, several advantages can be attained: 1) the intra-class compactness and\nnaturally inter-class separability, 2) induced bias or errors from feature\nlearning can be avoided, and 3) robustness to the long-tailed problem. The\nproposed SAR can be used in a plug-and-play manner in the existing models.\nExtensive experiments demonstrate that the SAR performs better than previous\nsophisticated prototype-based methods. The implementation is available at\nhttps://github.com/geyanqi/SAR.\n","authors":["Yanqi Ge","Qiang Nie","Ye Huang","Yong Liu","Chengjie Wang","Feng Zheng","Wen Li","Lixin Duan"],"pdf_url":"https://arxiv.org/pdf/2312.11872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11681v3","updated":"2023-12-19T05:51:18Z","published":"2023-03-21T08:43:15Z","title":"DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic\n Segmentation Using Diffusion Models","summary":" Collecting and annotating images with pixel-wise labels is time-consuming and\nlaborious. In contrast, synthetic data can be freely available using a\ngenerative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that\nit is possible to automatically obtain accurate semantic masks of synthetic\nimages generated by the Off-the-shelf Stable Diffusion model, which uses only\ntext-image pairs during training. Our approach, called DiffuMask, exploits the\npotential of the cross-attention map between text and image, which is natural\nand seamless to extend the text-driven image synthesis to semantic mask\ngeneration. DiffuMask uses text-guided cross-attention information to localize\nclass/word-specific regions, which are combined with practical techniques to\ncreate a novel high-resolution and class-discriminative pixel-wise mask. The\nmethods help to reduce data collection and annotation costs obviously.\nExperiments demonstrate that the existing segmentation methods trained on\nsynthetic data of DiffuMask can achieve a competitive performance over the\ncounterpart of real data (VOC 2012, Cityscapes). For some classes (e.g., bird),\nDiffuMask presents promising performance, close to the stateof-the-art result\nof real data (within 3% mIoU gap). Moreover, in the open-vocabulary\nsegmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on\nUnseen class of VOC 2012. The project website can be found at\nhttps://weijiawu.github.io/DiffusionMask/.\n","authors":["Weijia Wu","Yuzhong Zhao","Mike Zheng Shou","Hong Zhou","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2303.11681v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.10985v2","updated":"2023-12-19T05:51:09Z","published":"2021-12-21T05:07:54Z","title":"Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding","summary":" Drawing on theoretical insights, we advocate an error-based thresholding\n(EBT) mechanism for learned ISTA (LISTA), which utilizes a function of the\nlayer-wise reconstruction error to suggest a specific threshold for each\nobservation in the shrinkage function of each layer. We show that the proposed\nEBT mechanism well disentangles the learnable parameters in the shrinkage\nfunctions from the reconstruction errors, endowing the obtained models with\nimproved adaptivity to possible data variations. With rigorous analyses, we\nfurther show that the proposed EBT also leads to a faster convergence on the\nbasis of LISTA or its variants, in addition to its higher adaptivity. Extensive\nexperimental results confirm our theoretical analyses and verify the\neffectiveness of our methods.\n","authors":["Ziang Li","Kailun Wu","Yiwen Guo","Changshui Zhang"],"pdf_url":"https://arxiv.org/pdf/2112.10985v2.pdf","comment":"Accepted in ICASSP2024"},{"id":"http://arxiv.org/abs/2312.11867v1","updated":"2023-12-19T05:38:14Z","published":"2023-12-19T05:38:14Z","title":"Point Cloud Part Editing: Segmentation, Generation, Assembly, and\n Selection","summary":" Ideal part editing should guarantee the diversity of edited parts, the\nfidelity to the remaining parts, and the quality of the results. However,\nprevious methods do not disentangle each part completely, which means the\nedited parts will affect the others, resulting in poor diversity and fidelity.\nIn addition, some methods lack constraints between parts, which need manual\nselections of edited results to ensure quality. Therefore, we propose a\nfour-stage process for point cloud part editing: Segmentation, Generation,\nAssembly, and Selection. Based on this process, we introduce SGAS, a model for\npart editing that employs two strategies: feature disentanglement and\nconstraint. By independently fitting part-level feature distributions, we\nrealize the feature disentanglement. By explicitly modeling the transformation\nfrom object-level distribution to part-level distributions, we realize the\nfeature constraint. Considerable experiments on different datasets demonstrate\nthe efficiency and effectiveness of SGAS on point cloud part editing. In\naddition, SGAS can be pruned to realize unsupervised part-aware point cloud\ngeneration and achieves state-of-the-art results.\n","authors":["Kaiyi Zhang","Yang Chen","Ximing Yang","Weizhong Zhang","Cheng Jin"],"pdf_url":"https://arxiv.org/pdf/2312.11867v1.pdf","comment":"9 pages, 7 figures, AAAI 2024"},{"id":"http://arxiv.org/abs/2303.14628v2","updated":"2023-12-19T05:28:13Z","published":"2023-03-26T05:26:30Z","title":"Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature\n Fusion in Dynamic Scenes","summary":" Multi-frame methods improve monocular depth estimation over single-frame\napproaches by aggregating spatial-temporal information via feature matching.\nHowever, the spatial-temporal feature leads to accuracy degradation in dynamic\nscenes. To enhance the performance, recent methods tend to propose complex\narchitectures for feature matching and dynamic scenes. In this paper, we show\nthat a simple learning framework, together with designed feature augmentation,\nleads to superior performance. (1) A novel dynamic objects detecting method\nwith geometry explainability is proposed. The detected dynamic objects are\nexcluded during training, which guarantees the static environment assumption\nand relieves the accuracy degradation problem of the multi-frame depth\nestimation. (2) Multi-scale feature fusion is proposed for feature matching in\nthe multi-frame depth network, which improves feature matching, especially\nbetween frames with large camera motion. (3) The robust knowledge distillation\nwith a robust teacher network and reliability guarantee is proposed, which\nimproves the multi-frame depth estimation without computation complexity\nincrease during the test. The experiments show that our proposed methods\nachieve great performance improvement on the multi-frame depth estimation.\n","authors":["Jiquan Zhong","Xiaolin Huang","Xiao Yu"],"pdf_url":"https://arxiv.org/pdf/2303.14628v2.pdf","comment":"11 pages, 8 figures, ACM MM'23 accepted"},{"id":"http://arxiv.org/abs/2312.11862v1","updated":"2023-12-19T05:14:31Z","published":"2023-12-19T05:14:31Z","title":"Topo-MLP : A Simplicial Network Without Message Passing","summary":" Due to their ability to model meaningful higher order relations among a set\nof entities, higher order network models have emerged recently as a powerful\nalternative for graph-based network models which are only capable of modeling\nbinary relationships. Message passing paradigm is still dominantly used to\nlearn representations even for higher order network models. While powerful,\nmessage passing can have disadvantages during inference, particularly when the\nhigher order connectivity information is missing or corrupted. To overcome such\nlimitations, we propose Topo-MLP, a purely MLP-based simplicial neural network\nalgorithm to learn the representation of elements in a simplicial complex\nwithout explicitly relying on message passing. Our framework utilizes a novel\nHigher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates\nthe simplicial structure into representation learning. Our proposed model's\nsimplicity makes it faster during inference. Moreover, we show that our model\nis robust when faced with missing or corrupted connectivity structure.\n","authors":["Karthikeyan Natesan Ramamurthy","Aldo Guzmán-Sáenz","Mustafa Hajij"],"pdf_url":"https://arxiv.org/pdf/2312.11862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11376v2","updated":"2023-12-19T05:08:45Z","published":"2023-12-18T17:39:47Z","title":"CLIM: Contrastive Language-Image Mosaic for Region Representation","summary":" Detecting objects accurately from a large or open vocabulary necessitates the\nvision-language alignment on region representations. However, learning such a\nregion-text alignment by obtaining high-quality box annotations with text\nlabels or descriptions is expensive and infeasible. In contrast, collecting\nimage-text pairs is simpler but lacks precise object location information to\nassociate regions with texts. In this paper, we propose a novel approach called\nContrastive Language-Image Mosaic (CLIM), which leverages large-scale\nimage-text pairs effectively for aligning region and text representations. CLIM\ncombines multiple images into a mosaicked image and treats each image as a\n`pseudo region'. The feature of each pseudo region is extracted and trained to\nbe similar to the corresponding text embedding while dissimilar from others by\na contrastive loss, enabling the model to learn the region-text alignment\nwithout costly box annotations. As a generally applicable approach, CLIM\nconsistently improves different open-vocabulary object detection methods that\nuse caption supervision. Furthermore, CLIM can effectively enhance the region\nrepresentation of vision-language models, thus providing stronger backbones for\nopen-vocabulary object detectors. Our experimental results demonstrate that\nCLIM improves different baseline open-vocabulary object detectors by a large\nmargin on both OV-COCO and OV-LVIS benchmarks. The code is available at\nhttps://github.com/wusize/CLIM.\n","authors":["Size Wu","Wenwei Zhang","Lumin Xu","Sheng Jin","Wentao Liu","Chen Change Loy"],"pdf_url":"https://arxiv.org/pdf/2312.11376v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11856v1","updated":"2023-12-19T04:55:33Z","published":"2023-12-19T04:55:33Z","title":"Self-supervised Learning for Enhancing Geometrical Modeling in 3D-Aware\n Generative Adversarial Network","summary":" 3D-aware Generative Adversarial Networks (3D-GANs) currently exhibit\nartifacts in their 3D geometrical modeling, such as mesh imperfections and\nholes. These shortcomings are primarily attributed to the limited availability\nof annotated 3D data, leading to a constrained \"valid latent area\" for\nsatisfactory modeling. To address this, we present a Self-Supervised Learning\n(SSL) technique tailored as an auxiliary loss for any 3D-GAN, designed to\nimprove its 3D geometrical modeling capabilities. Our approach pioneers an\ninversion technique for 3D-GANs, integrating an encoder that performs adaptive\nspatially-varying range operations. Utilizing this inversion, we introduce the\nCyclic Generative Constraint (CGC), aiming to densify the valid latent space.\nThe CGC operates via augmented local latent vectors that maintain the same\ngeometric form, and it imposes constraints on the cycle path outputs,\nspecifically the generator-encoder-generator sequence. This SSL methodology\nseamlessly integrates with the inherent GAN loss, ensuring the integrity of\npre-existing 3D-GAN architectures without necessitating alterations. We\nvalidate our approach with comprehensive experiments across various datasets\nand architectures, underscoring its efficacy. Our project website:\nhttps://3dgan-ssl.github.io\n","authors":["Jiarong Guo","Xiaogang Xu","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.11856v1.pdf","comment":"13 pages, 12 figures, 6 tables"},{"id":"http://arxiv.org/abs/2305.10701v2","updated":"2023-12-19T04:41:50Z","published":"2023-05-18T04:28:47Z","title":"Personalization as a Shortcut for Few-Shot Backdoor Attack against\n Text-to-Image Diffusion Models","summary":" Although recent personalization methods have democratized high-resolution\nimage synthesis by enabling swift concept acquisition with minimal examples and\nlightweight computation, they also present an exploitable avenue for high\naccessible backdoor attacks. This paper investigates a critical and unexplored\naspect of text-to-image (T2I) diffusion models - their potential vulnerability\nto backdoor attacks via personalization. Our study focuses on a zero-day\nbackdoor vulnerability prevalent in two families of personalization methods,\nepitomized by Textual Inversion and DreamBooth.Compared to traditional backdoor\nattacks, our proposed method can facilitate more precise, efficient, and easily\naccessible attacks with a lower barrier to entry. We provide a comprehensive\nreview of personalization in T2I diffusion models, highlighting the operation\nand exploitation potential of this backdoor vulnerability. To be specific, by\nstudying the prompt processing of Textual Inversion and DreamBooth, we have\ndevised dedicated backdoor attacks according to the different ways of dealing\nwith unseen tokens and analyzed the influence of triggers and concept images on\nthe attack effect. Through comprehensive empirical study, we endorse the\nutilization of the nouveau-token backdoor attack due to its impressive\neffectiveness, stealthiness, and integrity, markedly outperforming the\nlegacy-token backdoor attack.\n","authors":["Yihao Huang","Felix Juefei-Xu","Qing Guo","Jie Zhang","Yutong Wu","Ming Hu","Tianlin Li","Geguang Pu","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2305.10701v2.pdf","comment":"10 pages, accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11850v1","updated":"2023-12-19T04:35:24Z","published":"2023-12-19T04:35:24Z","title":"GCNext: Towards the Unity of Graph Convolutions for Human Motion\n Prediction","summary":" The past few years has witnessed the dominance of Graph Convolutional\nNetworks (GCNs) over human motion prediction.Various styles of graph\nconvolutions have been proposed, with each one meticulously designed and\nincorporated into a carefully-crafted network architecture. This paper breaks\nthe limits of existing knowledge by proposing Universal Graph Convolution\n(UniGC), a novel graph convolution concept that re-conceptualizes different\ngraph convolutions as its special cases. Leveraging UniGC on network-level, we\npropose GCNext, a novel GCN-building paradigm that dynamically determines the\nbest-fitting graph convolutions both sample-wise and layer-wise. GCNext offers\nmultiple use cases, including training a new GCN from scratch or refining a\npreexisting GCN. Experiments on Human3.6M, AMASS, and 3DPW datasets show that,\nby incorporating unique module-to-network designs, GCNext yields up to 9x lower\ncomputational cost than existing GCN methods, on top of achieving\nstate-of-the-art performance.\n","authors":["Xinshun Wang","Qiongjie Cui","Chen Chen","Mengyuan Liu"],"pdf_url":"https://arxiv.org/pdf/2312.11850v1.pdf","comment":"to be published in the 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI-24)"},{"id":"http://arxiv.org/abs/2312.11849v1","updated":"2023-12-19T04:34:15Z","published":"2023-12-19T04:34:15Z","title":"Active contours driven by local and global intensity fitting energy with\n application to SAR image segmentation and its fast solvers","summary":" In this paper, we propose a novel variational active contour model based on\nAubert-Aujol (AA) denoising model, which hybrides geodesic active contour (GAC)\nmodel with active contours without edges (ACWE) model and can be used to\nsegment images corrupted by multiplicative gamma noise. We transform the\nproposed model into classic ROF model by adding a proximity term. Inspired by a\nfast denosing algorithm proposed by Jia-Zhao recently, we propose two fast\nfixed point algorithms to solve SAR image segmentation question. Experimental\nresults for real SAR images show that the proposed image segmentation model can\nefficiently stop the contours at weak or blurred edges, and can automatically\ndetect the exterior and interior boundaries of images with multiplicative gamma\nnoise. The proposed fast fixed point algorithms are robustness to\ninitialization contour, and can further reduce about 15% of the time needed for\nalgorithm proposed by Goldstein-Osher.\n","authors":["Guangming Liu","Qi Liu","Jing Liang","Quanying Sun"],"pdf_url":"https://arxiv.org/pdf/2312.11849v1.pdf","comment":"20 pages,28 figures. arXiv admin note: substantial text overlap with\n arXiv:2312.08376, arXiv:2312.09365"},{"id":"http://arxiv.org/abs/2310.15646v4","updated":"2023-12-19T04:22:44Z","published":"2023-10-24T09:07:47Z","title":"Mean Teacher DETR with Masked Feature Alignment: A Robust Domain\n Adaptive Detection Transformer Framework","summary":" Unsupervised domain adaptation object detection (UDAOD) research on Detection\nTransformer(DETR) mainly focuses on feature alignment and existing methods can\nbe divided into two kinds, each of which has its unresolved issues. One-stage\nfeature alignment methods can easily lead to performance fluctuation and\ntraining stagnation. Two-stage feature alignment method based on mean teacher\ncomprises a pretraining stage followed by a self-training stage, each facing\nproblems in obtaining reliable pretrained model and achieving consistent\nperformance gains. Methods mentioned above have not yet explore how to utilize\nthe third related domain such as target-like domain to assist adaptation. To\naddress these issues, we propose a two-stage framework named MTM, i.e. Mean\nTeacher-DETR with Masked Feature Alignment. In the pretraining stage, we\nutilize labeled target-like images produced by image style transfer to avoid\nperformance fluctuation. In the self-training stage, we leverage unlabeled\ntarget images by pseudo labels based on mean teacher and propose a module\ncalled Object Queries Knowledge Transfer (OQKT) to ensure consistent\nperformance gains of the student model. Most importantly, we propose masked\nfeature alignment methods including Masked Domain Query-based Feature Alignment\n(MDQFA) and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain\nshift in a more robust way, which not only prevent training stagnation and lead\nto a robust pretrained model in the pretraining stage, but also enhance the\nmodel's target performance in the self-training stage. Experiments on three\nchallenging scenarios and a theoretical analysis verify the effectiveness of\nMTM.\n","authors":["Weixi Weng","Chun Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.15646v4.pdf","comment":"AAAI2024"},{"id":"http://arxiv.org/abs/2312.11841v1","updated":"2023-12-19T04:14:11Z","published":"2023-12-19T04:14:11Z","title":"MixRT: Mixed Neural Representations For Real-Time NeRF Rendering","summary":" Neural Radiance Field (NeRF) has emerged as a leading technique for novel\nview synthesis, owing to its impressive photorealistic reconstruction and\nrendering capability. Nevertheless, achieving real-time NeRF rendering in\nlarge-scale scenes has presented challenges, often leading to the adoption of\neither intricate baked mesh representations with a substantial number of\ntriangles or resource-intensive ray marching in baked representations. We\nchallenge these conventions, observing that high-quality geometry, represented\nby meshes with substantial triangles, is not necessary for achieving\nphotorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF\nrepresentation that includes a low-quality mesh, a view-dependent displacement\nmap, and a compressed NeRF model. This design effectively harnesses the\ncapabilities of existing graphics hardware, thus enabling real-time NeRF\nrendering on edge devices. Leveraging a highly-optimized WebGL-based rendering\nframework, our proposed MixRT attains real-time rendering speeds on edge\ndevices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),\nbetter rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360\ndatasets), and a smaller storage size (less than 80% compared to\nstate-of-the-art methods).\n","authors":["Chaojian Li","Bichen Wu","Peter Vajda"," Yingyan"," Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11841v1.pdf","comment":"Accepted by 3DV'24. Project Page: https://licj15.github.io/MixRT/"},{"id":"http://arxiv.org/abs/2312.11837v1","updated":"2023-12-19T04:09:05Z","published":"2023-12-19T04:09:05Z","title":"Regulating Intermediate 3D Features for Vision-Centric Autonomous\n Driving","summary":" Multi-camera perception tasks have gained significant attention in the field\nof autonomous driving. However, existing frameworks based on Lift-Splat-Shoot\n(LSS) in the multi-camera setting cannot produce suitable dense 3D features due\nto the projection nature and uncontrollable densification process. To resolve\nthis problem, we propose to regulate intermediate dense 3D features with the\nhelp of volume rendering. Specifically, we employ volume rendering to process\nthe dense 3D features to obtain corresponding 2D features (e.g., depth maps,\nsemantic maps), which are supervised by associated labels in the training. This\nmanner regulates the generation of dense 3D features on the feature level,\nproviding appropriate dense and unified features for multiple perception tasks.\nTherefore, our approach is termed Vampire, stands for \"Volume rendering As\nMulti-camera Perception Intermediate feature REgulator\". Experimental results\non the Occ3D and nuScenes datasets demonstrate that Vampire facilitates\nfine-grained and appropriate extraction of dense 3D features, and is\ncompetitive with existing SOTA methods across diverse downstream perception\ntasks like 3D occupancy prediction, LiDAR segmentation and 3D objection\ndetection, while utilizing moderate GPU resources. We provide a video\ndemonstration in the supplementary materials and Codes are available at\ngithub.com/cskkxjk/Vampire.\n","authors":["Junkai Xu","Liang Peng","Haoran Cheng","Linxuan Xia","Qi Zhou","Dan Deng","Wei Qian","Wenxiao Wang","Deng Cai"],"pdf_url":"https://arxiv.org/pdf/2312.11837v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.00634v2","updated":"2023-12-19T03:49:48Z","published":"2023-12-01T14:54:44Z","title":"A Recent Survey of Vision Transformers for Medical Image Segmentation","summary":" Medical image segmentation plays a crucial role in various healthcare\napplications, enabling accurate diagnosis, treatment planning, and disease\nmonitoring. Traditionally, convolutional neural networks (CNNs) dominated this\ndomain, excelling at local feature extraction. However, their limitations in\ncapturing long-range dependencies across image regions pose challenges for\nsegmenting complex, interconnected structures often encountered in medical\ndata. In recent years, Vision Transformers (ViTs) have emerged as a promising\ntechnique for addressing the challenges in medical image segmentation. Their\nmulti-scale attention mechanism enables effective modeling of long-range\ndependencies between distant structures, crucial for segmenting organs or\nlesions spanning the image. Additionally, ViTs' ability to discern subtle\npattern heterogeneity allows for the precise delineation of intricate\nboundaries and edges, a critical aspect of accurate medical image segmentation.\nHowever, they do lack image-related inductive bias and translational\ninvariance, potentially impacting their performance. Recently, researchers have\ncome up with various ViT-based approaches that incorporate CNNs in their\narchitectures, known as Hybrid Vision Transformers (HVTs) to capture local\ncorrelation in addition to the global information in the images. This survey\npaper provides a detailed review of the recent advancements in ViTs and HVTs\nfor medical image segmentation. Along with the categorization of ViT and\nHVT-based medical image segmentation approaches, we also present a detailed\noverview of their real-time applications in several medical image modalities.\nThis survey may serve as a valuable resource for researchers, healthcare\npractitioners, and students in understanding the state-of-the-art approaches\nfor ViT-based medical image segmentation.\n","authors":["Asifullah Khan","Zunaira Rauf","Abdul Rehman Khan","Saima Rathore","Saddam Hussain Khan","Najmus Saher Shah","Umair Farooq","Hifsa Asif","Aqsa Asif","Umme Zahoora","Rafi Ullah Khalil","Suleman Qamar","Umme Hani Asif","Faiza Babar Khan","Abdul Majid","Jeonghwan Gwak"],"pdf_url":"https://arxiv.org/pdf/2312.00634v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07594v2","updated":"2023-12-19T03:44:25Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: A Comprehensive Survey on\n Multimodal Large Language Model","summary":" This review paper explores Multimodal Large Language Models (MLLMs), which\nintegrate Large Language Models (LLMs) like GPT-4 to handle multimodal data\nsuch as text and vision. MLLMs demonstrate capabilities like generating image\nnarratives and answering image-based questions, bridging the gap towards\nreal-world human-computer interactions and hinting at a potential pathway to\nartificial general intelligence. However, MLLMs still face challenges in\nprocessing the semantic gap in multimodality, which may lead to erroneous\ngeneration, posing potential risks to society. Choosing the appropriate\nmodality alignment method is crucial, as improper methods might require more\nparameters with limited performance improvement. This paper aims to explore\nmodality alignment methods for LLMs and their existing capabilities.\nImplementing modality alignment allows LLMs to address environmental issues and\nenhance accessibility. The study surveys existing modal alignment methods in\nMLLMs into four groups: (1) Multimodal Converters that change data into\nsomething LLMs can understand; (2) Multimodal Perceivers to improve how LLMs\nperceive different types of data; (3) Tools Assistance for changing data into\none common format, usually text; and (4) Data-Driven methods that teach LLMs to\nunderstand specific types of data in a dataset. This field is still in a phase\nof exploration and experimentation, and we will organize and update various\nexisting research methods for multimodal information alignment.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11829v1","updated":"2023-12-19T03:39:56Z","published":"2023-12-19T03:39:56Z","title":"RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering\n Assisted Distillation","summary":" 3D occupancy prediction is an emerging task that aims to estimate the\noccupancy states and semantics of 3D scenes using multi-view images. However,\nimage-based scene perception encounters significant challenges in achieving\naccurate prediction due to the absence of geometric priors. In this paper, we\naddress this issue by exploring cross-modal knowledge distillation in this\ntask, i.e., we leverage a stronger multi-modal model to guide the visual model\nduring training. In practice, we observe that directly applying features or\nlogits alignment, proposed and widely used in bird's-eyeview (BEV) perception,\ndoes not yield satisfactory results. To overcome this problem, we introduce\nRadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.\nBy employing differentiable volume rendering, we generate depth and semantic\nmaps in perspective views and propose two novel consistency criteria between\nthe rendered outputs of teacher and student models. Specifically, the depth\nconsistency loss aligns the termination distributions of the rendered rays,\nwhile the semantic consistency loss mimics the intra-segment similarity guided\nby vision foundation models (VLMs). Experimental results on the nuScenes\ndataset demonstrate the effectiveness of our proposed method in improving\nvarious 3D occupancy prediction approaches, e.g., our proposed methodology\nenhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D\nbenchmark.\n","authors":["Haiming Zhang","Xu Yan","Dongfeng Bai","Jiantao Gao","Pan Wang","Bingbing Liu","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2312.11829v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2310.04780v5","updated":"2023-12-19T03:39:00Z","published":"2023-10-07T11:45:33Z","title":"IPMix: Label-Preserving Data Augmentation Method for Training Robust\n Classifiers","summary":" Data augmentation has been proven effective for training high-accuracy\nconvolutional neural network classifiers by preventing overfitting. However,\nbuilding deep neural networks in real-world scenarios requires not only high\naccuracy on clean data but also robustness when data distributions shift. While\nprior methods have proposed that there is a trade-off between accuracy and\nrobustness, we propose IPMix, a simple data augmentation approach to improve\nrobustness without hurting clean accuracy. IPMix integrates three levels of\ndata augmentation (image-level, patch-level, and pixel-level) into a coherent\nand label-preserving technique to increase the diversity of training data with\nlimited computational overhead. To further improve the robustness, IPMix\nintroduces structural complexity at different levels to generate more diverse\nimages and adopts the random mixing method for multi-scale information fusion.\nExperiments demonstrate that IPMix outperforms state-of-the-art corruption\nrobustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also\nsignificantly improves the other safety measures, including robustness to\nadversarial perturbations, calibration, prediction consistency, and anomaly\ndetection, achieving state-of-the-art or comparable results on several\nbenchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.\n","authors":["Zhenglin Huang","Xianan Bao","Na Zhang","Qingqi Zhang","Xiaomei Tu","Biao Wu","Xi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.04780v5.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.11826v1","updated":"2023-12-19T03:32:10Z","published":"2023-12-19T03:32:10Z","title":"Decoupled Textual Embeddings for Customized Image Generation","summary":" Customized text-to-image generation, which aims to learn user-specified\nconcepts with a few images, has drawn significant attention recently. However,\nexisting methods usually suffer from overfitting issues and entangle the\nsubject-unrelated information (e.g., background and pose) with the learned\nconcept, limiting the potential to compose concept into new scenes. To address\nthese issues, we propose the DETEX, a novel approach that learns the\ndisentangled concept embedding for flexible customized text-to-image\ngeneration. Unlike conventional methods that learn a single concept embedding\nfrom the given images, our DETEX represents each image using multiple word\nembeddings during training, i.e., a learnable image-shared subject embedding\nand several image-specific subject-unrelated embeddings. To decouple irrelevant\nattributes (i.e., background and pose) from the subject embedding, we further\npresent several attribute mappers that encode each image as several\nimage-specific subject-unrelated embeddings. To encourage these unrelated\nembeddings to capture the irrelevant information, we incorporate them with\ncorresponding attribute words and propose a joint training strategy to\nfacilitate the disentanglement. During inference, we only use the subject\nembedding for image generation, while selectively using image-specific\nembeddings to retain image-specified attributes. Extensive experiments\ndemonstrate that the subject embedding obtained by our method can faithfully\nrepresent the target concept, while showing superior editability compared to\nthe state-of-the-art methods. Our code will be made published available.\n","authors":["Yufei Cai","Yuxiang Wei","Zhilong Ji","Jinfeng Bai","Hu Han","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2312.11826v1.pdf","comment":"16 pages, 16 figures"},{"id":"http://arxiv.org/abs/2312.11816v1","updated":"2023-12-19T03:15:50Z","published":"2023-12-19T03:15:50Z","title":"A Dual-way Enhanced Framework from Text Matching Point of View for\n Multimodal Entity Linking","summary":" Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with\nmultimodal information to entity in Knowledge Graph (KG) such as Wikipedia,\nwhich plays a key role in many applications. However, existing methods suffer\nfrom shortcomings, including modality impurity such as noise in raw image and\nambiguous textual entity representation, which puts obstacles to MEL. We\nformulate multimodal entity linking as a neural text matching problem where\neach multimodal information (text and image) is treated as a query, and the\nmodel learns the mapping from each query to the relevant entity from candidate\nentities. This paper introduces a dual-way enhanced (DWE) framework for MEL:\n(1) our model refines queries with multimodal data and addresses semantic gaps\nusing cross-modal enhancers between text and image information. Besides, DWE\ninnovatively leverages fine-grained image attributes, including facial\ncharacteristic and scene feature, to enhance and refine visual features. (2)By\nusing Wikipedia descriptions, DWE enriches entity semantics and obtains more\ncomprehensive textual representation, which reduces between textual\nrepresentation and the entities in KG. Extensive experiments on three public\nbenchmarks demonstrate that our method achieves state-of-the-art (SOTA)\nperformance, indicating the superiority of our model. The code is released on\nhttps://github.com/season1blue/DWE\n","authors":["Shezheng Song","Shan Zhao","Chengyu Wang","Tianwei Yan","Shasha Li","Xiaoguang Mao","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11816v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10422v2","updated":"2023-12-19T03:12:59Z","published":"2023-12-16T11:31:34Z","title":"Learning Dense Correspondence for NeRF-Based Face Reenactment","summary":" Face reenactment is challenging due to the need to establish dense\ncorrespondence between various face representations for motion transfer. Recent\nstudies have utilized Neural Radiance Field (NeRF) as fundamental\nrepresentation, which further enhanced the performance of multi-view face\nreenactment in photo-realism and 3D consistency. However, establishing dense\ncorrespondence between different face NeRFs is non-trivial, because implicit\nrepresentations lack ground-truth correspondence annotations like mesh-based 3D\nparametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning\n3DMM space with NeRF-based face representations can realize motion control, it\nis sub-optimal for their limited face-only modeling and low identity fidelity.\nTherefore, we are inspired to ask: Can we learn the dense correspondence\nbetween different NeRF-based face representations without a 3D parametric model\nprior? To address this challenge, we propose a novel framework, which adopts\ntri-planes as fundamental NeRF representation and decomposes face tri-planes\ninto three components: canonical tri-planes, identity deformations, and motion.\nIn terms of motion control, our key contribution is proposing a Plane\nDictionary (PlaneDict) module, which efficiently maps the motion conditions to\na linear weighted addition of learnable orthogonal plane bases. To the best of\nour knowledge, our framework is the first method that achieves one-shot\nmulti-view face reenactment without a 3D parametric model prior. Extensive\nexperiments demonstrate that we produce better results in fine-grained motion\ncontrol and identity preservation than previous methods.\n","authors":["Songlin Yang","Wei Wang","Yushi Lan","Xiangyu Fan","Bo Peng","Lei Yang","Jing Dong"],"pdf_url":"https://arxiv.org/pdf/2312.10422v2.pdf","comment":"Accepted by Proceedings of the AAAI Conference on Artificial\n Intelligence, 2024"},{"id":"http://arxiv.org/abs/2312.11812v1","updated":"2023-12-19T03:01:31Z","published":"2023-12-19T03:01:31Z","title":"Advancements and Challenges in Arabic Optical Character Recognition: A\n Comprehensive Survey","summary":" Optical character recognition (OCR) is a vital process that involves the\nextraction of handwritten or printed text from scanned or printed images,\nconverting it into a format that can be understood and processed by machines.\nThis enables further data processing activities such as searching and editing.\nThe automatic extraction of text through OCR plays a crucial role in digitizing\ndocuments, enhancing productivity, improving accessibility, and preserving\nhistorical records. This paper seeks to offer an exhaustive review of\ncontemporary applications, methodologies, and challenges associated with Arabic\nOptical Character Recognition (OCR). A thorough analysis is conducted on\nprevailing techniques utilized throughout the OCR process, with a dedicated\neffort to discern the most efficacious approaches that demonstrate enhanced\noutcomes. To ensure a thorough evaluation, a meticulous keyword-search\nmethodology is adopted, encompassing a comprehensive analysis of articles\nrelevant to Arabic OCR, including both backward and forward citation reviews.\nIn addition to presenting cutting-edge techniques and methods, this paper\ncritically identifies research gaps within the realm of Arabic OCR. By\nhighlighting these gaps, we shed light on potential areas for future\nexploration and development, thereby guiding researchers toward promising\navenues in the field of Arabic OCR. The outcomes of this study provide valuable\ninsights for researchers, practitioners, and stakeholders involved in Arabic\nOCR, ultimately fostering advancements in the field and facilitating the\ncreation of more accurate and efficient OCR systems for the Arabic language.\n","authors":["Mahmoud SalahEldin Kasem","Mohamed Mahmoud","Hyun-Soo Kang"],"pdf_url":"https://arxiv.org/pdf/2312.11812v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07021v2","updated":"2023-12-19T02:46:50Z","published":"2023-12-12T07:15:17Z","title":"Transferring Modality-Aware Pedestrian Attentive Learning for\n Visible-Infrared Person Re-identification","summary":" Visible-infrared person re-identification (VI-ReID) aims to search the same\npedestrian of interest across visible and infrared modalities. Existing models\nmainly focus on compensating for modality-specific information to reduce\nmodality variation. However, these methods often lead to a higher computational\noverhead and may introduce interfering information when generating the\ncorresponding images or features. To address this issue, it is critical to\nleverage pedestrian-attentive features and learn modality-complete and\n-consistent representation. In this paper, a novel Transferring Modality-Aware\nPedestrian Attentive Learning (TMPA) model is proposed, focusing on the\npedestrian regions to efficiently compensate for missing modality-specific\nfeatures. Specifically, we propose a region-based data augmentation module\nPedMix to enhance pedestrian region coherence by mixing the corresponding\nregions from different modalities. A lightweight hybrid compensation module,\ni.e., the Modality Feature Transfer (MFT), is devised to integrate cross\nattention and convolution networks to fully explore the discriminative\nmodality-complete features with minimal computational overhead. Extensive\nexperiments conducted on the benchmark SYSU-MM01 and RegDB datasets\ndemonstrated the effectiveness of our proposed TMPA model.\n","authors":["Yuwei Guo","Wenhao Zhang","Licheng Jiao","Shuang Wang","Shuo Wang","Fang Liu"],"pdf_url":"https://arxiv.org/pdf/2312.07021v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11805v1","updated":"2023-12-19T02:39:27Z","published":"2023-12-19T02:39:27Z","title":"Gemini: A Family of Highly Capable Multimodal Models","summary":" This report introduces a new family of multimodal models, Gemini, that\nexhibit remarkable capabilities across image, audio, video, and text\nunderstanding. The Gemini family consists of Ultra, Pro, and Nano sizes,\nsuitable for applications ranging from complex reasoning tasks to on-device\nmemory-constrained use-cases. Evaluation on a broad range of benchmarks shows\nthat our most-capable Gemini Ultra model advances the state of the art in 30 of\n32 of these benchmarks - notably being the first model to achieve human-expert\nperformance on the well-studied exam benchmark MMLU, and improving the state of\nthe art in every one of the 20 multimodal benchmarks we examined. We believe\nthat the new capabilities of Gemini models in cross-modal reasoning and\nlanguage understanding will enable a wide variety of use cases and we discuss\nour approach toward deploying them responsibly to users.\n","authors":[" Gemini Team","Rohan Anil","Sebastian Borgeaud","Yonghui Wu","Jean-Baptiste Alayrac","Jiahui Yu","Radu Soricut","Johan Schalkwyk","Andrew M. Dai","Anja Hauth","Katie Millican","David Silver","Slav Petrov","Melvin Johnson","Ioannis Antonoglou","Julian Schrittwieser","Amelia Glaese","Jilin Chen","Emily Pitler","Timothy Lillicrap","Angeliki Lazaridou","Orhan Firat","James Molloy","Michael Isard","Paul R. Barham","Tom Hennigan","Benjamin Lee","Fabio Viola","Malcolm Reynolds","Yuanzhong Xu","Ryan Doherty","Eli Collins","Clemens Meyer","Eliza Rutherford","Erica Moreira","Kareem Ayoub","Megha Goel","George Tucker","Enrique Piqueras","Maxim Krikun","Iain Barr","Nikolay Savinov","Ivo Danihelka","Becca Roelofs","Anaïs White","Anders Andreassen","Tamara von Glehn","Lakshman Yagati","Mehran Kazemi","Lucas Gonzalez","Misha Khalman","Jakub Sygnowski","Alexandre Frechette","Charlotte Smith","Laura Culp","Lev Proleev","Yi Luan","Xi Chen","James Lottes","Nathan Schucher","Federico Lebron","Alban Rrustemi","Natalie Clay","Phil Crone","Tomas Kocisky","Jeffrey Zhao","Bartek Perz","Dian Yu","Heidi Howard","Adam Bloniarz","Jack W. Rae","Han Lu","Laurent Sifre","Marcello Maggioni","Fred Alcober","Dan Garrette","Megan Barnes","Shantanu Thakoor","Jacob Austin","Gabriel Barth-Maron","William Wong","Rishabh Joshi","Rahma Chaabouni","Deeni Fatiha","Arun Ahuja","Ruibo Liu","Yunxuan Li","Sarah Cogan","Jeremy Chen","Chao Jia","Chenjie Gu","Qiao Zhang","Jordan Grimstad","Ale Jakse Hartman","Martin Chadwick","Gaurav Singh Tomar","Xavier Garcia","Evan Senter","Emanuel Taropa","Thanumalayan Sankaranarayana Pillai","Jacob Devlin","Michael Laskin","Diego de Las Casas","Dasha Valter","Connie Tao","Lorenzo Blanco","Adrià Puigdomènech Badia","David Reitter","Mianna Chen","Jenny Brennan","Clara Rivera","Sergey Brin","Shariq Iqbal","Gabriela Surita","Jane Labanowski","Abhi Rao","Stephanie Winkler","Emilio Parisotto","Yiming Gu","Kate Olszewska","Yujing Zhang","Ravi Addanki","Antoine Miech","Annie Louis","Laurent El Shafey","Denis Teplyashin","Geoff Brown","Elliot Catt","Nithya Attaluri","Jan Balaguer","Jackie Xiang","Pidong Wang","Zoe Ashwood","Anton Briukhov","Albert Webson","Sanjay Ganapathy","Smit Sanghavi","Ajay Kannan","Ming-Wei Chang","Axel Stjerngren","Josip Djolonga","Yuting Sun","Ankur Bapna","Matthew Aitchison","Pedram Pejman","Henryk Michalewski","Tianhe Yu","Cindy Wang","Juliette Love","Junwhan Ahn","Dawn Bloxwich","Kehang Han","Peter Humphreys","Thibault Sellam","James Bradbury","Varun Godbole","Sina Samangooei","Bogdan Damoc","Alex Kaskasoli","Sébastien M. R. Arnold","Vijay Vasudevan","Shubham Agrawal","Jason Riesa","Dmitry Lepikhin","Richard Tanburn","Srivatsan Srinivasan","Hyeontaek Lim","Sarah Hodkinson","Pranav Shyam","Johan Ferret","Steven Hand","Ankush Garg","Tom Le Paine","Jian Li","Yujia Li","Minh Giang","Alexander Neitz","Zaheer Abbas","Sarah York","Machel Reid","Elizabeth Cole","Aakanksha Chowdhery","Dipanjan Das","Dominika Rogozińska","Vitaly Nikolaev","Pablo Sprechmann","Zachary Nado","Lukas Zilka","Flavien Prost","Luheng He","Marianne Monteiro","Gaurav Mishra","Chris Welty","Josh Newlan","Dawei Jia","Miltiadis Allamanis","Clara Huiyi Hu","Raoul de Liedekerke","Justin Gilmer","Carl Saroufim","Shruti Rijhwani","Shaobo Hou","Disha Shrivastava","Anirudh Baddepudi","Alex Goldin","Adnan Ozturel","Albin Cassirer","Yunhan Xu","Daniel Sohn","Devendra Sachan","Reinald Kim Amplayo","Craig Swanson","Dessie Petrova","Shashi Narayan","Arthur Guez","Siddhartha Brahma","Jessica Landon","Miteyan Patel","Ruizhe Zhao","Kevin Villela","Luyu Wang","Wenhao Jia","Matthew Rahtz","Mai Giménez","Legg Yeung","Hanzhao Lin","James Keeling","Petko Georgiev","Diana Mincu","Boxi Wu","Salem Haykal","Rachel Saputro","Kiran Vodrahalli","James Qin","Zeynep Cankara","Abhanshu Sharma","Nick Fernando","Will Hawkins","Behnam Neyshabur","Solomon Kim","Adrian Hutter","Priyanka Agrawal","Alex Castro-Ros","George van den Driessche","Tao Wang","Fan Yang","Shuo-yiin Chang","Paul Komarek","Ross McIlroy","Mario Lučić","Guodong Zhang","Wael Farhan","Michael Sharman","Paul Natsev","Paul Michel","Yong Cheng","Yamini Bansal","Siyuan Qiao","Kris Cao","Siamak Shakeri","Christina Butterfield","Justin Chung","Paul Kishan Rubenstein","Shivani Agrawal","Arthur Mensch","Kedar Soparkar","Karel Lenc","Timothy Chung","Aedan Pope","Loren Maggiore","Jackie Kay","Priya Jhakra","Shibo Wang","Joshua Maynez","Mary Phuong","Taylor Tobin","Andrea Tacchetti","Maja Trebacz","Kevin Robinson","Yash Katariya","Sebastian Riedel","Paige Bailey","Kefan Xiao","Nimesh Ghelani","Lora Aroyo","Ambrose Slone","Neil Houlsby","Xuehan Xiong","Zhen Yang","Elena Gribovskaya","Jonas Adler","Mateo Wirth","Lisa Lee","Music Li","Thais Kagohara","Jay Pavagadhi","Sophie Bridgers","Anna Bortsova","Sanjay Ghemawat","Zafarali Ahmed","Tianqi Liu","Richard Powell","Vijay Bolina","Mariko Iinuma","Polina Zablotskaia","James Besley","Da-Woon Chung","Timothy Dozat","Ramona Comanescu","Xiance Si","Jeremy Greer","Guolong Su","Martin Polacek","Raphaël Lopez Kaufman","Simon Tokumine","Hexiang Hu","Elena Buchatskaya","Yingjie Miao","Mohamed Elhawaty","Aditya Siddhant","Nenad Tomasev","Jinwei Xing","Christina Greer","Helen Miller","Shereen Ashraf","Aurko Roy","Zizhao Zhang","Ada Ma","Angelos Filos","Milos Besta","Rory Blevins","Ted Klimenko","Chih-Kuan Yeh","Soravit Changpinyo","Jiaqi Mu","Oscar Chang","Mantas Pajarskas","Carrie Muir","Vered Cohen","Charline Le Lan","Krishna Haridasan","Amit Marathe","Steven Hansen","Sholto Douglas","Rajkumar Samuel","Mingqiu Wang","Sophia Austin","Chang Lan","Jiepu Jiang","Justin Chiu","Jaime Alonso Lorenzo","Lars Lowe Sjösund","Sébastien Cevey","Zach Gleicher","Thi Avrahami","Anudhyan Boral","Hansa Srinivasan","Vittorio Selo","Rhys May","Konstantinos Aisopos","Léonard Hussenot","Livio Baldini Soares","Kate Baumli","Michael B. Chang","Adrià Recasens","Ben Caine","Alexander Pritzel","Filip Pavetic","Fabio Pardo","Anita Gergely","Justin Frye","Vinay Ramasesh","Dan Horgan","Kartikeya Badola","Nora Kassner","Subhrajit Roy","Ethan Dyer","Víctor Campos","Alex Tomala","Yunhao Tang","Dalia El Badawy","Elspeth White","Basil Mustafa","Oran Lang","Abhishek Jindal","Sharad Vikram","Zhitao Gong","Sergi Caelles","Ross Hemsley","Gregory Thornton","Fangxiaoyu Feng","Wojciech Stokowiec","Ce Zheng","Phoebe Thacker","Çağlar Ünlü","Zhishuai Zhang","Mohammad Saleh","James Svensson","Max Bileschi","Piyush Patil","Ankesh Anand","Roman Ring","Katerina Tsihlas","Arpi Vezer","Marco Selvi","Toby Shevlane","Mikel Rodriguez","Tom Kwiatkowski","Samira Daruki","Keran Rong","Allan Dafoe","Nicholas FitzGerald","Keren Gu-Lemberg","Mina Khan","Lisa Anne Hendricks","Marie Pellat","Vladimir Feinberg","James Cobon-Kerr","Tara Sainath","Maribeth Rauh","Sayed Hadi Hashemi","Richard Ives","Yana Hasson","YaGuang Li","Eric Noland","Yuan Cao","Nathan Byrd","Le Hou","Qingze Wang","Thibault Sottiaux","Michela Paganini","Jean-Baptiste Lespiau","Alexandre Moufarek","Samer Hassan","Kaushik Shivakumar","Joost van Amersfoort","Amol Mandhane","Pratik Joshi","Anirudh Goyal","Matthew Tung","Andrew Brock","Hannah Sheahan","Vedant Misra","Cheng Li","Nemanja Rakićević","Mostafa Dehghani","Fangyu Liu","Sid Mittal","Junhyuk Oh","Seb Noury","Eren Sezener","Fantine Huot","Matthew Lamm","Nicola De Cao","Charlie Chen","Gamaleldin Elsayed","Ed Chi","Mahdis Mahdieh","Ian Tenney","Nan Hua","Ivan Petrychenko","Patrick Kane","Dylan Scandinaro","Rishub Jain","Jonathan Uesato","Romina Datta","Adam Sadovsky","Oskar Bunyan","Dominik Rabiej","Shimu Wu","John Zhang","Gautam Vasudevan","Edouard Leurent","Mahmoud Alnahlawi","Ionut Georgescu","Nan Wei","Ivy Zheng","Betty Chan","Pam G Rabinovitch","Piotr Stanczyk","Ye Zhang","David Steiner","Subhajit Naskar","Michael Azzam","Matthew Johnson","Adam Paszke","Chung-Cheng Chiu","Jaume Sanchez Elias","Afroz Mohiuddin","Faizan Muhammad","Jin Miao","Andrew Lee","Nino Vieillard","Sahitya Potluri","Jane Park","Elnaz Davoodi","Jiageng Zhang","Jeff Stanway","Drew Garmon","Abhijit Karmarkar","Zhe Dong","Jong Lee","Aviral Kumar","Luowei Zhou","Jonathan Evens","William Isaac","Zhe Chen","Johnson Jia","Anselm Levskaya","Zhenkai Zhu","Chris Gorgolewski","Peter Grabowski","Yu Mao","Alberto Magni","Kaisheng Yao","Javier Snaider","Norman Casagrande","Paul Suganthan","Evan Palmer","Geoffrey Irving","Edward Loper","Manaal Faruqui","Isha Arkatkar","Nanxin Chen","Izhak Shafran","Michael Fink","Alfonso Castaño","Irene Giannoumis","Wooyeol Kim","Mikołaj Rybiński","Ashwin Sreevatsa","Jennifer Prendki","David Soergel","Adrian Goedeckemeyer","Willi Gierke","Mohsen Jafari","Meenu Gaba","Jeremy Wiesner","Diana Gage Wright","Yawen Wei","Harsha Vashisht","Yana Kulizhskaya","Jay Hoover","Maigo Le","Lu Li","Chimezie Iwuanyanwu","Lu Liu","Kevin Ramirez","Andrey Khorlin","Albert Cui","Tian LIN","Marin Georgiev","Marcus Wu","Ricardo Aguilar","Keith Pallo","Abhishek Chakladar","Alena Repina","Xihui Wu","Tom van der Weide","Priya Ponnapalli","Caroline Kaplan","Jiri Simsa","Shuangfeng Li","Olivier Dousse","Fan Yang","Jeff Piper","Nathan Ie","Minnie Lui","Rama Pasumarthi","Nathan Lintz","Anitha Vijayakumar","Lam Nguyen Thiet","Daniel Andor","Pedro Valenzuela","Cosmin Paduraru","Daiyi Peng","Katherine Lee","Shuyuan Zhang","Somer Greene","Duc Dung Nguyen","Paula Kurylowicz","Sarmishta Velury","Sebastian Krause","Cassidy Hardin","Lucas Dixon","Lili Janzer","Kiam Choo","Ziqiang Feng","Biao Zhang","Achintya Singhal","Tejasi Latkar","Mingyang Zhang","Quoc Le","Elena Allica Abellan","Dayou Du","Dan McKinnon","Natasha Antropova","Tolga Bolukbasi","Orgad Keller","David Reid","Daniel Finchelstein","Maria Abi Raad","Remi Crocker","Peter Hawkins","Robert Dadashi","Colin Gaffney","Sid Lall","Ken Franko","Egor Filonov","Anna Bulanova","Rémi Leblond","Vikas Yadav","Shirley Chung","Harry Askham","Luis C. Cobo","Kelvin Xu","Felix Fischer","Jun Xu","Christina Sorokin","Chris Alberti","Chu-Cheng Lin","Colin Evans","Hao Zhou","Alek Dimitriev","Hannah Forbes","Dylan Banarse","Zora Tung","Jeremiah Liu","Mark Omernick","Colton Bishop","Chintu Kumar","Rachel Sterneck","Ryan Foley","Rohan Jain","Swaroop Mishra","Jiawei Xia","Taylor Bos","Geoffrey Cideron","Ehsan Amid","Francesco Piccinno","Xingyu Wang","Praseem Banzal","Petru Gurita","Hila Noga","Premal Shah","Daniel J. Mankowitz","Alex Polozov","Nate Kushman","Victoria Krakovna","Sasha Brown","MohammadHossein Bateni","Dennis Duan","Vlad Firoiu","Meghana Thotakuri","Tom Natan","Anhad Mohananey","Matthieu Geist","Sidharth Mudgal","Sertan Girgin","Hui Li","Jiayu Ye","Ofir Roval","Reiko Tojo","Michael Kwong","James Lee-Thorp","Christopher Yew","Quan Yuan","Sumit Bagri","Danila Sinopalnikov","Sabela Ramos","John Mellor","Abhishek Sharma","Aliaksei Severyn","Jonathan Lai","Kathy Wu","Heng-Tze Cheng","David Miller","Nicolas Sonnerat","Denis Vnukov","Rory Greig","Jennifer Beattie","Emily Caveness","Libin Bai","Julian Eisenschlos","Alex Korchemniy","Tomy Tsai","Mimi Jasarevic","Weize Kong","Phuong Dao","Zeyu Zheng","Frederick Liu","Fan Yang","Rui Zhu","Mark Geller","Tian Huey Teh","Jason Sanmiya","Evgeny Gladchenko","Nejc Trdin","Andrei Sozanschi","Daniel Toyama","Evan Rosen","Sasan Tavakkol","Linting Xue","Chen Elkind","Oliver Woodman","John Carpenter","George Papamakarios","Rupert Kemp","Sushant Kafle","Tanya Grunina","Rishika Sinha","Alice Talbert","Abhimanyu Goyal","Diane Wu","Denese Owusu-Afriyie","Cosmo Du","Chloe Thornton","Jordi Pont-Tuset","Pradyumna Narayana","Jing Li","Sabaer Fatehi","John Wieting","Omar Ajmeri","Benigno Uria","Tao Zhu","Yeongil Ko","Laura Knight","Amélie Héliou","Ning Niu","Shane Gu","Chenxi Pang","Dustin Tran","Yeqing Li","Nir Levine","Ariel Stolovich","Norbert Kalb","Rebeca Santamaria-Fernandez","Sonam Goenka","Wenny Yustalim","Robin Strudel","Ali Elqursh","Balaji Lakshminarayanan","Charlie Deck","Shyam Upadhyay","Hyo Lee","Mike Dusenberry","Zonglin Li","Xuezhi Wang","Kyle Levin","Raphael Hoffmann","Dan Holtmann-Rice","Olivier Bachem","Summer Yue","Sho Arora","Eric Malmi","Daniil Mirylenka","Qijun Tan","Christy Koh","Soheil Hassas Yeganeh","Siim Põder","Steven Zheng","Francesco Pongetti","Mukarram Tariq","Yanhua Sun","Lucian Ionita","Mojtaba Seyedhosseini","Pouya Tafti","Ragha Kotikalapudi","Zhiyu Liu","Anmol Gulati","Jasmine Liu","Xinyu Ye","Bart Chrzaszcz","Lily Wang","Nikhil Sethi","Tianrun Li","Ben Brown","Shreya Singh","Wei Fan","Aaron Parisi","Joe Stanton","Chenkai Kuang","Vinod Koverkathu","Christopher A. Choquette-Choo","Yunjie Li","TJ Lu","Abe Ittycheriah","Prakash Shroff","Pei Sun","Mani Varadarajan","Sanaz Bahargam","Rob Willoughby","David Gaddy","Ishita Dasgupta","Guillaume Desjardins","Marco Cornero","Brona Robenek","Bhavishya Mittal","Ben Albrecht","Ashish Shenoy","Fedor Moiseev","Henrik Jacobsson","Alireza Ghaffarkhah","Morgane Rivière","Alanna Walton","Clément Crepy","Alicia Parrish","Yuan Liu","Zongwei Zhou","Clement Farabet","Carey Radebaugh","Praveen Srinivasan","Claudia van der Salm","Andreas Fidjeland","Salvatore Scellato","Eri Latorre-Chimoto","Hanna Klimczak-Plucińska","David Bridson","Dario de Cesare","Tom Hudson","Piermaria Mendolicchio","Lexi Walker","Alex Morris","Ivo Penchev","Matthew Mauger","Alexey Guseynov","Alison Reid","Seth Odoom","Lucia Loher","Victor Cotruta","Madhavi Yenugula","Dominik Grewe","Anastasia Petrushkina","Tom Duerig","Antonio Sanchez","Steve Yadlowsky","Amy Shen","Amir Globerson","Adam Kurzrok","Lynette Webb","Sahil Dua","Dong Li","Preethi Lahoti","Surya Bhupatiraju","Dan Hurt","Haroon Qureshi","Ananth Agarwal","Tomer Shani","Matan Eyal","Anuj Khare","Shreyas Rammohan Belle","Lei Wang","Chetan Tekur","Mihir Sanjay Kale","Jinliang Wei","Ruoxin Sang","Brennan Saeta","Tyler Liechty","Yi Sun","Yao Zhao","Stephan Lee","Pandu Nayak","Doug Fritz","Manish Reddy Vuyyuru","John Aslanides","Nidhi Vyas","Martin Wicke","Xiao Ma","Taylan Bilal","Evgenii Eltyshev","Daniel Balle","Nina Martin","Hardie Cate","James Manyika","Keyvan Amiri","Yelin Kim","Xi Xiong","Kai Kang","Florian Luisier","Nilesh Tripuraneni","David Madras","Mandy Guo","Austin Waters","Oliver Wang","Joshua Ainslie","Jason Baldridge","Han Zhang","Garima Pruthi","Jakob Bauer","Feng Yang","Riham Mansour","Jason Gelman","Yang Xu","George Polovets","Ji Liu","Honglong Cai","Warren Chen","XiangHai Sheng","Emily Xue","Sherjil Ozair","Adams Yu","Christof Angermueller","Xiaowei Li","Weiren Wang","Julia Wiesinger","Emmanouil Koukoumidis","Yuan Tian","Anand Iyer","Madhu Gurumurthy","Mark Goldenson","Parashar Shah","MK Blake","Hongkun Yu","Anthony Urbanowicz","Jennimaria Palomaki","Chrisantha Fernando","Kevin Brooks","Ken Durden","Harsh Mehta","Nikola Momchev","Elahe Rahimtoroghi","Maria Georgaki","Amit Raul","Sebastian Ruder","Morgan Redshaw","Jinhyuk Lee","Komal Jalan","Dinghua Li","Ginger Perng","Blake Hechtman","Parker Schuh","Milad Nasr","Mia Chen","Kieran Milan","Vladimir Mikulik","Trevor Strohman","Juliana Franco","Tim Green","Demis Hassabis","Koray Kavukcuoglu","Jeffrey Dean","Oriol Vinyals"],"pdf_url":"https://arxiv.org/pdf/2312.11805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11793v1","updated":"2023-12-19T02:09:38Z","published":"2023-12-19T02:09:38Z","title":"An effective image copy-move forgery detection using entropy image","summary":" Image forensics has become increasingly important in our daily lives. As a\nfundamental type of forgeries, Copy-Move Forgery Detection (CMFD) has received\nsignificant attention in the academic community. Keypoint-based algorithms,\nparticularly those based on SIFT, have achieved good results in CMFD. However,\nthe most of keypoint detection algorithms often fail to generate sufficient\nmatches when tampered patches are present in smooth areas. To tackle this\nproblem, we introduce entropy images to determine the coordinates and scales of\nkeypoints, resulting significantly increasing the number of keypoints.\nFurthermore, we develop an entropy level clustering algorithm to avoid\nincreased matching complexity caused by non-ideal distribution of grayscale\nvalues in keypoints. Experimental results demonstrate that our algorithm\nachieves a good balance between performance and time efficiency.\n","authors":["Zhaowei Lu","Li Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.11793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10300v2","updated":"2023-12-19T02:04:18Z","published":"2023-12-16T03:17:30Z","title":"Shot2Story20K: A New Benchmark for Comprehensive Understanding of\n Multi-shot Videos","summary":" A short clip of video may contain progression of multiple events and an\ninteresting story line. A human need to capture both the event in every shot\nand associate them together to understand the story behind it. In this work, we\npresent a new multi-shot video understanding benchmark Shot2Story20K with\ndetailed shot-level captions and comprehensive video summaries. To facilitate\nbetter semantic understanding of videos, we provide captions for both visual\nsignals and human narrations. We design several distinct tasks including\nsingle-shot video and narration captioning, multi-shot video summarization, and\nvideo retrieval with shot descriptions. Preliminary experiments show some\nchallenges to generate a long and comprehensive video summary. Nevertheless,\nthe generated imperfect summaries can already significantly boost the\nperformance of existing video understanding tasks such as video\nquestion-answering, promoting an under-explored setting of video understanding\nwith detailed summaries.\n","authors":["Mingfei Han","Linjie Yang","Xiaojun Chang","Heng Wang"],"pdf_url":"https://arxiv.org/pdf/2312.10300v2.pdf","comment":"See https://mingfei.info/shot2story for updates and more information"},{"id":"http://arxiv.org/abs/2306.12681v3","updated":"2023-12-19T02:03:44Z","published":"2023-06-22T05:55:53Z","title":"One at a Time: Progressive Multi-step Volumetric Probability Learning\n for Reliable 3D Scene Perception","summary":" Numerous studies have investigated the pivotal role of reliable 3D volume\nrepresentation in scene perception tasks, such as multi-view stereo (MVS) and\nsemantic scene completion (SSC). They typically construct 3D probability\nvolumes directly with geometric correspondence, attempting to fully address the\nscene perception tasks in a single forward pass. However, such a single-step\nsolution makes it hard to learn accurate and convincing volumetric probability,\nespecially in challenging regions like unexpected occlusions and complicated\nlight reflections. Therefore, this paper proposes to decompose the complicated\n3D volume representation learning into a sequence of generative steps to\nfacilitate fine and reliable scene perception. Considering the recent advances\nachieved by strong generative diffusion models, we introduce a multi-step\nlearning framework, dubbed as VPD, dedicated to progressively refining the\nVolumetric Probability in a Diffusion process. Extensive experiments are\nconducted on scene perception tasks including multi-view stereo (MVS) and\nsemantic scene completion (SSC), to validate the efficacy of our method in\nlearning reliable volumetric representations. Notably, for the SSC task, our\nwork stands out as the first to surpass LiDAR-based methods on the\nSemanticKITTI dataset.\n","authors":["Bohan Li","Yasheng Sun","Jingxin Dong","Zheng Zhu","Jinming Liu","Xin Jin","Wenjun Zeng"],"pdf_url":"https://arxiv.org/pdf/2306.12681v3.pdf","comment":"AAAI2024"},{"id":"http://arxiv.org/abs/2312.10088v2","updated":"2023-12-19T01:44:13Z","published":"2023-12-13T05:32:52Z","title":"On Robustness to Missing Video for Audiovisual Speech Recognition","summary":" It has been shown that learning audiovisual features can lead to improved\nspeech recognition performance over audio-only features, especially for noisy\nspeech. However, in many common applications, the visual features are partially\nor entirely missing, e.g.~the speaker might move off screen. Multi-modal models\nneed to be robust: missing video frames should not degrade the performance of\nan audiovisual model to be worse than that of a single-modality audio-only\nmodel. While there have been many attempts at building robust models, there is\nlittle consensus on how robustness should be evaluated. To address this, we\nintroduce a framework that allows claims about robustness to be evaluated in a\nprecise and testable way. We also conduct a systematic empirical study of the\nrobustness of common audiovisual speech recognition architectures on a range of\nacoustic noise conditions and test suites. Finally, we show that an\narchitecture-agnostic solution based on cascades can consistently achieve\nrobustness to missing video, even in settings where existing techniques for\nrobustness like dropout fall short.\n","authors":["Oscar Chang","Otavio Braga","Hank Liao","Dmitriy Serdyuk","Olivier Siohan"],"pdf_url":"https://arxiv.org/pdf/2312.10088v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11782v1","updated":"2023-12-19T01:33:46Z","published":"2023-12-19T01:33:46Z","title":"Learning Object State Changes in Videos: An Open-World Perspective","summary":" Object State Changes (OSCs) are pivotal for video understanding. While humans\ncan effortlessly generalize OSC understanding from familiar to unknown objects,\ncurrent approaches are confined to a closed vocabulary. Addressing this gap, we\nintroduce a novel open-world formulation for the video OSC problem. The goal is\nto temporally localize the three stages of an OSC -- the object's initial\nstate, its transitioning state, and its end state -- whether or not the object\nhas been observed during training. Towards this end, we develop VidOSC, a\nholistic learning approach that: (1) leverages text and vision-language models\nfor supervisory signals to obviate manually labeling OSC training data, and (2)\nabstracts fine-grained shared state representations from objects to enhance\ngeneralization. Furthermore, we present HowToChange, the first open-world\nbenchmark for video OSC localization, which offers an order of magnitude\nincrease in the label space and annotation volume compared to the best existing\nbenchmark. Experimental results demonstrate the efficacy of our approach, in\nboth traditional closed-world and open-world scenarios.\n","authors":["Zihui Xue","Kumar Ashutosh","Kristen Grauman"],"pdf_url":"https://arxiv.org/pdf/2312.11782v1.pdf","comment":"Project website: https://vision.cs.utexas.edu/projects/VidOSC/"},{"id":"http://arxiv.org/abs/2207.14513v2","updated":"2023-12-19T01:31:29Z","published":"2022-07-29T07:21:15Z","title":"Uncertainty-Driven Action Quality Assessment","summary":" Automatic action quality assessment (AQA) has attracted increasing attention\ndue to its wide applications. However, most existing AQA methods employ\ndeterministic models to predict the final score for each action, while\noverlooking the subjectivity and diversity among expert judges during the\nscoring process. In this paper, we propose a novel probabilistic model, named\nUncertainty-Driven AQA (UD-AQA), to utilize and capture the diversity among\nmultiple judge scores. Specifically, we design a Conditional Variational\nAuto-Encoder (CVAE)-based module to encode the uncertainty in expert\nassessment, where multiple judge scores can be produced by sampling latent\nfeatures from the learned latent space multiple times. To further utilize the\nuncertainty, we generate the estimation of uncertainty for each prediction,\nwhich is employed to re-weight AQA regression loss, effectively reducing the\ninfluence of uncertain samples during training. Moreover, we further design an\nuncertainty-guided training strategy to dynamically adjust the learning order\nof the samples from low uncertainty to high uncertainty. The experiments show\nthat our proposed method achieves competitive results on three benchmarks\nincluding the Olympic events MTL-AQA and FineDiving, and the surgical skill\nJIGSAWS datasets.\n","authors":["Caixia Zhou","Yaping Huang","Haibin Ling"],"pdf_url":"https://arxiv.org/pdf/2207.14513v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11775v1","updated":"2023-12-19T01:10:11Z","published":"2023-12-19T01:10:11Z","title":"Towards SAMBA: Segment Anything Model for Brain Tumor Segmentation in\n Sub-Sharan African Populations","summary":" Gliomas, the most prevalent primary brain tumors, require precise\nsegmentation for diagnosis and treatment planning. However, this task poses\nsignificant challenges, particularly in the African population, were limited\naccess to high-quality imaging data hampers algorithm performance. In this\nstudy, we propose an innovative approach combining the Segment Anything Model\n(SAM) and a voting network for multi-modal glioma segmentation. By fine-tuning\nSAM with bounding box-guided prompts (SAMBA), we adapt the model to the\ncomplexities of African datasets. Our ensemble strategy, utilizing multiple\nmodalities and views, produces a robust consensus segmentation, addressing\nintra-tumoral heterogeneity. Although the low quality of scans presents\ndifficulties, our methodology has the potential to profoundly impact clinical\npractice in resource-limited settings such as Africa, improving treatment\ndecisions and advancing neuro-oncology research. Furthermore, successful\napplication to other brain tumor types and lesions in the future holds promise\nfor a broader transformation in neurological imaging, improving healthcare\noutcomes across all settings. This study was conducted on the Brain Tumor\nSegmentation (BraTS) Challenge Africa (BraTS-Africa) dataset, which provides a\nvaluable resource for addressing challenges specific to resource-limited\nsettings, particularly the African population, and facilitating the development\nof effective and more generalizable segmentation algorithms. To illustrate our\napproach's potential, our experiments on the BraTS-Africa dataset yielded\ncompelling results, with SAM attaining a Dice coefficient of 86.6 for binary\nsegmentation and 60.4 for multi-class segmentation.\n","authors":["Mohannad Barakat","Noha Magdy","Jjuuko George William","Ethel Phiri","Raymond Confidence","Dong Zhang","Udunna C Anazodo"],"pdf_url":"https://arxiv.org/pdf/2312.11775v1.pdf","comment":"13 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.11774v1","updated":"2023-12-19T01:09:49Z","published":"2023-12-19T01:09:49Z","title":"Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation","summary":" By lifting the pre-trained 2D diffusion models into Neural Radiance Fields\n(NeRFs), text-to-3D generation methods have made great progress. Many\nstate-of-the-art approaches usually apply score distillation sampling (SDS) to\noptimize the NeRF representations, which supervises the NeRF optimization with\npre-trained text-conditioned 2D diffusion models such as Imagen. However, the\nsupervision signal provided by such pre-trained diffusion models only depends\non text prompts and does not constrain the multi-view consistency. To inject\nthe cross-view consistency into diffusion priors, some recent works finetune\nthe 2D diffusion model with multi-view data, but still lack fine-grained view\ncoherence. To tackle this challenge, we incorporate multi-view image conditions\ninto the supervision signal of NeRF optimization, which explicitly enforces\nfine-grained view consistency. With such stronger supervision, our proposed\ntext-to-3D method effectively mitigates the generation of floaters (due to\nexcessive densities) and completely empty spaces (due to insufficient\ndensities). Our quantitative evaluations on the T$^3$Bench dataset demonstrate\nthat our method achieves state-of-the-art performance over existing text-to-3D\nmethods. We will make the code publicly available.\n","authors":["Yuze He","Yushi Bai","Matthieu Lin","Jenny Sheng","Yubin Hu","Qi Wang","Yu-Hui Wen","Yong-Jin Liu"],"pdf_url":"https://arxiv.org/pdf/2312.11774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11772v1","updated":"2023-12-19T01:07:36Z","published":"2023-12-19T01:07:36Z","title":"CAManim: Animating end-to-end network activation maps","summary":" Deep neural networks have been widely adopted in numerous domains due to\ntheir high performance and accessibility to developers and application-specific\nend-users. Fundamental to image-based applications is the development of\nConvolutional Neural Networks (CNNs), which possess the ability to\nautomatically extract features from data. However, comprehending these complex\nmodels and their learned representations, which typically comprise millions of\nparameters and numerous layers, remains a challenge for both developers and\nend-users. This challenge arises due to the absence of interpretable and\ntransparent tools to make sense of black-box models. There exists a growing\nbody of Explainable Artificial Intelligence (XAI) literature, including a\ncollection of methods denoted Class Activation Maps (CAMs), that seek to\ndemystify what representations the model learns from the data, how it informs a\ngiven prediction, and why it, at times, performs poorly in certain tasks. We\npropose a novel XAI visualization method denoted CAManim that seeks to\nsimultaneously broaden and focus end-user understanding of CNN predictions by\nanimating the CAM-based network activation maps through all layers, effectively\ndepicting from end-to-end how a model progressively arrives at the final layer\nactivation. Herein, we demonstrate that CAManim works with any CAM-based method\nand various CNN architectures. Beyond qualitative model assessments, we\nadditionally propose a novel quantitative assessment that expands upon the\nRemove and Debias (ROAD) metric, pairing the qualitative end-to-end network\nvisual explanations assessment with our novel quantitative \"yellow brick ROAD\"\nassessment (ybROAD). This builds upon prior research to address the increasing\ndemand for interpretable, robust, and transparent model assessment methodology,\nultimately improving an end-user's trust in a given model's predictions.\n","authors":["Emily Kaczmarek","Olivier X. Miguel","Alexa C. Bowie","Robin Ducharme","Alysha L. J. Dingwall-Harvey","Steven Hawken","Christine M. Armour","Mark C. Walker","Kevin Dick"],"pdf_url":"https://arxiv.org/pdf/2312.11772v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11770v1","updated":"2023-12-19T01:03:19Z","published":"2023-12-19T01:03:19Z","title":"Bridging the Gap: Generalising State-of-the-Art U-Net Models to\n Sub-Saharan African Populations","summary":" A critical challenge for tumour segmentation models is the ability to adapt\nto diverse clinical settings, particularly when applied to poor-quality\nneuroimaging data. The uncertainty surrounding this adaptation stems from the\nlack of representative datasets, leaving top-performing models without exposure\nto common artifacts found in MRI data throughout Sub-Saharan Africa (SSA). We\nreplicated a framework that secured the 2nd position in the 2022 BraTS\ncompetition to investigate the impact of dataset composition on model\nperformance and pursued four distinct approaches through training a model with:\n1) BraTS-Africa data only (train_SSA, N=60), 2) BraTS-Adult Glioma data only\n(train_GLI, N=1251), 3) both datasets together (train_ALL, N=1311), and 4)\nthrough further training the train_GLI model with BraTS-Africa data\n(train_ftSSA). Notably, training on a smaller low-quality dataset alone\n(train_SSA) yielded subpar results, and training on a larger high-quality\ndataset alone (train_GLI) struggled to delineate oedematous tissue in the\nlow-quality validation set. The most promising approach (train_ftSSA) involved\npre-training a model on high-quality neuroimages and then fine-tuning it on the\nsmaller, low-quality dataset. This approach outperformed the others, ranking\nsecond in the MICCAI BraTS Africa global challenge external testing phase.\nThese findings underscore the significance of larger sample sizes and broad\nexposure to data in improving segmentation performance. Furthermore, we\ndemonstrated that there is potential for improving such models by fine-tuning\nthem with a wider range of data locally.\n","authors":["Alyssa R. Amod","Alexandra Smith","Pearly Joubert","Confidence Raymond","Dong Zhang","Udunna C. Anazodo","Dodzi Motchon","Tinashe E. M. Mutsvangwa","Sébastien Quetin"],"pdf_url":"https://arxiv.org/pdf/2312.11770v1.pdf","comment":"14 pages, 5 figures, 3 tables"},{"id":"http://arxiv.org/abs/2312.11763v1","updated":"2023-12-19T00:17:34Z","published":"2023-12-19T00:17:34Z","title":"ADMM-MM Algorithm for General Tensor Decomposition","summary":" In this paper, we propose a new unified optimization algorithm for general\ntensor decomposition which is formulated as an inverse problem for low-rank\ntensors in the general linear observation models. The proposed algorithm\nsupports three basic loss functions ($\\ell_2$-loss, $\\ell_1$-loss and KL\ndivergence) and various low-rank tensor decomposition models (CP, Tucker, TT,\nand TR decompositions). We derive the optimization algorithm based on\nhierarchical combination of the alternating direction method of multiplier\n(ADMM) and majorization-minimization (MM). We show that wide-range applications\ncan be solved by the proposed algorithm, and can be easily extended to any\nestablished tensor decomposition models in a {plug-and-play} manner.\n","authors":["Manabu Mukai","Hidekata Hontani","Tatsuya Yokota"],"pdf_url":"https://arxiv.org/pdf/2312.11763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.12015v2","updated":"2023-12-19T00:07:16Z","published":"2023-05-19T21:59:23Z","title":"Inventing art styles with no artistic training data","summary":" We propose two procedures to create painting styles using models trained only\non natural images, providing objective proof that the model is not plagiarizing\nhuman art styles. In the first procedure we use the inductive bias from the\nartistic medium to achieve creative expression. Abstraction is achieved by\nusing a reconstruction loss. The second procedure uses an additional natural\nimage as inspiration to create a new style. These two procedures make it\npossible to invent new painting styles with no artistic training data. We\nbelieve that our approach can help pave the way for the ethical employment of\ngenerative AI in art, without infringing upon the originality of human\ncreators.\n","authors":["Nilin Abrahamsen","Jiahao Yao"],"pdf_url":"https://arxiv.org/pdf/2305.12015v2.pdf","comment":"updated title"},{"id":"http://arxiv.org/abs/2305.12554v2","updated":"2023-12-19T23:52:51Z","published":"2023-05-21T19:31:56Z","title":"Towards Consistent Stochastic Human Motion Prediction via Motion\n Diffusion","summary":" Stochastic Human Motion Prediction (HMP) aims to predict multiple possible\nupcoming pose sequences based on past human motion trajectories. Although\nprevious approaches have shown impressive performance, they face several\nissues, including complex training processes and a tendency to generate\npredictions that are often inconsistent with the provided history, and\nsometimes even becoming entirely unreasonable. To overcome these issues, we\npropose DiffMotion, an end-to-end diffusion-based stochastic HMP framework.\nDiffMotion's motion predictor is composed of two modules, including (1) a\nTransformer-based network for initial motion reconstruction from corrupted\nmotion, and (2) a Graph Convolutional Network (GCN) to refine the generated\nmotion considering past observations. Our method, facilitated by this novel\nTransformer-GCN module design and a proposed variance scheduler, excels in\npredicting accurate, realistic, and consistent motions, while maintaining an\nappropriate level of diversity. Our results on benchmark datasets show that\nDiffMotion significantly outperforms previous methods in terms of both accuracy\nand fidelity, while demonstrating superior robustness.\n","authors":["Jiarui Sun","Girish Chowdhary"],"pdf_url":"https://arxiv.org/pdf/2305.12554v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12668v1","updated":"2023-12-19T23:48:43Z","published":"2023-12-19T23:48:43Z","title":"Convolutional Channel-wise Competitive Learning for the Forward-Forward\n Algorithm","summary":" The Forward-Forward (FF) Algorithm has been recently proposed to alleviate\nthe issues of backpropagation (BP) commonly used to train deep neural networks.\nHowever, its current formulation exhibits limitations such as the generation of\nnegative data, slower convergence, and inadequate performance on complex tasks.\nIn this paper, we take the main ideas of FF and improve them by leveraging\nchannel-wise competitive learning in the context of convolutional neural\nnetworks for image classification tasks. A layer-wise loss function is\nintroduced that promotes competitive learning and eliminates the need for\nnegative data construction. To enhance both the learning of compositional\nfeatures and feature space partitioning, a channel-wise feature separator and\nextractor block is proposed that complements the competitive learning process.\nOur method outperforms recent FF-based models on image classification tasks,\nachieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST,\nFashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the\nperformance gap between FF learning and BP methods, indicating the potential of\nour proposed approach to learn useful representations in a layer-wise modular\nfashion, enabling more efficient and flexible learning.\n","authors":["Andreas Papachristodoulou","Christos Kyrkou","Stelios Timotheou","Theocharis Theocharides"],"pdf_url":"https://arxiv.org/pdf/2312.12668v1.pdf","comment":"To be published in AAAI 2024, 11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.12664v1","updated":"2023-12-19T23:34:43Z","published":"2023-12-19T23:34:43Z","title":"UnionDet: Union-Level Detector Towards Real-Time Human-Object\n Interaction Detection","summary":" Recent advances in deep neural networks have achieved significant progress in\ndetecting individual objects from an image. However, object detection is not\nsufficient to fully understand a visual scene. Towards a deeper visual\nunderstanding, the interactions between objects, especially humans and objects\nare essential. Most prior works have obtained this information with a bottom-up\napproach, where the objects are first detected and the interactions are\npredicted sequentially by pairing the objects. This is a major bottleneck in\nHOI detection inference time. To tackle this problem, we propose UnionDet, a\none-stage meta-architecture for HOI detection powered by a novel union-level\ndetector that eliminates this additional inference stage by directly capturing\nthe region of interaction. Our one-stage detector for human-object interaction\nshows a significant reduction in interaction prediction time 4x~14x while\noutperforming state-of-the-art methods on two public datasets: V-COCO and\nHICO-DET.\n","authors":["Bumsoo Kim","Taeho Choi","Jaewoo Kang","Hyunwoo J. Kim"],"pdf_url":"https://arxiv.org/pdf/2312.12664v1.pdf","comment":"ECCV 2020"},{"id":"http://arxiv.org/abs/2302.02515v2","updated":"2023-12-19T23:30:52Z","published":"2023-02-06T01:01:00Z","title":"Deep Learning for Time Series Classification and Extrinsic Regression: A\n Current Survey","summary":" Time Series Classification and Extrinsic Regression are important and\nchallenging machine learning tasks. Deep learning has revolutionized natural\nlanguage processing and computer vision and holds great promise in other fields\nsuch as time series analysis where the relevant features must often be\nabstracted from the raw data but are not known a priori. This paper surveys the\ncurrent state of the art in the fast-moving field of deep learning for time\nseries classification and extrinsic regression. We review different network\narchitectures and training methods used for these tasks and discuss the\nchallenges and opportunities when applying deep learning to time series data.\nWe also summarize two critical applications of time series classification and\nextrinsic regression, human activity recognition and satellite earth\nobservation.\n","authors":["Navid Mohammadi Foumani","Lynn Miller","Chang Wei Tan","Geoffrey I. Webb","Germain Forestier","Mahsa Salehi"],"pdf_url":"https://arxiv.org/pdf/2302.02515v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12661v1","updated":"2023-12-19T23:22:47Z","published":"2023-12-19T23:22:47Z","title":"Misalign, Contrast then Distill: Rethinking Misalignments in\n Language-Image Pretraining","summary":" Contrastive Language-Image Pretraining has emerged as a prominent approach\nfor training vision and text encoders with uncurated image-text pairs from the\nweb. To enhance data-efficiency, recent efforts have introduced additional\nsupervision terms that involve random-augmented views of the image. However,\nsince the image augmentation process is unaware of its text counterpart, this\nprocedure could cause various degrees of image-text misalignments during\ntraining. Prior methods either disregarded this discrepancy or introduced\nexternal models to mitigate the impact of misalignments during training. In\ncontrast, we propose a novel metric learning approach that capitalizes on these\nmisalignments as an additional training source, which we term \"Misalign,\nContrast then Distill (MCD)\". Unlike previous methods that treat augmented\nimages and their text counterparts as simple positive pairs, MCD predicts the\ncontinuous scales of misalignment caused by the augmentation. Our extensive\nexperimental results show that our proposed MCD achieves state-of-the-art\ntransferability in multiple classification and retrieval downstream datasets.\n","authors":["Bumsoo Kim","Yeonsik Jo","Jinhyung Kim","Seung Hwan Kim"],"pdf_url":"https://arxiv.org/pdf/2312.12661v1.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2312.12659v1","updated":"2023-12-19T23:11:06Z","published":"2023-12-19T23:11:06Z","title":"Expediting Contrastive Language-Image Pretraining via Self-distilled\n Encoders","summary":" Recent advances in vision language pretraining (VLP) have been largely\nattributed to the large-scale data collected from the web. However, uncurated\ndataset contains weakly correlated image-text pairs, causing data inefficiency.\nTo address the issue, knowledge distillation have been explored at the expense\nof extra image and text momentum encoders to generate teaching signals for\nmisaligned image-text pairs. In this paper, our goal is to resolve the\nmisalignment problem with an efficient distillation framework. To this end, we\npropose ECLIPSE: Expediting Contrastive Language-Image Pretraining with\nSelf-distilled Encoders. ECLIPSE features a distinctive distillation\narchitecture wherein a shared text encoder is utilized between an online image\nencoder and a momentum image encoder. This strategic design choice enables the\ndistillation to operate within a unified projected space of text embedding,\nresulting in better performance. Based on the unified text embedding space,\nECLIPSE compensates for the additional computational cost of the momentum image\nencoder by expediting the online image encoder. Through our extensive\nexperiments, we validate that there is a sweet spot between expedition and\ndistillation where the partial view from the expedited online image encoder\ninteracts complementarily with the momentum teacher. As a result, ECLIPSE\noutperforms its counterparts while achieving substantial acceleration in\ninference speed.\n","authors":["Bumsoo Kim","Jinhyung Kim","Yeonsik Jo","Seung Hwan Kim"],"pdf_url":"https://arxiv.org/pdf/2312.12659v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12653v1","updated":"2023-12-19T22:53:32Z","published":"2023-12-19T22:53:32Z","title":"Diagnosis Of Takotsubo Syndrome By Robust Feature Selection From The\n Complex Latent Space Of DL-based Segmentation Network","summary":" Researchers have shown significant correlations among segmented objects in\nvarious medical imaging modalities and disease related pathologies. Several\nstudies showed that using hand crafted features for disease prediction neglects\nthe immense possibility to use latent features from deep learning (DL) models\nwhich may reduce the overall accuracy of differential diagnosis. However,\ndirectly using classification or segmentation models on medical to learn latent\nfeatures opt out robust feature selection and may lead to overfitting. To fill\nthis gap, we propose a novel feature selection technique using the latent space\nof a segmentation model that can aid diagnosis. We evaluated our method in\ndifferentiating a rare cardiac disease: Takotsubo Syndrome (TTS) from the ST\nelevation myocardial infarction (STEMI) using echocardiogram videos (echo). TTS\ncan mimic clinical features of STEMI in echo and extremely hard to distinguish.\nOur approach shows promising results in differential diagnosis of TTS with 82%\ndiagnosis accuracy beating the previous state-of-the-art (SOTA) approach.\nMoreover, the robust feature selection technique using LASSO algorithm shows\ngreat potential in reducing the redundant features and creates a robust\npipeline for short- and long-term disease prognoses in the downstream analysis.\n","authors":["Fahim Ahmed Zaman","Wahidul Alam","Tarun Kanti Roy","Amanda Chang","Kan Liu","Xiaodong Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12653v1.pdf","comment":"5 pages, 3 figures, conference"},{"id":"http://arxiv.org/abs/2312.12649v1","updated":"2023-12-19T22:50:02Z","published":"2023-12-19T22:50:02Z","title":"Surf-CDM: Score-Based Surface Cold-Diffusion Model For Medical Image\n Segmentation","summary":" Diffusion models have shown impressive performance for image generation,\noften times outperforming other generative models. Since their introduction,\nresearchers have extended the powerful noise-to-image denoising pipeline to\ndiscriminative tasks, including image segmentation. In this work we propose a\nconditional score-based generative modeling framework for medical image\nsegmentation which relies on a parametric surface representation for the\nsegmentation masks. The surface re-parameterization allows the direct\napplication of standard diffusion theory, as opposed to when the mask is\nrepresented as a binary mask. Moreover, we adapted an extended variant of the\ndiffusion technique known as the \"cold-diffusion\" where the diffusion model can\nbe constructed with deterministic perturbations instead of Gaussian noise,\nwhich facilitates significantly faster convergence in the reverse diffusion. We\nevaluated our method on the segmentation of the left ventricle from 65\ntransthoracic echocardiogram videos (2230 echo image frames) and compared its\nperformance to the most popular and widely used image segmentation models. Our\nproposed model not only outperformed the compared methods in terms of\nsegmentation accuracy, but also showed potential in estimating segmentation\nuncertainties for further downstream analyses due to its inherent generative\nnature.\n","authors":["Fahim Ahmed Zaman","Mathews Jacob","Amanda Chang","Kan Liu","Milan Sonka","Xiaodong Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12649v1.pdf","comment":"5 pages, 5 figures, conference"},{"id":"http://arxiv.org/abs/2312.12648v1","updated":"2023-12-19T22:45:57Z","published":"2023-12-19T22:45:57Z","title":"IS-DARTS: Stabilizing DARTS through Precise Measurement on Candidate\n Importance","summary":" Among existing Neural Architecture Search methods, DARTS is known for its\nefficiency and simplicity. This approach applies continuous relaxation of\nnetwork representation to construct a weight-sharing supernet and enables the\nidentification of excellent subnets in just a few GPU days. However,\nperformance collapse in DARTS results in deteriorating architectures filled\nwith parameter-free operations and remains a great challenge to the robustness.\nTo resolve this problem, we reveal that the fundamental reason is the biased\nestimation of the candidate importance in the search space through theoretical\nand experimental analysis, and more precisely select operations via\ninformation-based measurements. Furthermore, we demonstrate that the excessive\nconcern over the supernet and inefficient utilization of data in bi-level\noptimization also account for suboptimal results. We adopt a more realistic\nobjective focusing on the performance of subnets and simplify it with the help\nof the information-based measurements. Finally, we explain theoretically why\nprogressively shrinking the width of the supernet is necessary and reduce the\napproximation error of optimal weights in DARTS. Our proposed method, named\nIS-DARTS, comprehensively improves DARTS and resolves the aforementioned\nproblems. Extensive experiments on NAS-Bench-201 and DARTS-based search space\ndemonstrate the effectiveness of IS-DARTS.\n","authors":["Hongyi He","Longjun Liu","Haonan Zhang","Nanning Zheng"],"pdf_url":"https://arxiv.org/pdf/2312.12648v1.pdf","comment":"accepted by AAAI2024, paper + supplementary, 11 pages"},{"id":"http://arxiv.org/abs/2312.12644v1","updated":"2023-12-19T22:40:51Z","published":"2023-12-19T22:40:51Z","title":"Rotational Augmented Noise2Inverse for Low-dose Computed Tomography\n Reconstruction","summary":" In this work, we present a novel self-supervised method for Low Dose Computed\nTomography (LDCT) reconstruction. Reducing the radiation dose to patients\nduring a CT scan is a crucial challenge since the quality of the reconstruction\nhighly degrades because of low photons or limited measurements. Supervised deep\nlearning methods have shown the ability to remove noise in images but require\naccurate ground truth which can be obtained only by performing additional\nhigh-radiation CT scans. Therefore, we propose a novel self-supervised\nframework for LDCT, in which ground truth is not required for training the\nconvolutional neural network (CNN). Based on the Noise2Inverse (N2I) method, we\nenforce in the training loss the equivariant property of rotation\ntransformation, which is induced by the CT imaging system, to improve the\nquality of the CT image in a lower dose. Numerical and experimental results\nshow that the reconstruction accuracy of N2I with sparse views is degrading\nwhile the proposed rotational augmented Noise2Inverse (RAN2I) method keeps\nbetter image quality over a different range of sampling angles. Finally, the\nquantitative results demonstrate that RAN2I achieves higher image quality\ncompared to N2I, and experimental results of RAN2I on real projection data show\ncomparable performance to supervised learning.\n","authors":["Hang Xu","Alessandro Perelli"],"pdf_url":"https://arxiv.org/pdf/2312.12644v1.pdf","comment":"14 pages, 12 figures, accepted manuscript in IEEE Transactions on\n Radiation and Plasma Medical Sciences"},{"id":"http://arxiv.org/abs/2312.12635v1","updated":"2023-12-19T22:33:42Z","published":"2023-12-19T22:33:42Z","title":"RealCraft: Attention Control as A Solution for Zero-shot Long Video\n Editing","summary":" Although large-scale text-to-image generative models have shown promising\nperformance in synthesizing high-quality images, directly applying these models\nto image editing remains a significant challenge. This challenge is further\namplified in video editing due to the additional dimension of time. Especially\nfor editing real videos as it necessitates maintaining a stable semantic layout\nacross the frames while executing localized edits precisely without disrupting\nthe existing backgrounds. In this paper, we propose \\textit{RealCraft}, an\nattention-control-based method for zero-shot editing in real videos. By\nemploying the object-centric manipulation of cross-attention between prompts\nand frames and spatial-temporal attention within the frames, we achieve precise\nshape-wise editing along with enhanced consistency. Our model can be used\ndirectly with Stable Diffusion and operates without the need for additional\nlocalized information. We showcase our zero-shot attention-control-based method\nacross a range of videos, demonstrating localized, high-fidelity, shape-precise\nand time-consistent editing in videos of various lengths, up to 64 frames.\n","authors":["Shutong Jin","Ruiyu Wang","Florian T. Pokorny"],"pdf_url":"https://arxiv.org/pdf/2312.12635v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12634v1","updated":"2023-12-19T22:33:17Z","published":"2023-12-19T22:33:17Z","title":"MotionScript: Natural Language Descriptions for Expressive 3D Human\n Motions","summary":" This paper proposes MotionScript, a motion-to-text conversion algorithm and\nnatural language representation for human body motions. MotionScript aims to\ndescribe movements in greater detail and with more accuracy than previous\nnatural language approaches. Many motion datasets describe relatively objective\nand simple actions with little variation on the way they are expressed (e.g.\nsitting, walking, dribbling a ball). But for expressive actions that contain a\ndiversity of movements in the class (e.g. being sad, dancing), or for actions\noutside the domain of standard motion capture datasets (e.g. stylistic walking,\nsign-language), more specific and granular natural language descriptions are\nneeded. Our proposed MotionScript descriptions differ from existing natural\nlanguage representations in that it provides direct descriptions in natural\nlanguage instead of simple action labels or high-level human captions. To the\nbest of our knowledge, this is the first attempt at translating 3D motions to\nnatural language descriptions without requiring training data. Our experiments\nshow that when MotionScript representations are used in a text-to-motion neural\ntask, body movements are more accurately reconstructed, and large language\nmodels can be used to generate unseen complex motions.\n","authors":["Payam Jome Yazdian","Eric Liu","Li Cheng","Angelica Lim"],"pdf_url":"https://arxiv.org/pdf/2312.12634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16999v3","updated":"2023-12-19T22:29:46Z","published":"2023-10-25T20:55:07Z","title":"Trust, but Verify: Robust Image Segmentation using Deep Learning","summary":" We describe a method for verifying the output of a deep neural network for\nmedical image segmentation that is robust to several classes of random as well\nas worst-case perturbations i.e. adversarial attacks. This method is based on a\ngeneral approach recently developed by the authors called \"Trust, but Verify\"\nwherein an auxiliary verification network produces predictions about certain\nmasked features in the input image using the segmentation as an input. A\nwell-designed auxiliary network will produce high-quality predictions when the\ninput segmentations are accurate, but will produce low-quality predictions when\nthe segmentations are incorrect. Checking the predictions of such a network\nwith the original image allows us to detect bad segmentations. However, to\nensure the verification method is truly robust, we need a method for checking\nthe quality of the predictions that does not itself rely on a black-box neural\nnetwork. Indeed, we show that previous methods for segmentation evaluation that\ndo use deep neural regression networks are vulnerable to false negatives i.e.\ncan inaccurately label bad segmentations as good. We describe the design of a\nverification network that avoids such vulnerability and present results to\ndemonstrate its robustness compared to previous methods.\n","authors":["Fahim Ahmed Zaman","Xiaodong Wu","Weiyu Xu","Milan Sonka","Raghuraman Mudumbai"],"pdf_url":"https://arxiv.org/pdf/2310.16999v3.pdf","comment":"5 Pages, 8 Figures, conference"},{"id":"http://arxiv.org/abs/2312.06914v2","updated":"2023-12-19T22:16:34Z","published":"2023-12-12T00:54:39Z","title":"Exploring Novel Object Recognition and Spontaneous Location Recognition\n Machine Learning Analysis Techniques in Alzheimer's Mice","summary":" Understanding object recognition patterns in mice is crucial for advancing\nbehavioral neuroscience and has significant implications for human health,\nparticularly in the realm of Alzheimer's research. This study is centered on\nthe development, application, and evaluation of a state-of-the-art\ncomputational pipeline designed to analyze such behaviors, specifically\nfocusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition\n(SLR) tasks. The pipeline integrates three advanced computational models:\nAny-Maze for initial data collection, DeepLabCut for detailed pose estimation,\nand Convolutional Neural Networks (CNNs) for nuanced behavioral classification.\nEmployed across four distinct mouse groups, this pipeline demonstrated high\nlevels of accuracy and robustness. Despite certain challenges like video\nquality limitations and the need for manual calculations, the results affirm\nthe pipeline's efficacy and potential for scalability. The study serves as a\nproof of concept for a multidimensional computational approach to behavioral\nneuroscience, emphasizing the pipeline's versatility and readiness for future,\nmore complex analyses.\n","authors":["Soham Bafana"],"pdf_url":"https://arxiv.org/pdf/2312.06914v2.pdf","comment":"10 Pages. All code used in this research can be found at\n https://github.com/bafanaS/DLC-Object-Recognition-Analysis.git"},{"id":"http://arxiv.org/abs/2307.07063v4","updated":"2023-12-19T22:13:30Z","published":"2023-07-13T21:08:15Z","title":"Bootstrapping Vision-Language Learning with Decoupled Language\n Pre-training","summary":" We present a novel methodology aimed at optimizing the application of frozen\nlarge language models (LLMs) for resource-intensive vision-language (VL)\npre-training. The current paradigm uses visual features as prompts to guide\nlanguage models, with a focus on determining the most relevant visual features\nfor corresponding text. Our approach diverges by concentrating on the language\ncomponent, specifically identifying the optimal prompts to align with visual\nfeatures. We introduce the Prompt-Transformer (P-Former), a model that predicts\nthese ideal prompts, which is trained exclusively on linguistic data, bypassing\nthe need for image-text pairings. This strategy subtly bifurcates the\nend-to-end VL training process into an additional, separate stage. Our\nexperiments reveal that our framework significantly enhances the performance of\na robust image-to-text baseline (BLIP-2), and effectively narrows the\nperformance gap between models trained with either 4M or 129M image-text pairs.\nImportantly, our framework is modality-agnostic and flexible in terms of\narchitectural design, as validated by its successful application in a video\nlearning task using varied base modules. The code will be made available at\nhttps://github.com/yiren-jian/BLIText.\n","authors":["Yiren Jian","Chongyang Gao","Soroush Vosoughi"],"pdf_url":"https://arxiv.org/pdf/2307.07063v4.pdf","comment":"Accepted to NeurIPS 2023 (spotlight). The code is available at\n https://github.com/yiren-jian/BLIText"},{"id":"http://arxiv.org/abs/2303.15413v5","updated":"2023-12-19T22:03:12Z","published":"2023-03-27T17:31:13Z","title":"Debiasing Scores and Prompts of 2D Diffusion for View-consistent\n Text-to-3D Generation","summary":" Existing score-distilling text-to-3D generation techniques, despite their\nconsiderable promise, often encounter the view inconsistency problem. One of\nthe most notable issues is the Janus problem, where the most canonical view of\nan object (\\textit{e.g}., face or head) appears in other views. In this work,\nwe explore existing frameworks for score-distilling text-to-3D generation and\nidentify the main causes of the view inconsistency problem -- the embedded bias\nof 2D diffusion models. Based on these findings, we propose two approaches to\ndebias the score-distillation frameworks for view-consistent text-to-3D\ngeneration. Our first approach, called score debiasing, involves cutting off\nthe score estimated by 2D diffusion models and gradually increasing the\ntruncation value throughout the optimization process. Our second approach,\ncalled prompt debiasing, identifies conflicting words between user prompts and\nview prompts using a language model, and adjusts the discrepancy between view\nprompts and the viewing direction of an object. Our experimental results show\nthat our methods improve the realism of the generated 3D objects by\nsignificantly reducing artifacts and achieve a good trade-off between\nfaithfulness to the 2D diffusion models and 3D consistency with little\noverhead. Our project page is available\nat~\\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.\n","authors":["Susung Hong","Donghoon Ahn","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2303.15413v5.pdf","comment":"Accepted to NeurIPS 2023. Project Page:\n https://susunghong.github.io/Debiased-Score-Distillation-Sampling/"},{"id":"http://arxiv.org/abs/2311.04207v2","updated":"2023-12-19T22:01:10Z","published":"2023-11-07T18:47:28Z","title":"Deep Hashing via Householder Quantization","summary":" Hashing is at the heart of large-scale image similarity search, and recent\nmethods have been substantially improved through deep learning techniques. Such\nalgorithms typically learn continuous embeddings of the data. To avoid a\nsubsequent costly binarization step, a common solution is to employ loss\nfunctions that combine a similarity learning term (to ensure similar images are\ngrouped to nearby embeddings) and a quantization penalty term (to ensure that\nthe embedding entries are close to binarized entries, e.g., -1 or 1). Still,\nthe interaction between these two terms can make learning harder and the\nembeddings worse. We propose an alternative quantization strategy that\ndecomposes the learning problem in two stages: first, perform similarity\nlearning over the embedding space with no quantization; second, find an optimal\northogonal transformation of the embeddings so each coordinate of the embedding\nis close to its sign, and then quantize the transformed embedding through the\nsign function. In the second step, we parametrize orthogonal transformations\nusing Householder matrices to efficiently leverage stochastic gradient descent.\nSince similarity measures are usually invariant under orthogonal\ntransformations, this quantization strategy comes at no cost in terms of\nperformance. The resulting algorithm is unsupervised, fast, hyperparameter-free\nand can be run on top of any existing deep hashing or metric learning\nalgorithm. We provide extensive experimental results showing that this approach\nleads to state-of-the-art performance on widely used image datasets, and,\nunlike other quantization strategies, brings consistent improvements in\nperformance to existing deep hashing algorithms.\n","authors":["Lucas R. Schwengber","Lucas Resende","Paulo Orenstein","Roberto I. Oliveira"],"pdf_url":"https://arxiv.org/pdf/2311.04207v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12619v1","updated":"2023-12-19T21:53:12Z","published":"2023-12-19T21:53:12Z","title":"Hierarchical Vision Transformers for Context-Aware Prostate Cancer\n Grading in Whole Slide Images","summary":" Vision Transformers (ViTs) have ushered in a new era in computer vision,\nshowcasing unparalleled performance in many challenging tasks. However, their\npractical deployment in computational pathology has largely been constrained by\nthe sheer size of whole slide images (WSIs), which result in lengthy input\nsequences. Transformers faced a similar limitation when applied to long\ndocuments, and Hierarchical Transformers were introduced to circumvent it.\nGiven the analogous challenge with WSIs and their inherent hierarchical\nstructure, Hierarchical Vision Transformers (H-ViTs) emerge as a promising\nsolution in computational pathology. This work delves into the capabilities of\nH-ViTs, evaluating their efficiency for prostate cancer grading in WSIs. Our\nresults show that they achieve competitive performance against existing\nstate-of-the-art solutions.\n","authors":["Clément Grisi","Geert Litjens","Jeroen van der Laak"],"pdf_url":"https://arxiv.org/pdf/2312.12619v1.pdf","comment":"Accepted at Medical Imaging meets NeurIPS 2023 workshop"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2312.12430v1","updated":"2023-12-19T18:56:52Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Yang Jun","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12162v1","updated":"2023-12-19T13:51:48Z","published":"2023-12-19T13:51:48Z","title":"PEPT: Expert Finding Meets Personalized Pre-training","summary":" Finding appropriate experts is essential in Community Question Answering\n(CQA) platforms as it enables the effective routing of questions to potential\nusers who can provide relevant answers. The key is to personalized learning\nexpert representations based on their historical answered questions, and\naccurately matching them with target questions. There have been some\npreliminary works exploring the usability of PLMs in expert finding, such as\npre-training expert or question representations. However, these models usually\nlearn pure text representations of experts from histories, disregarding\npersonalized and fine-grained expert modeling. For alleviating this, we present\na personalized pre-training and fine-tuning paradigm, which could effectively\nlearn expert interest and expertise simultaneously. Specifically, in our\npre-training framework, we integrate historical answered questions of one\nexpert with one target question, and regard it as a candidate aware\nexpert-level input unit. Then, we fuse expert IDs into the pre-training for\nguiding the model to model personalized expert representations, which can help\ncapture the unique characteristics and expertise of each individual expert.\nAdditionally, in our pre-training task, we design: 1) a question-level masked\nlanguage model task to learn the relatedness between histories, enabling the\nmodeling of question-level expert interest; 2) a vote-oriented task to capture\nquestion-level expert expertise by predicting the vote score the expert would\nreceive. Through our pre-training framework and tasks, our approach could\nholistically learn expert representations including interests and expertise.\nOur method has been extensively evaluated on six real-world CQA datasets, and\nthe experimental results consistently demonstrate the superiority of our\napproach over competitive baseline methods.\n","authors":["Qiyao Peng","Hongtao Liu","Hongyan Xu","Yinghui Wang","Wenjun Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12162v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.02327v2","updated":"2023-12-19T13:06:54Z","published":"2022-01-07T04:55:45Z","title":"On the Effectiveness of Sampled Softmax Loss for Item Recommendation","summary":" The learning objective plays a fundamental role to build a recommender\nsystem. Most methods routinely adopt either pointwise or pairwise loss to train\nthe model parameters, while rarely pay attention to softmax loss due to its\ncomputational complexity when scaling up to large datasets or intractability\nfor streaming data. The sampled softmax (SSM) loss emerges as an efficient\nsubstitute for softmax loss. Its special case, InfoNCE loss, has been widely\nused in self-supervised learning and exhibited remarkable performance for\ncontrastive learning. Nonetheless, limited recommendation work uses the SSM\nloss as the learning objective. Worse still, none of them explores its\nproperties thoroughly and answers ``Does SSM loss suit for item\nrecommendation?'' and ``What are the conceptual advantages of SSM loss, as\ncompared with the prevalent losses?'', to the best of our knowledge.\n In this work, we aim to offer a better understanding of SSM for item\nrecommendation. Specifically, we first theoretically reveal three\nmodel-agnostic advantages: (1) mitigating popularity bias; (2) mining hard\nnegative samples; and (3) maximizing the ranking metric. However, based on our\nempirical studies, we recognize that the default choice of cosine similarity\nfunction in SSM limits its ability in learning the magnitudes of representation\nvectors. As such, the combinations of SSM with the models that also fall short\nin adjusting magnitudes may result in poor representations. One step further,\nwe provide mathematical proof that message passing schemes in graph convolution\nnetworks can adjust representation magnitude according to node degree, which\nnaturally compensates for the shortcoming of SSM. Extensive experiments on four\nbenchmark datasets justify our analyses, demonstrating the superiority of SSM\nfor item recommendation. Our implementations are available in both TensorFlow\nand PyTorch.\n","authors":["Jiancan Wu","Xiang Wang","Xingyu Gao","Jiawei Chen","Hongcheng Fu","Tianyu Qiu"],"pdf_url":"https://arxiv.org/pdf/2201.02327v2.pdf","comment":"Accepted by TOIS"},{"id":"http://arxiv.org/abs/2312.12111v1","updated":"2023-12-19T12:33:38Z","published":"2023-12-19T12:33:38Z","title":"Designing and Evaluating General-Purpose User Representations Based on\n Behavioral Logs from a Measurement Process Perspective: A Case Study with\n Snapchat","summary":" In human-computer interaction, understanding user behaviors and tailoring\nsystems accordingly is pivotal. To this end, general-purpose user\nrepresentation learning based on behavior logs is emerging as a powerful tool\nin user modeling, offering adaptability to various downstream tasks such as\nitem recommendations and ad conversion prediction, without the need to\nfine-tune the upstream user model. While this methodology has shown promise in\ncontexts like search engines and e-commerce platforms, its fit for instant\nmessaging apps, a cornerstone of modern digital communication, remains largely\nuncharted. These apps, with their distinct interaction patterns, data\nstructures, and user expectations, necessitate specialized attention. We\nexplore this user modeling approach with Snapchat data as a case study.\nFurthermore, we introduce a novel design and evaluation framework rooted in the\nprinciples of the Measurement Process Framework from social science research\nmethodology. Using this new framework, we design a Transformer-based user model\nthat can produce high-quality general-purpose user representations for instant\nmessaging platforms like Snapchat.\n","authors":["Qixiang Fang","Zhihan Zhou","Francesco Barbieri","Yozen Liu","Leonardo Neves","Dong Nguyen","Daniel L. Oberski","Maarten W. Bos","Ron Dotsch"],"pdf_url":"https://arxiv.org/pdf/2312.12111v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12100v1","updated":"2023-12-19T12:22:40Z","published":"2023-12-19T12:22:40Z","title":"VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication\n Recommendation","summary":" We address the medication recommendation problem, which aims to recommend\neffective medications for a patient's current visit by utilizing information\n(e.g., diagnoses and procedures) given at the patient's current and past\nvisits. While there exist a number of recommender systems designed for this\nproblem, we point out that they are challenged in accurately capturing the\nrelation (spec., the degree of relevance) between the current and each of the\npast visits for the patient when obtaining her current health status, which is\nthe basis for recommending medications. To address this limitation, we propose\na novel medication recommendation framework, named VITA, based on the following\ntwo novel ideas: (1) relevant-Visit selectIon; (2) Target-aware Attention.\nThrough extensive experiments using real-world datasets, we demonstrate the\nsuperiority of VITA (spec., up to 5.56% higher accuracy, in terms of Jaccard,\nthan the best competitor) and the effectiveness of its two core ideas. The code\nis available at https://github.com/jhheo0123/VITA.\n","authors":["Taeri Kim","Jiho Heo","Hongil Kim","Kijung Shin","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2312.12100v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11229v2","updated":"2023-12-19T08:01:16Z","published":"2023-12-18T14:23:47Z","title":"CaseGNN: Graph Neural Networks for Legal Case Retrieval with\n Text-Attributed Graphs","summary":" Legal case retrieval is an information retrieval task in the legal domain,\nwhich aims to retrieve relevant cases with a given query case. Recent research\nof legal case retrieval mainly relies on traditional bag-of-words models and\nlanguage models. Although these methods have achieved significant improvement\nin retrieval accuracy, there are still two challenges: (1) Legal structural\ninformation neglect. Previous neural legal case retrieval models mostly encode\nthe unstructured raw text of case into a case representation, which causes the\nlack of important legal structural information in a case and leads to poor case\nrepresentation; (2) Lengthy legal text limitation. When using the powerful\nBERT-based models, there is a limit of input text lengths, which inevitably\nrequires to shorten the input via truncation or division with a loss of legal\ncontext information. In this paper, a graph neural networks-based legal case\nretrieval model, CaseGNN, is developed to tackle these challenges. To\neffectively utilise the legal structural information during encoding, a case is\nfirstly converted into a Text-Attributed Case Graph (TACG), followed by a\ndesigned Edge Graph Attention Layer and a readout function to obtain the case\ngraph representation. The CaseGNN model is optimised with a carefully designed\ncontrastive loss with easy and hard negative sampling. Since the text\nattributes in the case graph come from individual sentences, the restriction of\nusing language models is further avoided without losing the legal context.\nExtensive experiments have been conducted on two benchmarks from COLIEE 2022\nand COLIEE 2023, which demonstrate that CaseGNN outperforms other\nstate-of-the-art legal case retrieval methods. The code has been released on\nhttps://github.com/yanran-tang/CaseGNN.\n","authors":["Yanran Tang","Ruihong Qiu","Yilun Liu","Xue Li","Zi Huang"],"pdf_url":"https://arxiv.org/pdf/2312.11229v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.11518v2","updated":"2023-12-19T07:40:45Z","published":"2023-09-19T09:17:07Z","title":"Ad-load Balancing via Off-policy Learning in a Content Marketplace","summary":" Ad-load balancing is a critical challenge in online advertising systems,\nparticularly in the context of social media platforms, where the goal is to\nmaximize user engagement and revenue while maintaining a satisfactory user\nexperience. This requires the optimization of conflicting objectives, such as\nuser satisfaction and ads revenue. Traditional approaches to ad-load balancing\nrely on static allocation policies, which fail to adapt to changing user\npreferences and contextual factors. In this paper, we present an approach that\nleverages off-policy learning and evaluation from logged bandit feedback. We\nstart by presenting a motivating analysis of the ad-load balancing problem,\nhighlighting the conflicting objectives between user satisfaction and ads\nrevenue. We emphasize the nuances that arise due to user heterogeneity and the\ndependence on the user's position within a session. Based on this analysis, we\ndefine the problem as determining the optimal ad-load for a particular feed\nfetch. To tackle this problem, we propose an off-policy learning framework that\nleverages unbiased estimators such as Inverse Propensity Scoring (IPS) and\nDoubly Robust (DR) to learn and estimate the policy values using offline\ncollected stochastic data. We present insights from online A/B experiments\ndeployed at scale across over 80 million users generating over 200 million\nsessions, where we find statistically significant improvements in both user\nsatisfaction metrics and ads revenue for the platform.\n","authors":["Hitesh Sagtani","Madan Jhawar","Rishabh Mehrotra","Olivier Jeunen"],"pdf_url":"https://arxiv.org/pdf/2309.11518v2.pdf","comment":"Early version presented at the CONSEQUENCES '23 workshop at RecSys\n '23, final version appearing at WSDM '24"},{"id":"http://arxiv.org/abs/2311.16751v2","updated":"2023-12-19T01:00:12Z","published":"2023-11-28T12:50:40Z","title":"MultiCBR: Multi-view Contrastive Learning for Bundle Recommendation","summary":" Bundle recommendation seeks to recommend a bundle of related items to users\nto improve both user experience and the profits of platform. Existing bundle\nrecommendation models have progressed from capturing only user-bundle\ninteractions to the modeling of multiple relations among users, bundles and\nitems. CrossCBR, in particular, incorporates cross-view contrastive learning\ninto a two-view preference learning framework, significantly improving SOTA\nperformance. It does, however, have two limitations: 1) the two-view\nformulation does not fully exploit all the heterogeneous relations among users,\nbundles and items; and 2) the \"early contrast and late fusion\" framework is\nless effective in capturing user preference and difficult to generalize to\nmultiple views. In this paper, we present MultiCBR, a novel Multi-view\nContrastive learning framework for Bundle Recommendation. First, we devise a\nmulti-view representation learning framework capable of capturing all the\nuser-bundle, user-item and bundle-item relations, especially better utilizing\nthe bundle-item affiliations to enhance sparse bundles' representations.\nSecond, we innovatively adopt an \"early fusion and late contrast\" design that\nfirst fuses the multi-view representations before performing self-supervised\ncontrastive learning. In comparison to existing approaches, our framework\nreverses the order of fusion and contrast, introducing the following\nadvantages: 1)our framework is capable of modeling both cross-view and ego-view\npreferences, allowing us to achieve enhanced user preference modeling; and 2)\ninstead of requiring quadratic number of cross-view contrastive losses, we only\nrequire two self-supervised contrastive losses, resulting in minimal extra\ncosts. Experimental results on three public datasets indicate that our method\noutperforms SOTA methods.\n","authors":["Yunshan Ma","Yingzhi He","Xiang Wang","Yinwei Wei","Xiaoyu Du","Yuyangzi Fu","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2311.16751v2.pdf","comment":"fix a typo in Table 2, i.e., the R@20 and N@20 of LightGCL are\n updated"},{"id":"http://arxiv.org/abs/2311.04207v2","updated":"2023-12-19T22:01:10Z","published":"2023-11-07T18:47:28Z","title":"Deep Hashing via Householder Quantization","summary":" Hashing is at the heart of large-scale image similarity search, and recent\nmethods have been substantially improved through deep learning techniques. Such\nalgorithms typically learn continuous embeddings of the data. To avoid a\nsubsequent costly binarization step, a common solution is to employ loss\nfunctions that combine a similarity learning term (to ensure similar images are\ngrouped to nearby embeddings) and a quantization penalty term (to ensure that\nthe embedding entries are close to binarized entries, e.g., -1 or 1). Still,\nthe interaction between these two terms can make learning harder and the\nembeddings worse. We propose an alternative quantization strategy that\ndecomposes the learning problem in two stages: first, perform similarity\nlearning over the embedding space with no quantization; second, find an optimal\northogonal transformation of the embeddings so each coordinate of the embedding\nis close to its sign, and then quantize the transformed embedding through the\nsign function. In the second step, we parametrize orthogonal transformations\nusing Householder matrices to efficiently leverage stochastic gradient descent.\nSince similarity measures are usually invariant under orthogonal\ntransformations, this quantization strategy comes at no cost in terms of\nperformance. The resulting algorithm is unsupervised, fast, hyperparameter-free\nand can be run on top of any existing deep hashing or metric learning\nalgorithm. We provide extensive experimental results showing that this approach\nleads to state-of-the-art performance on widely used image datasets, and,\nunlike other quantization strategies, brings consistent improvements in\nperformance to existing deep hashing algorithms.\n","authors":["Lucas R. Schwengber","Lucas Resende","Paulo Orenstein","Roberto I. Oliveira"],"pdf_url":"https://arxiv.org/pdf/2311.04207v2.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2311.18826v3","updated":"2023-12-19T18:59:34Z","published":"2023-11-30T18:59:05Z","title":"Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal\n Inference","summary":" This manuscript enriches the framework of continuous normalizing flows (CNFs)\nwithin causal inference, primarily to augment the geometric properties of\nparametric submodels used in targeted maximum likelihood estimation (TMLE). By\nintroducing an innovative application of CNFs, we construct a refined series of\nparametric submodels that enable a directed interpolation between the prior\ndistribution $p_0$ and the empirical distribution $p_1$. This proposed\nmethodology serves to optimize the semiparametric efficiency bound in causal\ninference by orchestrating CNFs to align with Wasserstein gradient flows. Our\napproach not only endeavors to minimize the mean squared error in the\nestimation but also imbues the estimators with geometric sophistication,\nthereby enhancing robustness against misspecification. This robustness is\ncrucial, as it alleviates the dependence on the standard $n^{\\frac{1}{4}}$ rate\nfor a doubly-robust perturbation direction in TMLE. By incorporating robust\noptimization principles and differential geometry into the estimators, the\ndeveloped geometry-aware CNFs represent a significant advancement in the\npursuit of doubly robust causal inference.\n","authors":["Kaiwen Hou"],"pdf_url":"https://arxiv.org/pdf/2311.18826v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12433v1","updated":"2023-12-19T18:58:40Z","published":"2023-12-19T18:58:40Z","title":"Tracking Any Object Amodally","summary":" Amodal perception, the ability to comprehend complete object structures from\npartial visibility, is a fundamental skill, even for infants. Its significance\nextends to applications like autonomous driving, where a clear understanding of\nheavily occluded objects is essential. However, modern detection and tracking\nalgorithms often overlook this critical capability, perhaps due to the\nprevalence of modal annotations in most datasets. To address the scarcity of\namodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse\ncategories in thousands of video sequences. Our dataset includes amodal and\nmodal bounding boxes for visible and occluded objects, including objects that\nare partially out-of-frame. To enhance amodal tracking with object permanence,\nwe leverage a lightweight plug-in module, the amodal expander, to transform\nstandard, modal trackers into amodal ones through fine-tuning on a few hundred\nvideo sequences with data augmentation. We achieve a 3.3\\% and 1.6\\%\nimprovement on the detection and tracking of occluded objects on TAO-Amodal.\nWhen evaluated on people, our method produces dramatic improvements of 2x\ncompared to state-of-the-art modal baselines.\n","authors":["Cheng-Yen Hsieh","Tarasha Khurana","Achal Dave","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2312.12433v1.pdf","comment":"Project Page: https://tao-amodal.github.io"},{"id":"http://arxiv.org/abs/2312.12430v1","updated":"2023-12-19T18:56:52Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Yang Jun","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12416v1","updated":"2023-12-19T18:47:30Z","published":"2023-12-19T18:47:30Z","title":"Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image\n Diffusion Models","summary":" The quality of the prompts provided to text-to-image diffusion models\ndetermines how faithful the generated content is to the user's intent, often\nrequiring `prompt engineering'. To harness visual concepts from target images\nwithout prompt engineering, current approaches largely rely on embedding\ninversion by optimizing and then mapping them to pseudo-tokens. However,\nworking with such high-dimensional vector representations is challenging\nbecause they lack semantics and interpretability, and only allow simple vector\noperations when using them. Instead, this work focuses on inverting the\ndiffusion model to obtain interpretable language prompts directly. The\nchallenge of doing this lies in the fact that the resulting optimization\nproblem is fundamentally discrete and the space of prompts is exponentially\nlarge; this makes using standard optimization techniques, such as stochastic\ngradient descent, difficult. To this end, we utilize a delayed projection\nscheme to optimize for prompts representative of the vocabulary space in the\nmodel. Further, we leverage the findings that different timesteps of the\ndiffusion process cater to different levels of detail in an image. The later,\nnoisy, timesteps of the forward diffusion process correspond to the semantic\ninformation, and therefore, prompt inversion in this range provides tokens\nrepresentative of the image semantics. We show that our approach can identify\nsemantically interpretable and meaningful prompts for a target image which can\nbe used to synthesize diverse images with similar content. We further\nillustrate the application of the optimized prompts in evolutionary image\ngeneration and concept removal.\n","authors":["Shweta Mahajan","Tanzila Rahman","Kwang Moo Yi","Leonid Sigal"],"pdf_url":"https://arxiv.org/pdf/2312.12416v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.08645v2","updated":"2023-12-19T18:46:19Z","published":"2022-12-16T18:39:32Z","title":"Efficient Conditionally Invariant Representation Learning","summary":" We introduce the Conditional Independence Regression CovariancE (CIRCE), a\nmeasure of conditional independence for multivariate continuous-valued\nvariables. CIRCE applies as a regularizer in settings where we wish to learn\nneural features $\\varphi(X)$ of data $X$ to estimate a target $Y$, while being\nconditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are\nassumed to be continuous-valued but relatively low dimensional, whereas $X$ and\nits features may be complex and high dimensional. Relevant settings include\ndomain-invariant learning, fairness, and causal learning. The procedure\nrequires just a single ridge regression from $Y$ to kernelized features of $Z$,\nwhich can be done in advance. It is then only necessary to enforce independence\nof $\\varphi(X)$ from residuals of this regression, which is possible with\nattractive estimation properties and consistency guarantees. By contrast,\nearlier measures of conditional feature dependence require multiple regressions\nfor each step of feature learning, resulting in more severe bias and variance,\nand greater computational cost. When sufficiently rich features are used, we\nestablish that CIRCE is zero if and only if $\\varphi(X) \\perp \\!\\!\\! \\perp Z\n\\mid Y$. In experiments, we show superior performance to previous methods on\nchallenging benchmarks, including learning conditionally invariant image\nfeatures.\n","authors":["Roman Pogodin","Namrata Deka","Yazhe Li","Danica J. Sutherland","Victor Veitch","Arthur Gretton"],"pdf_url":"https://arxiv.org/pdf/2212.08645v2.pdf","comment":"ICLR 2023"},{"id":"http://arxiv.org/abs/2308.13304v2","updated":"2023-12-19T18:45:10Z","published":"2023-08-25T11:04:35Z","title":"Rapid Artefact Removal and H&E-Stained Tissue Segmentation","summary":" We present an innovative method for rapidly segmenting hematoxylin and eosin\n(H&E)-stained tissue in whole-slide images (WSIs) that eliminates a wide range\nof undesirable artefacts such as pen marks and scanning artefacts. Our method\ninvolves taking a single-channel representation of a lowmagnification RGB\noverview of the WSI in which the pixel values are bimodally distributed such\nthat H&E-stained tissue is easily distinguished from both background and a wide\nvariety of artefacts. We demonstrate our method on 30 WSIs prepared from a wide\nrange of institutions and WSI digital scanners, each containing substantial\nartefacts, and compare it to segmentations provided by Otsu thresholding and\nHistolab tissue segmentation and pen filtering tools. We found that our method\nsegmented the tissue and fully removed all artefacts in 29 out of 30 WSIs,\nwhereas Otsu thresholding failed to remove any artefacts, and the Histolab pen\nfiltering tools only partially removed the pen marks. The beauty of our\napproach lies in its simplicity: manipulating RGB colour space and using Otsu\nthresholding allows for the segmentation of H&E-stained tissue and the rapid\nremoval of artefacts without the need for machine learning or parameter tuning.\n","authors":["B. A. Schreiber","J. Denholm","F. Jaeckle","M. J. Arends","K. M. Branson","C. -B. Schönlieb","E. J. Soilleux"],"pdf_url":"https://arxiv.org/pdf/2308.13304v2.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.12400v1","updated":"2023-12-19T18:35:33Z","published":"2023-12-19T18:35:33Z","title":"New classes of the greedy-applicable arm feature distributions in the\n sparse linear bandit problem","summary":" We consider the sparse contextual bandit problem where arm feature affects\nreward through the inner product of sparse parameters. Recent studies have\ndeveloped sparsity-agnostic algorithms based on the greedy arm selection\npolicy. However, the analysis of these algorithms requires strong assumptions\non the arm feature distribution to ensure that the greedily selected samples\nare sufficiently diverse; One of the most common assumptions, relaxed symmetry,\nimposes approximate origin-symmetry on the distribution, which cannot allow\ndistributions that has origin-asymmetric support. In this paper, we show that\nthe greedy algorithm is applicable to a wider range of the arm feature\ndistributions from two aspects. Firstly, we show that a mixture distribution\nthat has a greedy-applicable component is also greedy-applicable. Second, we\npropose new distribution classes, related to Gaussian mixture, discrete, and\nradial distribution, for which the sample diversity is guaranteed. The proposed\nclasses can describe distributions with origin-asymmetric support and, in\nconjunction with the first claim, provide theoretical guarantees of the greedy\npolicy for a very wide range of the arm feature distributions.\n","authors":["Koji Ichikawa","Shinji Ito","Daisuke Hatano","Hanna Sumita","Takuro Fukunaga","Naonori Kakimura","Ken-ichi Kawarabayashi"],"pdf_url":"https://arxiv.org/pdf/2312.12400v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.08830v2","updated":"2023-12-19T18:31:20Z","published":"2023-01-20T23:55:30Z","title":"Finding Nash equilibria by minimizing approximate exploitability with\n learned best responses","summary":" There has been substantial progress on finding game-theoretic equilibria.\nMost of that work has focused on games with finite, discrete action spaces.\nHowever, many games involving space, time, money, and other fine-grained\nquantities have continuous action spaces (or are best modeled as such). We\nstudy the problem of finding an approximate Nash equilibrium of games with\ncontinuous action sets. The standard measure of closeness to Nash equilibrium\nis exploitability, which measures how much players can benefit from\nunilaterally changing their strategy. We propose two new methods that minimize\nan approximation of the exploitability with respect to the strategy profile.\nThe first method uses a learned best-response function, which takes the current\nstrategy profile as input and returns candidate best responses for each player.\nThe strategy profile and best-response functions are trained simultaneously,\nwith the former trying to minimize exploitability while the latter tries to\nmaximize it. The second method maintains an ensemble of candidate best\nresponses for each player. In each iteration, the best-performing elements of\neach ensemble are used to update the current strategy profile. The strategy\nprofile and best-response ensembles are simultaneously trained to minimize and\nmaximize the approximate exploitability, respectively. We evaluate our methods\non various continuous games, showing that they outperform prior methods.\n","authors":["Carlos Martin","Tuomas Sandholm"],"pdf_url":"https://arxiv.org/pdf/2301.08830v2.pdf","comment":"arXiv admin note: text overlap with arXiv:1611.01673 by other authors"},{"id":"http://arxiv.org/abs/2309.15188v3","updated":"2023-12-19T18:25:08Z","published":"2023-09-26T18:49:30Z","title":"ICML 2023 Topological Deep Learning Challenge : Design and Results","summary":" This paper presents the computational challenge on topological deep learning\nthat was hosted within the ICML 2023 Workshop on Topology and Geometry in\nMachine Learning. The competition asked participants to provide open-source\nimplementations of topological neural networks from the literature by\ncontributing to the python packages TopoNetX (data processing) and TopoModelX\n(deep learning). The challenge attracted twenty-eight qualifying submissions in\nits two-month duration. This paper describes the design of the challenge and\nsummarizes its main findings.\n","authors":["Mathilde Papillon","Mustafa Hajij","Helen Jenne","Johan Mathe","Audun Myers","Theodore Papamarkou","Ghada Zamzmi","Tolga Birdal","Tamal Dey","Tim Doster","Tegan Emerson","Gurusankar Gopalakrishnan","Devendra Govil","Aldo Guzmán-Sáenz","Henry Kvinge","Neal Livesay","Soham Mukherjee","Shreyas N. Samaga","Karthikeyan Natesan Ramamurthy","Maneel Reddy Karri","Paul Rosen","Sophia Sanborn","Robin Walters","Jens Agerberg","Sadrodin Barikbin","Claudio Battiloro","Gleb Bazhenov","Guillermo Bernardez","Aiden Brent","Sergio Escalera","Simone Fiorellino","Dmitrii Gavrilev","Mohammed Hassanin","Paul Häusner","Odin Hoff Gardaa","Abdelwahed Khamis","Manuel Lecha","German Magai","Tatiana Malygina","Rubén Ballester","Kalyan Nadimpalli","Alexander Nikitin","Abraham Rabinowitz","Alessandro Salatiello","Simone Scardapane","Luca Scofano","Suraj Singh","Jens Sjölund","Pavel Snopov","Indro Spinelli","Lev Telyatnikov","Lucia Testa","Maosheng Yang","Yixiao Yue","Olga Zaghen","Ali Zia","Nina Miolane"],"pdf_url":"https://arxiv.org/pdf/2309.15188v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12369v1","updated":"2023-12-19T18:00:15Z","published":"2023-12-19T18:00:15Z","title":"Chasing Fairness in Graphs: A GNN Architecture Perspective","summary":" There has been significant progress in improving the performance of graph\nneural networks (GNNs) through enhancements in graph data, model architecture\ndesign, and training strategies. For fairness in graphs, recent studies achieve\nfair representations and predictions through either graph data pre-processing\n(e.g., node feature masking, and topology rewiring) or fair training strategies\n(e.g., regularization, adversarial debiasing, and fair contrastive learning).\nHow to achieve fairness in graphs from the model architecture perspective is\nless explored. More importantly, GNNs exhibit worse fairness performance\ncompared to multilayer perception since their model architecture (i.e.,\nneighbor aggregation) amplifies biases. To this end, we aim to achieve fairness\nvia a new GNN architecture. We propose \\textsf{F}air \\textsf{M}essage\n\\textsf{P}assing (FMP) designed within a unified optimization framework for\nGNNs. Notably, FMP \\textit{explicitly} renders sensitive attribute usage in\n\\textit{forward propagation} for node classification task using cross-entropy\nloss without data pre-processing. In FMP, the aggregation is first adopted to\nutilize neighbors' information and then the bias mitigation step explicitly\npushes demographic group node presentation centers together. In this way, FMP\nscheme can aggregate useful information from neighbors and mitigate bias to\nachieve better fairness and prediction tradeoff performance. Experiments on\nnode classification tasks demonstrate that the proposed FMP outperforms several\nbaselines in terms of fairness and accuracy on three real-world datasets. The\ncode is available in {\\url{https://github.com/zhimengj0326/FMP}}.\n","authors":["Zhimeng Jiang","Xiaotian Han","Chao Fan","Zirui Liu","Na Zou","Ali Mostafavi","Xia Hu"],"pdf_url":"https://arxiv.org/pdf/2312.12369v1.pdf","comment":"Accepted by AAAI Conference on Artificial Intelligence (AAAI) 2024.\n arXiv admin note: substantial text overlap with arXiv:2202.04187"},{"id":"http://arxiv.org/abs/2312.05571v2","updated":"2023-12-19T17:48:20Z","published":"2023-12-09T13:20:49Z","title":"Frugal LMs Trained to Invoke Symbolic Solvers Achieve\n Parameter-Efficient Arithmetic Reasoning","summary":" Large Language Models (LLM) exhibit zero-shot mathematical reasoning capacity\nas a behavior emergent with scale, commonly manifesting as chain-of-thoughts\n(CoT) reasoning. However, multiple empirical findings suggest that this prowess\nis exclusive to LLMs with exorbitant sizes (beyond 50 billion parameters).\nMeanwhile, educational neuroscientists suggest that symbolic algebraic\nmanipulation be introduced around the same time as arithmetic word problems to\nmodularize language-to-formulation, symbolic manipulation of the formulation,\nand endgame arithmetic. In this paper, we start with the hypothesis that much\nsmaller LMs, which are weak at multi-step reasoning, can achieve reasonable\narithmetic reasoning if arithmetic word problems are posed as a\nformalize-then-solve task. In our architecture, which we call SYRELM, the LM\nserves the role of a translator to map natural language arithmetic questions\ninto a formal language (FL) description. A symbolic solver then evaluates the\nFL expression to obtain the answer. A small frozen LM, equipped with an\nefficient low-rank adapter, is capable of generating FL expressions that\nincorporate natural language descriptions of the arithmetic problem (e.g.,\nvariable names and their purposes, formal expressions combining variables,\netc.). We adopt policy-gradient reinforcement learning to train the adapted LM,\ninformed by the non-differentiable symbolic solver. This marks a sharp\ndeparture from the recent development in tool-augmented LLMs, in which the\nexternal tools (e.g., calculator, Web search, etc.) are essentially detached\nfrom the learning phase of the LM. SYRELM shows massive improvements (e.g.,\n+30.65 absolute point improvement in accuracy on the SVAMP dataset using GPT-J\n6B model) over base LMs, while keeping our testbed easy to diagnose, interpret\nand within reach of most researchers.\n","authors":["Subhabrata Dutta","Joykirat Singh","Ishan Pandey","Sunny Manchanda","Soumen Chakrabarti","Tanmoy Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2312.05571v2.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2205.15677v4","updated":"2023-12-19T17:42:05Z","published":"2022-05-31T10:35:55Z","title":"Augmentation-Aware Self-Supervision for Data-Efficient GAN Training","summary":" Training generative adversarial networks (GANs) with limited data is\nchallenging because the discriminator is prone to overfitting. Previously\nproposed differentiable augmentation demonstrates improved data efficiency of\ntraining GANs. However, the augmentation implicitly introduces undesired\ninvariance to augmentation for the discriminator since it ignores the change of\nsemantics in the label space caused by data transformation, which may limit the\nrepresentation learning ability of the discriminator and ultimately affect the\ngenerative modeling performance of the generator. To mitigate the negative\nimpact of invariance while inheriting the benefits of data augmentation, we\npropose a novel augmentation-aware self-supervised discriminator that predicts\nthe augmentation parameter of the augmented data. Particularly, the prediction\ntargets of real data and generated data are required to be distinguished since\nthey are different during training. We further encourage the generator to\nadversarially learn from the self-supervised discriminator by generating\naugmentation-predictable real and not fake data. This formulation connects the\nlearning objective of the generator and the arithmetic $-$ harmonic mean\ndivergence under certain assumptions. We compare our method with\nstate-of-the-art (SOTA) methods using the class-conditional BigGAN and\nunconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100,\nFFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate\nsignificant improvements of our method over SOTA methods in training\ndata-efficient GANs.\n","authors":["Liang Hou","Qi Cao","Yige Yuan","Songtao Zhao","Chongyang Ma","Siyuan Pan","Pengfei Wan","Zhongyuan Wang","Huawei Shen","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2205.15677v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.12357v1","updated":"2023-12-19T17:38:26Z","published":"2023-12-19T17:38:26Z","title":"Modeling non-linear Effects with Neural Networks in Relational Event\n Models","summary":" Dynamic networks offer an insight of how relational systems evolve. However,\nmodeling these networks efficiently remains a challenge, primarily due to\ncomputational constraints, especially as the number of observed events grows.\nThis paper addresses this issue by introducing the Deep Relational Event\nAdditive Model (DREAM) as a solution to the computational challenges presented\nby modeling non-linear effects in Relational Event Models (REMs). DREAM relies\non Neural Additive Models to model non-linear effects, allowing each effect to\nbe captured by an independent neural network. By strategically trading\ncomputational complexity for improved memory management and leveraging the\ncomputational capabilities of Graphic Processor Units (GPUs), DREAM efficiently\ncaptures complex non-linear relationships within data. This approach\ndemonstrates the capability of DREAM in modeling dynamic networks and scaling\nto larger networks. Comparisons with traditional REM approaches showcase DREAM\nsuperior computational efficiency. The model potential is further demonstrated\nby an examination of the patent citation network, which contains nearly 8\nmillion nodes and 100 million events.\n","authors":["Edoardo Filippi-Mazzola","Ernst C. Wit"],"pdf_url":"https://arxiv.org/pdf/2312.12357v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.16222v2","updated":"2023-12-19T17:19:49Z","published":"2022-10-28T15:56:55Z","title":"Improving Lipschitz-Constrained Neural Networks by Learning Activation\n Functions","summary":" Lipschitz-constrained neural networks have several advantages over\nunconstrained ones and can be applied to a variety of problems, making them a\ntopic of attention in the deep learning community. Unfortunately, it has been\nshown both theoretically and empirically that they perform poorly when equipped\nwith ReLU activation functions. By contrast, neural networks with learnable\n1-Lipschitz linear splines are known to be more expressive. In this paper, we\nshow that such networks correspond to global optima of a constrained functional\noptimization problem that consists of the training of a neural network composed\nof 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with\nsecond-order total-variation regularization. Further, we propose an efficient\nmethod to train these neural networks. Our numerical experiments show that our\ntrained networks compare favorably with existing 1-Lipschitz neural\narchitectures.\n","authors":["Stanislas Ducotterd","Alexis Goujon","Pakshal Bohra","Dimitris Perdios","Sebastian Neumayer","Michael Unser"],"pdf_url":"https://arxiv.org/pdf/2210.16222v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12345v1","updated":"2023-12-19T17:17:52Z","published":"2023-12-19T17:17:52Z","title":"On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation","summary":" Imitation learning with visual observations is notoriously inefficient when\naddressed with end-to-end behavioural cloning methods. In this paper, we\nexplore an alternative paradigm which decomposes reasoning into three phases.\nFirst, a retrieval phase, which informs the robot what it can do with an\nobject. Second, an alignment phase, which informs the robot where to interact\nwith the object. And third, a replay phase, which informs the robot how to\ninteract with the object. Through a series of real-world experiments on\neveryday tasks, such as grasping, pouring, and inserting objects, we show that\nthis decomposition brings unprecedented learning efficiency, and effective\ninter- and intra-class generalisation. Videos are available at\nhttps://www.robot-learning.uk/retrieval-alignment-replay.\n","authors":["Norman Di Palo","Edward Johns"],"pdf_url":"https://arxiv.org/pdf/2312.12345v1.pdf","comment":"Published in IEEE Robotics and Automation Letters (RA-L). (Accepted\n December 2023)"},{"id":"http://arxiv.org/abs/2312.12339v1","updated":"2023-12-19T17:12:35Z","published":"2023-12-19T17:12:35Z","title":"Value Explicit Pretraining for Goal-Based Transfer Learning","summary":" We propose a method that allows for learning task-agnostic representations\nbased on value function estimates from a sequence of observations where the\nlast frame corresponds to a goal. These representations would learn to relate\nstates across different tasks, based on the temporal distance to the goal\nstate, irrespective of the appearance changes and dynamics. This method could\nbe used to transfer learnt policies/skills to unseen related tasks.\n","authors":["Kiran Lekkala","Henghui Bao","Sumedh Sontakke","Laurent Itti"],"pdf_url":"https://arxiv.org/pdf/2312.12339v1.pdf","comment":"Accepted at CoRL 2023 Workshop on PRL"},{"id":"http://arxiv.org/abs/2311.05587v4","updated":"2023-12-19T17:07:19Z","published":"2023-11-09T18:47:33Z","title":"Bayesian Methods for Media Mix Modelling with shape and funnel effects","summary":" In recent years, significant progress in generative AI has highlighted the\nimportant role of physics-inspired models that utilize advanced mathematical\nconcepts based on fundamental physics principles to enhance artificial\nintelligence capabilities. Among these models, those based on diffusion\nequations have greatly improved image quality. This study aims to explore the\npotential uses of Maxwell-Boltzmann equation, which forms the basis of the\nkinetic theory of gases, and the Michaelis-Menten model in Marketing Mix\nModelling (MMM) applications. We propose incorporating these equations into\nHierarchical Bayesian models to analyse consumer behaviour in the context of\nadvertising. These equation sets excel in accurately describing the random\ndynamics in complex systems like social interactions and consumer-advertising\ninteractions.\n","authors":["Javier Marin"],"pdf_url":"https://arxiv.org/pdf/2311.05587v4.pdf","comment":"Rev. 4, December 2023"},{"id":"http://arxiv.org/abs/2312.08528v2","updated":"2023-12-19T17:07:02Z","published":"2023-12-13T21:34:30Z","title":"auto-sktime: Automated Time Series Forecasting","summary":" In today's data-driven landscape, time series forecasting is pivotal in\ndecision-making across various sectors. Yet, the proliferation of more diverse\ntime series data, coupled with the expanding landscape of available forecasting\nmethods, poses significant challenges for forecasters. To meet the growing\ndemand for efficient forecasting, we introduce auto-sktime, a novel framework\nfor automated time series forecasting. The proposed framework uses the power of\nautomated machine learning (AutoML) techniques to automate the creation of the\nentire forecasting pipeline. The framework employs Bayesian optimization, to\nautomatically construct pipelines from statistical, machine learning (ML) and\ndeep neural network (DNN) models. Furthermore, we propose three essential\nimprovements to adapt AutoML to time series data: First, pipeline templates to\naccount for the different supported forecasting models. Second, a novel\nwarm-starting technique to start the optimization from prior optimization runs.\nThird, we adapt multi-fidelity optimizations to make them applicable to a\nsearch space containing statistical, ML and DNN models. Experimental results on\n64 diverse real-world time series datasets demonstrate the effectiveness and\nefficiency of the framework, outperforming traditional methods while requiring\nminimal human involvement.\n","authors":["Marc-André Zöller","Marius Lindauer","Marco F. Huber"],"pdf_url":"https://arxiv.org/pdf/2312.08528v2.pdf","comment":"Submitted to AISTATS 2024"},{"id":"http://arxiv.org/abs/2211.08494v2","updated":"2023-12-19T17:04:59Z","published":"2022-11-15T20:47:14Z","title":"Who Reviews The Reviewers? A Multi-Level Jury Problem","summary":" We consider the problem of determining a binary ground truth using advice\nfrom a group of independent reviewers (experts) who express their guess about a\nground truth correctly with some independent probability (competence). In this\nsetting, when all reviewers are competent (competence greater than one-half),\nthe Condorcet Jury Theorem tells us that adding more reviewers increases the\noverall accuracy, and if all competences are known, then there exists an\noptimal weighting of the reviewers. However, in practical settings, reviewers\nmay be noisy or incompetent, i.e., competence below half, and the number of\nexperts may be small, so the asymptotic Condorcet Jury Theorem is not\npractically relevant. In such cases we explore appointing one or more chairs\n(judges) who determine the weight of each reviewer for aggregation, creating\nmultiple levels. However, these chairs may be unable to correctly identify the\ncompetence of the reviewers they oversee, and therefore unable to compute the\noptimal weighting. We give conditions when a set of chairs is able to weight\nthe reviewers optimally, and depending on the competence distribution of the\nagents, give results about when it is better to have more chairs or more\nreviewers. Through numerical simulations we show that in some cases it is\nbetter to have more chairs, but in many cases it is better to have more\nreviewers.\n","authors":["Ben Abramowitz","Omer Lev","Nicholas Mattei"],"pdf_url":"https://arxiv.org/pdf/2211.08494v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12337v1","updated":"2023-12-19T17:03:50Z","published":"2023-12-19T17:03:50Z","title":"pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable\n Generalizable 3D Reconstruction","summary":" We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D\nradiance fields parameterized by 3D Gaussian primitives from pairs of images.\nOur model features real-time and memory-efficient rendering for scalable\ntraining as well as fast 3D reconstruction at inference time. To overcome local\nminima inherent to sparse and locally supported representations, we predict a\ndense probability distribution over 3D and sample Gaussian means from that\nprobability distribution. We make this sampling operation differentiable via a\nreparameterization trick, allowing us to back-propagate gradients through the\nGaussian splatting representation. We benchmark our method on wide-baseline\nnovel view synthesis on the real-world RealEstate10k and ACID datasets, where\nwe outperform state-of-the-art light field transformers and accelerate\nrendering by 2.5 orders of magnitude while reconstructing an interpretable and\neditable 3D radiance field.\n","authors":["David Charatan","Sizhe Li","Andrea Tagliasacchi","Vincent Sitzmann"],"pdf_url":"https://arxiv.org/pdf/2312.12337v1.pdf","comment":"Project page: https://pixelsplat.github.io/"},{"id":"http://arxiv.org/abs/2312.12321v1","updated":"2023-12-19T16:47:12Z","published":"2023-12-19T16:47:12Z","title":"Bypassing the Safety Training of Open-Source LLMs with Priming Attacks","summary":" With the recent surge in popularity of LLMs has come an ever-increasing need\nfor LLM safety training. In this paper, we show that SOTA open-source LLMs are\nvulnerable to simple, optimization-free attacks we refer to as $\\textit{priming\nattacks}$, which are easy to execute and effectively bypass alignment from\nsafety training. Our proposed attack improves the Attack Success Rate on\nHarmful Behaviors, as measured by Llama Guard, by up to $3.3\\times$ compared to\nbaselines. Source code and data are available at\nhttps://github.com/uiuc-focal-lab/llm-priming-attacks .\n","authors":["Jason Vega","Isha Chaudhary","Changming Xu","Gagandeep Singh"],"pdf_url":"https://arxiv.org/pdf/2312.12321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12318v1","updated":"2023-12-19T16:43:17Z","published":"2023-12-19T16:43:17Z","title":"An Alternate View on Optimal Filtering in an RKHS","summary":" Kernel Adaptive Filtering (KAF) are mathematically principled methods which\nsearch for a function in a Reproducing Kernel Hilbert Space. While they work\nwell for tasks such as time series prediction and system identification they\nare plagued by a linear relationship between number of training samples and\nmodel size, hampering their use on the very large data sets common in today's\ndata saturated world. Previous methods try to solve this issue by\nsparsification. We describe a novel view of optimal filtering which may provide\na route towards solutions in a RKHS which do not necessarily have this linear\ngrowth in model size. We do this by defining a RKHS in which the time structure\nof a stochastic process is still present. Using correntropy [11], an extension\nof the idea of a covariance function, we create a time based functional which\ndescribes some potentially nonlinear desired mapping function. This form of a\nsolution may provide a fruitful line of research for creating more efficient\nrepresentations of functionals in a RKHS, while theoretically providing\ncomputational complexity in the test set similar to Wiener solution.\n","authors":["Benjamin Colburn","Jose C. Principe","Luis G. Sanchez Giraldo"],"pdf_url":"https://arxiv.org/pdf/2312.12318v1.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2312.12315v1","updated":"2023-12-19T16:39:32Z","published":"2023-12-19T16:39:32Z","title":"Celestial Machine Learning: Discovering the Planarity, Heliocentricity,\n and Orbital Equation of Mars with AI Feynman","summary":" Can a machine or algorithm discover or learn the elliptical orbit of Mars\nfrom astronomical sightings alone? Johannes Kepler required two paradigm shifts\nto discover his First Law regarding the elliptical orbit of Mars. Firstly, a\nshift from the geocentric to the heliocentric frame of reference. Secondly, the\nreduction of the orbit of Mars from a three- to a two-dimensional space. We\nextend AI Feynman, a physics-inspired tool for symbolic regression, to discover\nthe heliocentricity and planarity of Mars' orbit and emulate his discovery of\nKepler's first law.\n","authors":["Zi-Yu Khoo","Gokul Rajiv","Abel Yang","Jonathan Sze Choong Low","Stéphane Bressan"],"pdf_url":"https://arxiv.org/pdf/2312.12315v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09775v2","updated":"2023-12-19T16:35:56Z","published":"2023-12-15T13:28:42Z","title":"A Comparative Evaluation of Additive Separability Tests for\n Physics-Informed Machine Learning","summary":" Many functions characterising physical systems are additively separable. This\nis the case, for instance, of mechanical Hamiltonian functions in physics,\npopulation growth equations in biology, and consumer preference and utility\nfunctions in economics. We consider the scenario in which a surrogate of a\nfunction is to be tested for additive separability. The detection that the\nsurrogate is additively separable can be leveraged to improve further learning.\nHence, it is beneficial to have the ability to test for such separability in\nsurrogates. The mathematical approach is to test if the mixed partial\nderivative of the surrogate is zero; or empirically, lower than a threshold. We\npresent and comparatively and empirically evaluate the eight methods to compute\nthe mixed partial derivative of a surrogate function.\n","authors":["Zi-Yu Khoo","Jonathan Sze Choong Low","Stéphane Bressan"],"pdf_url":"https://arxiv.org/pdf/2312.09775v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.12756v2","updated":"2023-12-19T16:17:04Z","published":"2022-09-21T08:28:43Z","title":"FAL-CUR: Fair Active Learning using Uncertainty and Representativeness\n on Fair Clustering","summary":" Active Learning (AL) techniques have proven to be highly effective in\nreducing data labeling costs across a range of machine learning tasks.\nNevertheless, one known challenge of these methods is their potential to\nintroduce unfairness towards sensitive attributes. Although recent approaches\nhave focused on enhancing fairness in AL, they tend to reduce the model's\naccuracy. To address this issue, we propose a novel strategy, named Fair Active\nLearning using fair Clustering, Uncertainty, and Representativeness (FAL-CUR),\nto improve fairness in AL. FAL-CUR tackles the fairness problem in AL by\ncombining fair clustering with an acquisition function that determines which\nsamples to query based on their uncertainty and representativeness scores. We\nevaluate the performance of FAL-CUR on four real-world datasets, and the\nresults demonstrate that FAL-CUR achieves a 15% - 20% improvement in fairness\ncompared to the best state-of-the-art method in terms of equalized odds while\nmaintaining stable accuracy scores. Furthermore, an ablation study highlights\nthe crucial roles of fair clustering in preserving fairness and the acquisition\nfunction in stabilizing the accuracy performance.\n","authors":["Ricky Fajri","Akrati Saxena","Yulong Pei","Mykola Pechenizkiy"],"pdf_url":"https://arxiv.org/pdf/2209.12756v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.14743v5","updated":"2023-12-19T16:05:51Z","published":"2023-11-21T18:41:26Z","title":"A Baseline Analysis of Reward Models' Ability To Accurately Analyze\n Foundation Models Under Distribution Shift","summary":" Foundation models, specifically Large Language Models (LLM's), have lately\ngained wide-spread attention and adoption. Reinforcement Learning with Human\nFeedback (RLHF) involves training a reward model to capture desired behaviors,\nwhich is then used to align LLM's. These reward models are additionally used at\ninference-time to estimate LLM responses' adherence to those desired behaviors.\nHowever, there is little work measuring how robust these reward models are to\ndistribution shifts. In this work, we evaluate how reward model performance -\nmeasured via accuracy and calibration (i.e. alignment between accuracy and\nconfidence) - is affected by distribution shift. We show novel calibration\npatterns and accuracy drops due to OOD prompts and responses, and that the\nreward model is more sensitive to shifts in responses than prompts.\nAdditionally, we adapt an OOD detection technique commonly used in\nclassification to the reward model setting to detect these distribution shifts\nin prompts and responses.\n","authors":["Will LeVine","Ben Pikus","Tony Chen","Sean Hendryx"],"pdf_url":"https://arxiv.org/pdf/2311.14743v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12276v1","updated":"2023-12-19T15:57:37Z","published":"2023-12-19T15:57:37Z","title":"Prompt-based Domain Discrimination for Multi-source Time Series Domain\n Adaptation","summary":" Time series domain adaptation stands as a pivotal and intricate challenge\nwith diverse applications, including but not limited to human activity\nrecognition, sleep stage classification, and machine fault diagnosis. Despite\nthe numerous domain adaptation techniques proposed to tackle this complex\nproblem, their primary focus has been on the common representations of time\nseries data. This concentration might inadvertently lead to the oversight of\nvaluable domain-specific information originating from different source domains.\nTo bridge this gap, we introduce POND, a novel prompt-based deep learning model\ndesigned explicitly for multi-source time series domain adaptation. POND is\ntailored to address significant challenges, notably: 1) The unavailability of a\nquantitative relationship between meta-data information and time series\ndistributions, and 2) The dearth of exploration into extracting domain-specific\nmeta-data information. In this paper, we present an instance-level prompt\ngenerator and a fidelity loss mechanism to facilitate the faithful learning of\nmeta-data information. Additionally, we propose a domain discrimination\ntechnique to discern domain-specific meta-data information from multiple source\ndomains. Our approach involves a simple yet effective meta-learning algorithm\nto optimize the objective efficiently. Furthermore, we augment the model's\nperformance by incorporating the Mixture of Expert (MoE) technique. The\nefficacy and robustness of our proposed POND model are extensively validated\nthrough experiments across 50 scenarios encompassing five datasets, which\ndemonstrates that our proposed POND model outperforms the state-of-the-art\nmethods by up to $66\\%$ on the F1-score.\n","authors":["Junxiang Wang","Guangji Bai","Wei Cheng","Zhengzhang Chen","Liang Zhao","Haifeng Chen"],"pdf_url":"https://arxiv.org/pdf/2312.12276v1.pdf","comment":"Undergoing work"},{"id":"http://arxiv.org/abs/2312.12275v1","updated":"2023-12-19T15:56:30Z","published":"2023-12-19T15:56:30Z","title":"Emergence of In-Context Reinforcement Learning from Noise Distillation","summary":" In-Context Reinforcement Learning is an emerging field with great potential\nfor advancing Artificial Intelligence. Its core capability lies in generalizing\nto unseen tasks through interaction with the environment. To master these\ncapabilities, an agent must be trained on specifically curated data that\nincludes a policy improvement that an algorithm seeks to extract and then apply\nin context in the environment. However, for numerous tasks, training RL agents\nmay be unfeasible, while obtaining human demonstrations can be relatively easy.\nAdditionally, it is rare to be given the optimal policy, typically, only\nsuboptimal demonstrations are available. We propose $AD^{\\epsilon}$, a method\nthat leverages demonstrations without policy improvement and enables multi-task\nin-context learning in the presence of a suboptimal demonstrator. This is\nachieved by artificially creating a history of incremental improvement, wherein\nnoise is systematically introduced into the demonstrator's policy.\nConsequently, each successive transition illustrates a marginally better\ntrajectory than the previous one. Our approach was tested on the Dark Room and\nDark Key-to-Door environments, resulting in over a $\\textbf{2}$x improvement\ncompared to the best available policy in the data.\n","authors":["Ilya Zisman","Vladislav Kurenkov","Alexander Nikulin","Viacheslav Sinii","Sergey Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2312.12275v1.pdf","comment":"Preprint, work in progress"},{"id":"http://arxiv.org/abs/2312.10237v2","updated":"2023-12-19T15:44:40Z","published":"2023-12-15T22:09:04Z","title":"Vertical Federated Alzheimer's Detection on Multimodal Data","summary":" In the era of rapidly advancing medical technologies, the segmentation of\nmedical data has become inevitable, necessitating the development of privacy\npreserving machine learning algorithms that can train on distributed data.\nConsolidating sensitive medical data is not always an option particularly due\nto the stringent privacy regulations imposed by the Health Insurance\nPortability and Accountability Act (HIPAA). In this paper, we introduce a HIPAA\ncompliant framework that can train from distributed data. We then propose a\nmultimodal vertical federated model for Alzheimer's Disease (AD) detection, a\nserious neurodegenerative condition that can cause dementia, severely impairing\nbrain function and hindering simple tasks, especially without preventative\ncare. This vertical federated model offers a distributed architecture that\nenables collaborative learning across diverse sources of medical data while\nrespecting privacy constraints imposed by HIPAA. It is also able to leverage\nmultiple modalities of data, enhancing the robustness and accuracy of AD\ndetection. Our proposed model not only contributes to the advancement of\nfederated learning techniques but also holds promise for overcoming the hurdles\nposed by data segmentation in medical research. By using vertical federated\nlearning, this research strives to provide a framework that enables healthcare\ninstitutions to harness the collective intelligence embedded in their\ndistributed datasets without compromising patient privacy.\n","authors":["Paul K. Mandal"],"pdf_url":"https://arxiv.org/pdf/2312.10237v2.pdf","comment":"14 pages, 7 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.12258v1","updated":"2023-12-19T15:43:50Z","published":"2023-12-19T15:43:50Z","title":"Inferring the relationship between soil temperature and the normalized\n difference vegetation index with machine learning","summary":" Changes in climate can greatly affect the phenology of plants, which can have\nimportant feedback effects, such as altering the carbon cycle. These\nphenological feedback effects are often induced by a shift in the start or end\ndates of the growing season of plants. The normalized difference vegetation\nindex (NDVI) serves as a straightforward indicator for assessing the presence\nof green vegetation and can also provide an estimation of the plants' growing\nseason. In this study, we investigated the effect of soil temperature on the\ntiming of the start of the season (SOS), timing of the peak of the season\n(POS), and the maximum annual NDVI value (PEAK) in subarctic grassland\necosystems between 2014 and 2019. We also explored the impact of other\nmeteorological variables, including air temperature, precipitation, and\nirradiance, on the inter-annual variation in vegetation phenology. Using\nmachine learning (ML) techniques and SHapley Additive exPlanations (SHAP)\nvalues, we analyzed the relative importance and contribution of each variable\nto the phenological predictions. Our results reveal a significant relationship\nbetween soil temperature and SOS and POS, indicating that higher soil\ntemperatures lead to an earlier start and peak of the growing season. However,\nthe Peak NDVI values showed just a slight increase with higher soil\ntemperatures. The analysis of other meteorological variables demonstrated their\nimpacts on the inter-annual variation of the vegetation phenology. Ultimately,\nthis study contributes to our knowledge of the relationships between soil\ntemperature, meteorological variables, and vegetation phenology, providing\nvaluable insights for predicting vegetation phenology characteristics and\nmanaging subarctic grasslands in the face of climate change. Additionally, this\nwork provides a solid foundation for future ML-based vegetation phenology\nstudies.\n","authors":["Steven Mortier","Amir Hamedpour","Bart Bussmann","Ruth Phoebe Tchana Wandji","Steven Latré","Bjarni D. Sigurdsson","Tom De Schepper","Tim Verdonck"],"pdf_url":"https://arxiv.org/pdf/2312.12258v1.pdf","comment":"31 pages, 7 figures, 5 tables"},{"id":"http://arxiv.org/abs/2312.12255v1","updated":"2023-12-19T15:39:09Z","published":"2023-12-19T15:39:09Z","title":"TaskFlex Solver for Multi-Agent Pursuit via Automatic Curriculum\n Learning","summary":" This paper addresses the problem of multi-agent pursuit, where slow pursuers\ncooperate to capture fast evaders in a confined environment with obstacles.\nExisting heuristic algorithms often lack expressive coordination strategies and\nare highly sensitive to task conditions, requiring extensive hyperparameter\ntuning. In contrast, reinforcement learning (RL) has been applied to this\nproblem and is capable of obtaining cooperative pursuit strategies. However,\nRL-based methods face challenges in training for complex scenarios due to the\nvast amount of training data and limited adaptability to varying task\nconditions, such as different scene sizes, varying numbers and speeds of\nobstacles, and flexible speed ratios of the evader to the pursuer. In this\nwork, we combine RL and curriculum learning to introduce a flexible solver for\nmultiagent pursuit problems, named TaskFlex Solver (TFS), which is capable of\nsolving multi-agent pursuit problems with diverse and dynamically changing task\nconditions in both 2-dimensional and 3-dimensional scenarios. TFS utilizes a\ncurriculum learning method that constructs task distributions based on training\nprogress, enhancing training efficiency and final performance. Our algorithm\nconsists of two main components: the Task Evaluator, which evaluates task\nsuccess rates and selects tasks of moderate difficulty to maintain a curriculum\narchive, and the Task Sampler, which constructs training distributions by\nsampling tasks from the curriculum archive to maximize policy improvement.\nExperiments show that TFS produces much stronger performance than baselines and\nachieves close to 100% capture rates in both 2-dimensional and 3-dimensional\nmulti-agent pursuit problems with diverse and dynamically changing scenes. The\nproject website is at https://sites.google.com/view/tfs-2023.\n","authors":["Jiayu Chen","Guosheng Li","Chao Yu","Xinyi Yang","Botian Xu","Huazhong Yang","Yu Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12255v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06154v3","updated":"2023-12-19T15:37:22Z","published":"2023-06-09T14:49:20Z","title":"HypLL: The Hyperbolic Learning Library","summary":" Deep learning in hyperbolic space is quickly gaining traction in the fields\nof machine learning, multimedia, and computer vision. Deep networks commonly\noperate in Euclidean space, implicitly assuming that data lies on regular\ngrids. Recent advances have shown that hyperbolic geometry provides a viable\nalternative foundation for deep learning, especially when data is hierarchical\nin nature and when working with few embedding dimensions. Currently however, no\naccessible open-source library exists to build hyperbolic network modules akin\nto well-known deep learning libraries. We present HypLL, the Hyperbolic\nLearning Library to bring the progress on hyperbolic deep learning together.\nHypLL is built on top of PyTorch, with an emphasis in its design for\nease-of-use, in order to attract a broad audience towards this new and\nopen-ended research direction. The code is available at:\nhttps://github.com/maxvanspengler/hyperbolic_learning_library.\n","authors":["Max van Spengler","Philipp Wirth","Pascal Mettes"],"pdf_url":"https://arxiv.org/pdf/2306.06154v3.pdf","comment":"ACM Multimedia Open-Source Software Competition 2023"},{"id":"http://arxiv.org/abs/2312.12246v1","updated":"2023-12-19T15:30:10Z","published":"2023-12-19T15:30:10Z","title":"MDD-UNet: Domain Adaptation for Medical Image Segmentation with\n Theoretical Guarantees, a Proof of Concept","summary":" The current state-of-the art techniques for image segmentation are often\nbased on U-Net architectures, a U-shaped encoder-decoder networks with skip\nconnections. Despite the powerful performance, the architecture often does not\nperform well when used on data which has different characteristics than the\ndata it was trained on. Many techniques for improving performance in the\npresence of domain shift have been developed, however typically only have loose\nconnections to the theory of domain adaption. In this work, we propose an\nunsupervised domain adaptation framework for U-Nets with theoretical guarantees\nbased on the Margin Disparity Discrepancy [1] called the MDD-UNet. We evaluate\nthe proposed technique on the task of hippocampus segmentation, and find that\nthe MDD-UNet is able to learn features which are domain-invariant with no\nknowledge about the labels in the target domain. The MDD-UNet improves\nperformance over the standard U-Net on 11 out of 12 combinations of datasets.\nThis work serves as a proof of concept by demonstrating an improvement on the\nU-Net in it's standard form without modern enhancements, which opens up a new\navenue of studying domain adaptation for models with very large hypothesis\nspaces from both methodological and practical perspectives. Code is available\nat https://github.com/asbjrnmunk/mdd-unet.\n","authors":["Asbjørn Munk","Ao Ma","Mads Nielsen"],"pdf_url":"https://arxiv.org/pdf/2312.12246v1.pdf","comment":"Published at NLDL 2024"},{"id":"http://arxiv.org/abs/2307.05152v2","updated":"2023-12-19T15:24:58Z","published":"2023-07-11T10:17:57Z","title":"Fast Neural Network Inference on FPGAs for Triggering on Long-Lived\n Particles at Colliders","summary":" Experimental particle physics demands a sophisticated trigger and acquisition\nsystem capable to efficiently retain the collisions of interest for further\ninvestigation. Heterogeneous computing with the employment of FPGA cards may\nemerge as a trending technology for the triggering strategy of the upcoming\nhigh-luminosity program of the Large Hadron Collider at CERN. In this context,\nwe present two machine-learning algorithms for selecting events where neutral\nlong-lived particles decay within the detector volume studying their accuracy\nand inference time when accelerated on commercially available Xilinx FPGA\naccelerator cards. The inference time is also confronted with a CPU- and\nGPU-based hardware setup. The proposed new algorithms are proven efficient for\nthe considered benchmark physics scenario and their accuracy is found to not\ndegrade when accelerated on the FPGA cards. The results indicate that all\ntested architectures fit within the latency requirements of a second-level\ntrigger farm and that exploiting accelerator technologies for real-time\nprocessing of particle-physics collisions is a promising research field that\ndeserves additional investigations, in particular with machine-learning models\nwith a large number of trainable parameters.\n","authors":["Andrea Coccaro","Francesco Armando Di Bello","Stefano Giagu","Lucrezia Rambelli","Nicola Stocchetti"],"pdf_url":"https://arxiv.org/pdf/2307.05152v2.pdf","comment":"12 pages, 10 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.12237v1","updated":"2023-12-19T15:22:37Z","published":"2023-12-19T15:22:37Z","title":"Roll With the Punches: Expansion and Shrinkage of Soft Label Selection\n for Semi-supervised Fine-Grained Learning","summary":" While semi-supervised learning (SSL) has yielded promising results, the more\nrealistic SSL scenario remains to be explored, in which the unlabeled data\nexhibits extremely high recognition difficulty, e.g., fine-grained visual\nclassification in the context of SSL (SS-FGVC). The increased recognition\ndifficulty on fine-grained unlabeled data spells disaster for pseudo-labeling\naccuracy, resulting in poor performance of the SSL model. To tackle this\nchallenge, we propose Soft Label Selection with Confidence-Aware Clustering\nbased on Class Transition Tracking (SoC) by reconstructing the pseudo-label\nselection process by jointly optimizing Expansion Objective and Shrinkage\nObjective, which is based on a soft label manner. Respectively, the former\nobjective encourages soft labels to absorb more candidate classes to ensure the\nattendance of ground-truth class, while the latter encourages soft labels to\nreject more noisy classes, which is theoretically proved to be equivalent to\nentropy minimization. In comparisons with various state-of-the-art methods, our\napproach demonstrates its superior performance in SS-FGVC. Checkpoints and\nsource code are available at https://github.com/NJUyued/SoC4SS-FGVC.\n","authors":["Yue Duan","Zhen Zhao","Lei Qi","Luping Zhou","Lei Wang","Yinghuan Shi"],"pdf_url":"https://arxiv.org/pdf/2312.12237v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2305.14978v2","updated":"2023-12-19T15:21:24Z","published":"2023-05-24T10:13:13Z","title":"Probabilistic Exponential Integrators","summary":" Probabilistic solvers provide a flexible and efficient framework for\nsimulation, uncertainty quantification, and inference in dynamical systems.\nHowever, like standard solvers, they suffer performance penalties for certain\nstiff systems, where small steps are required not for reasons of numerical\naccuracy but for the sake of stability. This issue is greatly alleviated in\nsemi-linear problems by the probabilistic exponential integrators developed in\nthis paper. By including the fast, linear dynamics in the prior, we arrive at a\nclass of probabilistic integrators with favorable properties. Namely, they are\nproven to be L-stable, and in a certain case reduce to a classic exponential\nintegrator -- with the added benefit of providing a probabilistic account of\nthe numerical error. The method is also generalized to arbitrary non-linear\nsystems by imposing piece-wise semi-linearity on the prior via Jacobians of the\nvector field at the previous estimates, resulting in probabilistic exponential\nRosenbrock methods. We evaluate the proposed methods on multiple stiff\ndifferential equations and demonstrate their improved stability and efficiency\nover established probabilistic solvers. The present contribution thus expands\nthe range of problems that can be effectively tackled within probabilistic\nnumerics.\n","authors":["Nathanael Bosch","Philipp Hennig","Filip Tronarp"],"pdf_url":"https://arxiv.org/pdf/2305.14978v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12236v1","updated":"2023-12-19T15:20:27Z","published":"2023-12-19T15:20:27Z","title":"Generalization Analysis of Machine Learning Algorithms via the\n Worst-Case Data-Generating Probability Measure","summary":" In this paper, the worst-case probability measure over the data is introduced\nas a tool for characterizing the generalization capabilities of machine\nlearning algorithms. More specifically, the worst-case probability measure is a\nGibbs probability measure and the unique solution to the maximization of the\nexpected loss under a relative entropy constraint with respect to a reference\nprobability measure. Fundamental generalization metrics, such as the\nsensitivity of the expected loss, the sensitivity of the empirical risk, and\nthe generalization gap are shown to have closed-form expressions involving the\nworst-case data-generating probability measure. Existing results for the Gibbs\nalgorithm, such as characterizing the generalization gap as a sum of mutual\ninformation and lautum information, up to a constant factor, are recovered. A\nnovel parallel is established between the worst-case data-generating\nprobability measure and the Gibbs algorithm. Specifically, the Gibbs\nprobability measure is identified as a fundamental commonality of the model\nspace and the data space for machine learning algorithms.\n","authors":["Xinying Zou","Samir M. Perlaza","Iñaki Esnaola","Eitan Altman"],"pdf_url":"https://arxiv.org/pdf/2312.12236v1.pdf","comment":"To appear in the Proceedings of the AAAI Conference on Artificial\n Intelligence (7 + 2 pages)"},{"id":"http://arxiv.org/abs/2312.12230v1","updated":"2023-12-19T15:15:52Z","published":"2023-12-19T15:15:52Z","title":"It's All in the Mix: Wasserstein Machine Learning with Mixed Features","summary":" Problem definition: The recent advent of data-driven and end-to-end\ndecision-making across different areas of operations management has led to an\never closer integration of prediction models from machine learning and\noptimization models from operations research. A key challenge in this context\nis the presence of estimation errors in the prediction models, which tend to be\namplified by the subsequent optimization model -- a phenomenon that is often\nreferred to as the Optimizer's Curse or the Error-Maximization Effect of\nOptimization.\n Methodology/results: A contemporary approach to combat such estimation errors\nis offered by distributionally robust problem formulations that consider all\ndata-generating distributions close to the empirical distribution derived from\nhistorical samples, where `closeness' is determined by the Wasserstein\ndistance. While those techniques show significant promise in problems where all\ninput features are continuous, they scale exponentially when binary and/or\ncategorical features are present. This paper demonstrates that such\nmixed-feature problems can indeed be solved in polynomial time. We present a\npractically efficient algorithm to solve mixed-feature problems, and we compare\nour method against alternative techniques both theoretically and empirically on\nstandard benchmark instances.\n Managerial implications: Data-driven operations management problems often\ninvolve prediction models with discrete features. We develop and analyze a\nmethodology that faithfully accounts for the presence of discrete features, and\nwe demonstrate that our approach can significantly outperform existing methods\nthat are agnostic to the presence of discrete features, both theoretically and\nacross standard benchmark instances.\n","authors":["Reza Belbasi","Aras Selvi","Wolfram Wiesemann"],"pdf_url":"https://arxiv.org/pdf/2312.12230v1.pdf","comment":"48 pages (31 main + proofs), 7 tables, 2 colored plots, an early\n version appeared in NeurIPS 2022 main track (arXiv 2205.13501)"},{"id":"http://arxiv.org/abs/2305.14160v4","updated":"2023-12-19T15:13:52Z","published":"2023-05-23T15:26:20Z","title":"Label Words are Anchors: An Information Flow Perspective for\n Understanding In-Context Learning","summary":" In-context learning (ICL) emerges as a promising capability of large language\nmodels (LLMs) by providing them with demonstration examples to perform diverse\ntasks. However, the underlying mechanism of how LLMs learn from the provided\ncontext remains under-explored. In this paper, we investigate the working\nmechanism of ICL through an information flow lens. Our findings reveal that\nlabel words in the demonstration examples function as anchors: (1) semantic\ninformation aggregates into label word representations during the shallow\ncomputation layers' processing; (2) the consolidated information in label words\nserves as a reference for LLMs' final predictions. Based on these insights, we\nintroduce an anchor re-weighting method to improve ICL performance, a\ndemonstration compression technique to expedite inference, and an analysis\nframework for diagnosing ICL errors in GPT2-XL. The promising applications of\nour findings again validate the uncovered ICL working mechanism and pave the\nway for future studies.\n","authors":["Lean Wang","Lei Li","Damai Dai","Deli Chen","Hao Zhou","Fandong Meng","Jie Zhou","Xu Sun"],"pdf_url":"https://arxiv.org/pdf/2305.14160v4.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.12226v1","updated":"2023-12-19T15:12:39Z","published":"2023-12-19T15:12:39Z","title":"On the Parameterization of Second-Order Optimization Effective Towards\n the Infinite Width","summary":" Second-order optimization has been developed to accelerate the training of\ndeep neural networks and it is being applied to increasingly larger-scale\nmodels. In this study, towards training on further larger scales, we identify a\nspecific parameterization for second-order optimization that promotes feature\nlearning in a stable manner even if the network width increases significantly.\nInspired by a maximal update parameterization, we consider a one-step update of\nthe gradient and reveal the appropriate scales of hyperparameters including\nrandom initialization, learning rates, and damping terms. Our approach covers\ntwo major second-order optimization algorithms, K-FAC and Shampoo, and we\ndemonstrate that our parameterization achieves higher generalization\nperformance in feature learning. In particular, it enables us to transfer the\nhyperparameters across models with different widths.\n","authors":["Satoki Ishikawa","Ryo Karakida"],"pdf_url":"https://arxiv.org/pdf/2312.12226v1.pdf","comment":"34 pages"},{"id":"http://arxiv.org/abs/2305.17205v2","updated":"2023-12-19T15:12:37Z","published":"2023-05-26T18:53:35Z","title":"Ghost Noise for Regularizing Deep Neural Networks","summary":" Batch Normalization (BN) is widely used to stabilize the optimization process\nand improve the test performance of deep neural networks. The regularization\neffect of BN depends on the batch size and explicitly using smaller batch sizes\nwith Batch Normalization, a method known as Ghost Batch Normalization (GBN),\nhas been found to improve generalization in many settings. We investigate the\neffectiveness of GBN by disentangling the induced ``Ghost Noise'' from\nnormalization and quantitatively analyzing the distribution of noise as well as\nits impact on model performance. Inspired by our analysis, we propose a new\nregularization technique called Ghost Noise Injection (GNI) that imitates the\nnoise in GBN without incurring the detrimental train-test discrepancy effects\nof small batch training. We experimentally show that GNI can provide a greater\ngeneralization benefit than GBN. Ghost Noise Injection can also be beneficial\nin otherwise non-noisy settings such as layer-normalized networks, providing\nadditional evidence of the usefulness of Ghost Noise in Batch Normalization as\na regularizer.\n","authors":["Atli Kosson","Dongyang Fan","Martin Jaggi"],"pdf_url":"https://arxiv.org/pdf/2305.17205v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12223v1","updated":"2023-12-19T15:11:46Z","published":"2023-12-19T15:11:46Z","title":"Self-Supervised Detection of Perfect and Partial Input-Dependent\n Symmetries","summary":" Group equivariance ensures consistent responses to group transformations of\nthe input, leading to more robust models and enhanced generalization\ncapabilities. However, this property can lead to overly constrained models if\nthe symmetries considered in the group differ from those observed in data.\nWhile common methods address this by determining the appropriate level of\nsymmetry at the dataset level, they are limited to supervised settings and\nignore scenarios in which multiple levels of symmetry co-exist in the same\ndataset. For instance, pictures of cars and planes exhibit different levels of\nrotation, yet both are included in the CIFAR-10 dataset. In this paper, we\npropose a method able to detect the level of symmetry of each input without the\nneed for labels. To this end, we derive a sufficient and necessary condition to\nlearn the distribution of symmetries in the data. Using the learned\ndistribution, we generate pseudo-labels that allow us to learn the levels of\nsymmetry of each input in a self-supervised manner. We validate the\neffectiveness of our approach on synthetic datasets with different per-class\nlevels of symmetries e.g. MNISTMultiple, in which digits are uniformly rotated\nwithin a class-dependent interval. We demonstrate that our method can be used\nfor practical applications such as the generation of standardized datasets in\nwhich the symmetries are not present, as well as the detection of\nout-of-distribution symmetries during inference. By doing so, both the\ngeneralization and robustness of non-equivariant models can be improved. Our\ncode is publicly available at https://github.com/aurban0/ssl-sym.\n","authors":["Alonso Urbano","David W. Romero"],"pdf_url":"https://arxiv.org/pdf/2312.12223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12216v1","updated":"2023-12-19T15:05:52Z","published":"2023-12-19T15:05:52Z","title":"Sharing is CAIRing: Characterizing Principles and Assessing Properties\n of Universal Privacy Evaluation for Synthetic Tabular Data","summary":" Data sharing is a necessity for innovative progress in many domains,\nespecially in healthcare. However, the ability to share data is hindered by\nregulations protecting the privacy of natural persons. Synthetic tabular data\nprovide a promising solution to address data sharing difficulties but does not\ninherently guarantee privacy. Still, there is a lack of agreement on\nappropriate methods for assessing the privacy-preserving capabilities of\nsynthetic data, making it difficult to compare results across studies. To the\nbest of our knowledge, this is the first work to identify properties that\nconstitute good universal privacy evaluation metrics for synthetic tabular\ndata. The goal of such metrics is to enable comparability across studies and to\nallow non-technical stakeholders to understand how privacy is protected. We\nidentify four principles for the assessment of metrics: Comparability,\nApplicability, Interpretability, and Representativeness (CAIR). To quantify and\nrank the degree to which evaluation metrics conform to the CAIR principles, we\ndesign a rubric using a scale of 1-4. Each of the four properties is scored on\nfour parameters, yielding 16 total dimensions. We study the applicability and\nusefulness of the CAIR principles and rubric by assessing a selection of\nmetrics popular in other studies. The results provide granular insights into\nthe strengths and weaknesses of existing metrics that not only rank the metrics\nbut highlight areas of potential improvements. We expect that the CAIR\nprinciples will foster agreement among researchers and organizations on which\nuniversal privacy evaluation metrics are appropriate for synthetic tabular\ndata.\n","authors":["Tobias Hyrup","Anton Danholt Lautrup","Arthur Zimek","Peter Schneider-Kamp"],"pdf_url":"https://arxiv.org/pdf/2312.12216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.10090v4","updated":"2023-12-19T14:55:15Z","published":"2023-11-16T18:58:43Z","title":"JaxMARL: Multi-Agent RL Environments in JAX","summary":" Benchmarks play an important role in the development of machine learning\nalgorithms. For example, research in reinforcement learning (RL) has been\nheavily influenced by available environments and benchmarks. However, RL\nenvironments are traditionally run on the CPU, limiting their scalability with\ntypical academic compute. Recent advancements in JAX have enabled the wider use\nof hardware acceleration to overcome these computational hurdles, enabling\nmassively parallel RL training pipelines and environments. This is particularly\nuseful for multi-agent reinforcement learning (MARL) research. First of all,\nmultiple agents must be considered at each environment step, adding\ncomputational burden, and secondly, the sample complexity is increased due to\nnon-stationarity, decentralised partial observability, or other MARL\nchallenges. In this paper, we present JaxMARL, the first open-source code base\nthat combines ease-of-use with GPU enabled efficiency, and supports a large\nnumber of commonly used MARL environments as well as popular baseline\nalgorithms. When considering wall clock time, our experiments show that per-run\nour JAX-based training pipeline is up to 12500x faster than existing\napproaches. This enables efficient and thorough evaluations, with the potential\nto alleviate the evaluation crisis of the field. We also introduce and\nbenchmark SMAX, a vectorised, simplified version of the popular StarCraft\nMulti-Agent Challenge, which removes the need to run the StarCraft II game\nengine. This not only enables GPU acceleration, but also provides a more\nflexible MARL environment, unlocking the potential for self-play,\nmeta-learning, and other future applications in MARL. We provide code at\nhttps://github.com/flairox/jaxmarl.\n","authors":["Alexander Rutherford","Benjamin Ellis","Matteo Gallici","Jonathan Cook","Andrei Lupu","Gardar Ingvarsson","Timon Willi","Akbir Khan","Christian Schroeder de Witt","Alexandra Souly","Saptarashmi Bandyopadhyay","Mikayel Samvelyan","Minqi Jiang","Robert Tjarko Lange","Shimon Whiteson","Bruno Lacerda","Nick Hawes","Tim Rocktaschel","Chris Lu","Jakob Nicolaus Foerster"],"pdf_url":"https://arxiv.org/pdf/2311.10090v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12206v1","updated":"2023-12-19T14:44:26Z","published":"2023-12-19T14:44:26Z","title":"Identification of Causal Structure in the Presence of Missing Data with\n Additive Noise Model","summary":" Missing data are an unavoidable complication frequently encountered in many\ncausal discovery tasks. When a missing process depends on the missing values\nthemselves (known as self-masking missingness), the recovery of the joint\ndistribution becomes unattainable, and detecting the presence of such\nself-masking missingness remains a perplexing challenge. Consequently, due to\nthe inability to reconstruct the original distribution and to discern the\nunderlying missingness mechanism, simply applying existing causal discovery\nmethods would lead to wrong conclusions. In this work, we found that the recent\nadvances additive noise model has the potential for learning causal structure\nunder the existence of the self-masking missingness. With this observation, we\naim to investigate the identification problem of learning causal structure from\nmissing data under an additive noise model with different missingness\nmechanisms, where the `no self-masking missingness' assumption can be\neliminated appropriately. Specifically, we first elegantly extend the scope of\nidentifiability of causal skeleton to the case with weak self-masking\nmissingness (i.e., no other variable could be the cause of self-masking\nindicators except itself). We further provide the sufficient and necessary\nidentification conditions of the causal direction under additive noise model\nand show that the causal structure can be identified up to an IN-equivalent\npattern. We finally propose a practical algorithm based on the above\ntheoretical results on learning the causal skeleton and causal direction.\nExtensive experiments on synthetic and real data demonstrate the efficiency and\neffectiveness of the proposed algorithms.\n","authors":["Jie Qiao","Zhengming Chen","Jianhua Yu","Ruichu Cai","Zhifeng Hao"],"pdf_url":"https://arxiv.org/pdf/2312.12206v1.pdf","comment":"Accepted by AAAI-2024"},{"id":"http://arxiv.org/abs/2210.01905v3","updated":"2023-12-19T14:39:40Z","published":"2022-10-04T20:56:24Z","title":"Polar Encoding: A Simple Baseline Approach for Classification with\n Missing Values","summary":" We propose polar encoding, a representation of categorical and numerical\n$[0,1]$-valued attributes with missing values to be used in a classification\ncontext. We argue that this is a good baseline approach, because it can be used\nwith any classification algorithm, preserves missingness information, is very\nsimple to apply and offers good performance. In particular, unlike the existing\nmissing-indicator approach, it does not require imputation, ensures that\nmissing values are equidistant from non-missing values, and lets decision tree\nalgorithms choose how to split missing values, thereby providing a practical\nrealisation of the \"missingness incorporated in attributes\" (MIA) proposal.\nFurthermore, we show that categorical and $[0,1]$-valued attributes can be\nviewed as special cases of a single attribute type, corresponding to the\nclassical concept of barycentric coordinates, and that this offers a natural\ninterpretation of polar encoding as a fuzzified form of one-hot encoding. With\nan experiment based on twenty real-life datasets with missing values, we show\nthat, in terms of the resulting classification performance, polar encoding\nperforms better than the state-of-the-art strategies \\e{multiple imputation by\nchained equations} (MICE) and \\e{multiple imputation with denoising\nautoencoders} (MIDAS) and -- depending on the classifier -- about as well or\nbetter than mean/mode imputation with missing-indicators.\n","authors":["Oliver Urs Lenz","Daniel Peralta","Chris Cornelis"],"pdf_url":"https://arxiv.org/pdf/2210.01905v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.06009v2","updated":"2023-12-19T14:29:53Z","published":"2022-06-13T09:55:04Z","title":"Relative Policy-Transition Optimization for Fast Policy Transfer","summary":" We consider the problem of policy transfer between two Markov Decision\nProcesses (MDPs). We introduce a lemma based on existing theoretical results in\nreinforcement learning to measure the relativity gap between two arbitrary\nMDPs, that is the difference between any two cumulative expected returns\ndefined on different policies and environment dynamics. Based on this lemma, we\npropose two new algorithms referred to as Relative Policy Optimization (RPO)\nand Relative Transition Optimization (RTO), which offer fast policy transfer\nand dynamics modelling, respectively. RPO transfers the policy evaluated in one\nenvironment to maximize the return in another, while RTO updates the\nparameterized dynamics model to reduce the gap between the dynamics of the two\nenvironments. Integrating the two algorithms results in the complete Relative\nPolicy-Transition Optimization (RPTO) algorithm, in which the policy interacts\nwith the two environments simultaneously, such that data collections from two\nenvironments, policy and transition updates are completed in one closed loop to\nform a principled learning framework for policy transfer. We demonstrate the\neffectiveness of RPTO on a set of MuJoCo continuous control tasks by creating\npolicy transfer problems via variant dynamics.\n","authors":["Jiawei Xu","Cheng Zhou","Yizheng Zhang","Baoxiang Wang","Lei Han"],"pdf_url":"https://arxiv.org/pdf/2206.06009v2.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12193v1","updated":"2023-12-19T14:27:26Z","published":"2023-12-19T14:27:26Z","title":"Gaussian process learning of nonlinear dynamics","summary":" One of the pivotal tasks in scientific machine learning is to represent\nunderlying dynamical systems from time series data. Many methods for such\ndynamics learning explicitly require the derivatives of state data, which are\nnot directly available and can be approximated conventionally by finite\ndifferences. However, the discrete approximations of time derivatives may\nresult in a poor estimation when state data are scarce and/or corrupted by\nnoise, thus compromising the predictiveness of the learned dynamical models. To\novercome this technical hurdle, we propose a new method that learns nonlinear\ndynamics through a Bayesian inference of characterizing model parameters. This\nmethod leverages a Gaussian process representation of states, and constructs a\nlikelihood function using the correlation between state data and their\nderivatives, yet prevents explicit evaluations of time derivatives. Through a\nBayesian scheme, a probabilistic estimate of the model parameters is given by\nthe posterior distribution, and thus a quantification is facilitated for\nuncertainties from noisy state data and the learning process. Specifically, we\nwill discuss the applicability of the proposed method to two typical scenarios\nfor dynamical systems: parameter identification and estimation with an affine\nstructure of the system, and nonlinear parametric approximation without prior\nknowledge.\n","authors":["Dongwei Ye","Mengwu Guo"],"pdf_url":"https://arxiv.org/pdf/2312.12193v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.16454v3","updated":"2023-12-19T14:27:21Z","published":"2023-03-29T04:43:03Z","title":"Conductivity Imaging from Internal Measurements with Mixed Least-Squares\n Deep Neural Networks","summary":" In this work we develop a novel approach using deep neural networks to\nreconstruct the conductivity distribution in elliptic problems from one\nmeasurement of the solution over the whole domain. The approach is based on a\nmixed reformulation of the governing equation and utilizes the standard\nleast-squares objective, with deep neural networks as ansatz functions to\napproximate the conductivity and flux simultaneously. We provide a thorough\nanalysis of the deep neural network approximations of the conductivity for both\ncontinuous and empirical losses, including rigorous error estimates that are\nexplicit in terms of the noise level, various penalty parameters and neural\nnetwork architectural parameters (depth, width and parameter bound). We also\nprovide multiple numerical experiments in two- and multi-dimensions to\nillustrate distinct features of the approach, e.g., excellent stability with\nrespect to data noise and capability of solving high-dimensional problems.\n","authors":["Bangti Jin","Xiyao Li","Qimeng Quan","Zhi Zhou"],"pdf_url":"https://arxiv.org/pdf/2303.16454v3.pdf","comment":"corrected a few typos"},{"id":"http://arxiv.org/abs/2312.12191v1","updated":"2023-12-19T14:26:23Z","published":"2023-12-19T14:26:23Z","title":"CUDC: A Curiosity-Driven Unsupervised Data Collection Method with\n Adaptive Temporal Distances for Offline Reinforcement Learning","summary":" Offline reinforcement learning (RL) aims to learn an effective policy from a\npre-collected dataset. Most existing works are to develop sophisticated\nlearning algorithms, with less emphasis on improving the data collection\nprocess. Moreover, it is even challenging to extend the single-task setting and\ncollect a task-agnostic dataset that allows an agent to perform multiple\ndownstream tasks. In this paper, we propose a Curiosity-driven Unsupervised\nData Collection (CUDC) method to expand feature space using adaptive temporal\ndistances for task-agnostic data collection and ultimately improve learning\nefficiency and capabilities for multi-task offline RL. To achieve this, CUDC\nestimates the probability of the k-step future states being reachable from the\ncurrent states, and adapts how many steps into the future that the dynamics\nmodel should predict. With this adaptive reachability mechanism in place, the\nfeature representation can be diversified, and the agent can navigate itself to\ncollect higher-quality data with curiosity. Empirically, CUDC surpasses\nexisting unsupervised methods in efficiency and learning performance in various\ndownstream offline RL tasks of the DeepMind control suite.\n","authors":["Chenyu Sun","Hangwei Qian","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2312.12191v1.pdf","comment":"Accepted at AAAI-24"},{"id":"http://arxiv.org/abs/2312.12190v1","updated":"2023-12-19T14:25:41Z","published":"2023-12-19T14:25:41Z","title":"Decentralised and collaborative machine learning framework for IoT","summary":" Decentralised machine learning has recently been proposed as a potential\nsolution to the security issues of the canonical federated learning approach.\nIn this paper, we propose a decentralised and collaborative machine learning\nframework specially oriented to resource-constrained devices, usual in IoT\ndeployments. With this aim we propose the following construction blocks. First,\nan incremental learning algorithm based on prototypes that was specifically\nimplemented to work in low-performance computing elements. Second, two\nrandom-based protocols to exchange the local models among the computing\nelements in the network. Finally, two algorithmics approaches for prediction\nand prototype creation. This proposal was compared to a typical centralized\nincremental learning approach in terms of accuracy, training time and\nrobustness with very promising results.\n","authors":["Martín González-Soto","Rebeca P. Díaz-Redondo","Manuel Fernández-Veiga","Bruno Rodríguez-Castro","Ana Fernández-Vilas"],"pdf_url":"https://arxiv.org/pdf/2312.12190v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12183v1","updated":"2023-12-19T14:15:20Z","published":"2023-12-19T14:15:20Z","title":"Poincaré Differential Privacy for Hierarchy-aware Graph Embedding","summary":" Hierarchy is an important and commonly observed topological property in\nreal-world graphs that indicate the relationships between supervisors and\nsubordinates or the organizational behavior of human groups. As hierarchy is\nintroduced as a new inductive bias into the Graph Neural Networks (GNNs) in\nvarious tasks, it implies latent topological relations for attackers to improve\ntheir inference attack performance, leading to serious privacy leakage issues.\nIn addition, existing privacy-preserving frameworks suffer from reduced\nprotection ability in hierarchical propagation due to the deficiency of\nadaptive upper-bound estimation of the hierarchical perturbation boundary. It\nis of great urgency to effectively leverage the hierarchical property of data\nwhile satisfying privacy guarantees. To solve the problem, we propose the\nPoincar\\'e Differential Privacy framework, named PoinDP, to protect the\nhierarchy-aware graph embedding based on hyperbolic geometry. Specifically,\nPoinDP first learns the hierarchy weights for each entity based on the\nPoincar\\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware\nSensitivity is designed to measure the sensitivity of the hierarchical\nstructure and adaptively allocate the privacy protection strength. Besides, the\nHyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian\nmechanism in Euclidean space to hyperbolic space to realize random\nperturbations that satisfy differential privacy under the hyperbolic space\nmetric. Extensive experiment results on five real-world datasets demonstrate\nthe proposed PoinDP's advantages of effective privacy protection while\nmaintaining good performance on the node classification task.\n","authors":["Yuecen Wei","Haonan Yuan","Xingcheng Fu","Qingyun Sun","Hao Peng","Xianxian Li","Chunming Hu"],"pdf_url":"https://arxiv.org/pdf/2312.12183v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17658v4","updated":"2023-12-19T14:14:44Z","published":"2023-10-18T15:24:34Z","title":"Is Channel Independent strategy optimal for Time Series Forecasting?","summary":" There has been an emergence of various models for long-term time series\nforecasting. Recent studies have demonstrated that a single linear layer, using\nChannel Dependent (CD) or Channel Independent (CI) modeling, can even\noutperform a large number of sophisticated models. However, current research\nprimarily considers CD and CI as two complementary yet mutually exclusive\napproaches, unable to harness these two extremes simultaneously. And it is also\na challenging issue that both CD and CI are static strategies that cannot be\ndetermined to be optimal for a specific dataset without extensive experiments.\nIn this paper, we reconsider whether the current CI strategy is the best\nsolution for time series forecasting. First, we propose a simple yet effective\nstrategy called CSC, which stands for $\\mathbf{C}$hannel\n$\\mathbf{S}$elf-$\\mathbf{C}$lustering strategy, for linear models. Our Channel\nSelf-Clustering (CSC) enhances CI strategy's performance improvements while\nreducing parameter size, for exmpale by over 10 times on electricity dataset,\nand significantly cutting training time. Second, we further propose Channel\nRearrangement (CR), a method for deep models inspired by the self-clustering.\nCR attains competitive performance against baselines. Finally, we also discuss\nwhether it is best to forecast the future values using the historical values of\nthe same channel as inputs. We hope our findings and methods could inspire new\nsolutions beyond CD/CI.\n","authors":["Yuan Peiwen","Zhu Changsheng"],"pdf_url":"https://arxiv.org/pdf/2310.17658v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15289v4","updated":"2023-12-19T14:14:22Z","published":"2023-09-26T21:56:03Z","title":"SEPT: Towards Efficient Scene Representation Learning for Motion\n Prediction","summary":" Motion prediction is crucial for autonomous vehicles to operate safely in\ncomplex traffic environments. Extracting effective spatiotemporal relationships\namong traffic elements is key to accurate forecasting. Inspired by the\nsuccessful practice of pretrained large language models, this paper presents\nSEPT, a modeling framework that leverages self-supervised learning to develop\npowerful spatiotemporal understanding for complex traffic scenes. Specifically,\nour approach involves three masking-reconstruction modeling tasks on scene\ninputs including agents' trajectories and road network, pretraining the scene\nencoder to capture kinematics within trajectory, spatial structure of road\nnetwork, and interactions among roads and agents. The pretrained encoder is\nthen finetuned on the downstream forecasting task. Extensive experiments\ndemonstrate that SEPT, without elaborate architectural design or manual feature\nengineering, achieves state-of-the-art performance on the Argoverse 1 and\nArgoverse 2 motion forecasting benchmarks, outperforming previous methods on\nall main metrics by a large margin.\n","authors":["Zhiqian Lan","Yuxuan Jiang","Yao Mu","Chen Chen","Shengbo Eben Li"],"pdf_url":"https://arxiv.org/pdf/2309.15289v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14901v2","updated":"2023-12-19T14:12:04Z","published":"2023-05-24T08:55:08Z","title":"Chain-of-Questions Training with Latent Answers for Robust Multistep\n Question Answering","summary":" We train a language model (LM) to robustly answer multistep questions by\ngenerating and answering sub-questions. We propose Chain-of-Questions, a\nframework that trains a model to generate sub-questions and sub-answers one at\na time by leveraging human annotated question decomposition meaning\nrepresentation (QDMR). The key technical challenge is that QDMR only contains\nsub-questions but not answers to those sub-questions, so we treat sub-answers\nas latent variables and optimize them using a novel dynamic mixture of Hard-EM\nand MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods\nby 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA\nadversarial set, thus demonstrating the effectiveness and robustness of our\nframework.\n","authors":["Wang Zhu","Jesse Thomason","Robin Jia"],"pdf_url":"https://arxiv.org/pdf/2305.14901v2.pdf","comment":"Accepted by the EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.10130v2","updated":"2023-12-19T14:08:10Z","published":"2023-12-15T15:53:10Z","title":"Improving new physics searches with diffusion models for event\n observables and jet constituents","summary":" We introduce a new technique called Drapes to enhance the sensitivity in\nsearches for new physics at the LHC. By training diffusion models on side-band\ndata, we show how background templates for the signal region can be generated\neither directly from noise, or by partially applying the diffusion process to\nexisting data. In the partial diffusion case, data can be drawn from side-band\nregions, with the inverse diffusion performed for new target conditional\nvalues, or from the signal region, preserving the distribution over the\nconditional property that defines the signal region. We apply this technique to\nthe hunt for resonances using the LHCO di-jet dataset, and achieve\nstate-of-the-art performance for background template generation using high\nlevel input features. We also show how Drapes can be applied to low level\ninputs with jet constituents, reducing the model dependence on the choice of\ninput observables. Using jet constituents we can further improve sensitivity to\nthe signal process, but observe a loss in performance where the signal\nsignificance before applying any selection is below 4$\\sigma$.\n","authors":["Debajyoti Sengupta","Matthew Leigh","John Andrew Raine","Samuel Klein","Tobias Golling"],"pdf_url":"https://arxiv.org/pdf/2312.10130v2.pdf","comment":"34 pages, 19 figures"},{"id":"http://arxiv.org/abs/2312.10194v2","updated":"2023-12-19T14:02:20Z","published":"2023-12-15T20:41:09Z","title":"Pareto Envelope Augmented with Reinforcement Learning: Multi-objective\n reinforcement learning-based approach for Large-Scale Constrained Pressurized\n Water Reactor optimization","summary":" A novel method, the Pareto Envelope Augmented with Reinforcement Learning\n(PEARL), has been developed to address the challenges posed by multi-objective\nproblems, particularly in the field of engineering where the evaluation of\ncandidate solutions can be time-consuming. PEARL distinguishes itself from\ntraditional policy-based multi-objective Reinforcement Learning methods by\nlearning a single policy, eliminating the need for multiple neural networks to\nindependently solve simpler sub-problems. Several versions inspired from deep\nlearning and evolutionary techniques have been crafted, catering to both\nunconstrained and constrained problem domains. Curriculum Learning is harnessed\nto effectively manage constraints in these versions. PEARL's performance is\nfirst evaluated on classical multi-objective benchmarks. Additionally, it is\ntested on two practical PWR core Loading Pattern optimization problems to\nshowcase its real-world applicability. The first problem involves optimizing\nthe Cycle length and the rod-integrated peaking factor as the primary\nobjectives, while the second problem incorporates the mean average enrichment\nas an additional objective. Furthermore, PEARL addresses three types of\nconstraints related to boron concentration, peak pin burnup, and peak pin\npower. The results are systematically compared against a conventional approach,\nthe Non-dominated Sorting Genetic Algorithm. Notably, PEARL, specifically the\nPEARL-NdS variant, efficiently uncovers a Pareto front without necessitating\nadditional efforts from the algorithm designer, as opposed to a single\noptimization with scaled objectives. It also outperforms the classical approach\nacross multiple performance metrics, including the Hyper-volume.\n","authors":["Paul Seurin","Koroush Shirvan"],"pdf_url":"https://arxiv.org/pdf/2312.10194v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04690v2","updated":"2023-12-19T13:41:47Z","published":"2023-08-09T03:56:07Z","title":"Finite Element Operator Network for Solving Parametric PDEs","summary":" Partial differential equations (PDEs) underlie our understanding and\nprediction of natural phenomena across numerous fields, including physics,\nengineering, and finance. However, solving parametric PDEs is a complex task\nthat necessitates efficient numerical methods. In this paper, we propose a\nnovel approach for solving parametric PDEs using a Finite Element Operator\nNetwork (FEONet). Our proposed method leverages the power of deep learning in\nconjunction with traditional numerical methods, specifically the finite element\nmethod, to solve parametric PDEs in the absence of any paired input-output\ntraining data. We performed various experiments on several benchmark problems\nand confirmed that our approach has demonstrated excellent performance across\nvarious settings and environments, proving its versatility in terms of\naccuracy, generalization, and computational flexibility. Our FEONet framework\nshows potential for application in various fields where PDEs play a crucial\nrole in modeling complex domains with diverse boundary conditions and singular\nbehavior. Furthermore, we provide theoretical convergence analysis to support\nour approach, utilizing finite element approximation in numerical analysis.\n","authors":["Jae Yong Lee","Seungchan Ko","Youngjoon Hong"],"pdf_url":"https://arxiv.org/pdf/2308.04690v2.pdf","comment":"23 pages, 11 figures"},{"id":"http://arxiv.org/abs/2305.13030v4","updated":"2023-12-19T13:29:33Z","published":"2023-05-22T13:33:37Z","title":"Adaptive action supervision in reinforcement learning from real-world\n multi-agent demonstrations","summary":" Modeling of real-world biological multi-agents is a fundamental problem in\nvarious scientific and engineering fields. Reinforcement learning (RL) is a\npowerful framework to generate flexible and diverse behaviors in cyberspace;\nhowever, when modeling real-world biological multi-agents, there is a domain\ngap between behaviors in the source (i.e., real-world data) and the target\n(i.e., cyberspace for RL), and the source environment parameters are usually\nunknown. In this paper, we propose a method for adaptive action supervision in\nRL from real-world demonstrations in multi-agent scenarios. We adopt an\napproach that combines RL and supervised learning by selecting actions of\ndemonstrations in RL based on the minimum distance of dynamic time warping for\nutilizing the information of the unknown source dynamics. This approach can be\neasily applied to many existing neural network architectures and provide us\nwith an RL model balanced between reproducibility as imitation and\ngeneralization ability to obtain rewards in cyberspace. In the experiments,\nusing chase-and-escape and football tasks with the different dynamics between\nthe unknown source and target environments, we show that our approach achieved\na balance between the reproducibility and the generalization ability compared\nwith the baselines. In particular, we used the tracking data of professional\nfootball players as expert demonstrations in football and show successful\nperformances despite the larger gap between behaviors in the source and target\nenvironments than the chase-and-escape task.\n","authors":["Keisuke Fujii","Kazushi Tsutsui","Atom Scott","Hiroshi Nakahara","Naoya Takeishi","Yoshinobu Kawahara"],"pdf_url":"https://arxiv.org/pdf/2305.13030v4.pdf","comment":"14 pages, 5 figures, accepted in ICAART 2024 Oral"},{"id":"http://arxiv.org/abs/2312.12145v1","updated":"2023-12-19T13:28:34Z","published":"2023-12-19T13:28:34Z","title":"OVD-Explorer:Optimism Should Not Be the Sole Pursuit of Exploration in\n Noisy Environments","summary":" In reinforcement learning, the optimism in the face of uncertainty (OFU) is a\nmainstream principle for directing exploration towards less explored areas,\ncharacterized by higher uncertainty. However, in the presence of environmental\nstochasticity (noise), purely optimistic exploration may lead to excessive\nprobing of high-noise areas, consequently impeding exploration efficiency.\nHence, in exploring noisy environments, while optimism-driven exploration\nserves as a foundation, prudent attention to alleviating unnecessary\nover-exploration in high-noise areas becomes beneficial. In this work, we\npropose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a\nnoise-aware optimistic exploration for continuous control. OVD-Explorer\nproposes a new measurement of the policy's exploration ability considering\nnoise in optimistic perspectives, and leverages gradient ascent to drive\nexploration. Practically, OVD-Explorer can be easily integrated with continuous\ncontrol RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks\ndemonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic\nexploration.\n","authors":["Jinyi Liu","Zhi Wang","Yan Zheng","Jianye Hao","Chenjia Bai","Junjie Ye","Zhen Wang","Haiyin Piao","Yang Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12145v1.pdf","comment":"Accepted by AAAI 2024, with appendix"},{"id":"http://arxiv.org/abs/2312.12141v1","updated":"2023-12-19T13:23:18Z","published":"2023-12-19T13:23:18Z","title":"Exploring the Residual Stream of Transformers","summary":" Transformer-based models have achieved great breakthroughs in recent years.\nHowever, there are many significant questions that have not been answered in\nthe field of explaining the reason why the models have powerful outputs. We do\nnot know how to locate the models' important parameters storing the knowledge\nfor predicting the next word, and whether these parameters are stored on the\nsame layer/module or different ones. Moreover, we do not understand the\nmechanism to merge the knowledge into the final embedding for next word\nprediction. In this paper, we explore the residual stream of transformers to\nincrease the interpretability. We find the mechanism behind residual connection\nis a direct addition function on before-softmax values, so the probabilities of\ntokens with larger before-softmax values will increase. Moreover, we prove that\nusing log probability increase as contribution scores is reasonable, and based\non this we can locate important parameters. Besides, we propose a method to\nanalyze how previous layers affect upper layers by comparing the inner\nproducts. The experimental results and case study show that our research can\nincrease the interpretability of transformer-based models. We will release our\ncode on https://github.com/zepingyu0512/residualstream.\n","authors":["Zeping Yu","Kailai Yang","Zhiwei Liu","Sophia Ananiadou"],"pdf_url":"https://arxiv.org/pdf/2312.12141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12137v1","updated":"2023-12-19T13:17:43Z","published":"2023-12-19T13:17:43Z","title":"Best Arm Identification with Fixed Budget: A Large Deviation Perspective","summary":" We consider the problem of identifying the best arm in stochastic Multi-Armed\nBandits (MABs) using a fixed sampling budget. Characterizing the minimal\ninstance-specific error probability for this problem constitutes one of the\nimportant remaining open problems in MABs. When arms are selected using a\nstatic sampling strategy, the error probability decays exponentially with the\nnumber of samples at a rate that can be explicitly derived via Large Deviation\ntechniques. Analyzing the performance of algorithms with adaptive sampling\nstrategies is however much more challenging. In this paper, we establish a\nconnection between the Large Deviation Principle (LDP) satisfied by the\nempirical proportions of arm draws and that satisfied by the empirical arm\nrewards. This connection holds for any adaptive algorithm, and is leveraged (i)\nto improve error probability upper bounds of some existing algorithms, such as\nthe celebrated \\sr (Successive Rejects) algorithm \\citep{audibert2010best}, and\n(ii) to devise and analyze new algorithms. In particular, we present \\sred\n(Continuous Rejects), a truly adaptive algorithm that can reject arms in {\\it\nany} round based on the observed empirical gaps between the rewards of various\narms. Applying our Large Deviation results, we prove that \\sred enjoys better\nperformance guarantees than existing algorithms, including \\sr. Extensive\nnumerical experiments confirm this observation.\n","authors":["Po-An Wang","Ruo-Chun Tzeng","Alexandre Proutiere"],"pdf_url":"https://arxiv.org/pdf/2312.12137v1.pdf","comment":"This work has been published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.12135v1","updated":"2023-12-19T13:14:52Z","published":"2023-12-19T13:14:52Z","title":"Object Detection for Automated Coronary Artery Using Deep Learning","summary":" In the era of digital medicine, medical imaging serves as a widespread\ntechnique for early disease detection, with a substantial volume of images\nbeing generated and stored daily in electronic patient records. X-ray\nangiography imaging is a standard and one of the most common methods for\nrapidly diagnosing coronary artery diseases. The notable achievements of recent\ndeep learning algorithms align with the increased use of electronic health\nrecords and diagnostic imaging. Deep neural networks, leveraging abundant data,\nadvanced algorithms, and powerful computational capabilities, prove highly\neffective in the analysis and interpretation of images. In this context, Object\ndetection methods have become a promising approach, particularly through\nconvolutional neural networks (CNN), streamlining medical image analysis by\neliminating manual feature extraction. This allows for direct feature\nextraction from images, ensuring high accuracy in results. Therefore, in our\npaper, we utilized the object detection method on X-ray angiography images to\nprecisely identify the location of coronary artery stenosis. As a result, this\nmodel enables automatic and real-time detection of stenosis locations,\nassisting in the crucial and sensitive decision-making process for healthcare\nprofessionals.\n","authors":["Hadis Keshavarz","Hossein Sadr"],"pdf_url":"https://arxiv.org/pdf/2312.12135v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10418v2","updated":"2023-12-19T13:11:49Z","published":"2023-12-16T11:13:40Z","title":"Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge\n Computing","summary":" Mobile edge computing (MEC) is a promising paradigm for real-time\napplications with intensive computational needs (e.g., autonomous driving), as\nit can reduce the processing delay. In this work, we focus on the timeliness of\ncomputational-intensive updates, measured by Age-ofInformation (AoI), and study\nhow to jointly optimize the task updating and offloading policies for AoI with\nfractional form. Specifically, we consider edge load dynamics and formulate a\ntask scheduling problem to minimize the expected time-average AoI. The\nuncertain edge load dynamics, the nature of the fractional objective, and\nhybrid continuous-discrete action space (due to the joint optimization) make\nthis problem challenging and existing approaches not directly applicable. To\nthis end, we propose a fractional reinforcement learning(RL) framework and\nprove its convergence. We further design a model-free fractional deep RL (DRL)\nalgorithm, where each device makes scheduling decisions with the hybrid action\nspace without knowing the system dynamics and decisions of other devices.\nExperimental results show that our proposed algorithms reduce the average AoI\nby up to 57.6% compared with several non-fractional benchmarks.\n","authors":["Lyudong Jin","Ming Tang","Meng Zhang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.10418v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12133v1","updated":"2023-12-19T13:11:35Z","published":"2023-12-19T13:11:35Z","title":"Object-Aware Domain Generalization for Object Detection","summary":" Single-domain generalization (S-DG) aims to generalize a model to unseen\nenvironments with a single-source domain. However, most S-DG approaches have\nbeen conducted in the field of classification. When these approaches are\napplied to object detection, the semantic features of some objects can be\ndamaged, which can lead to imprecise object localization and misclassification.\nTo address these problems, we propose an object-aware domain generalization\n(OA-DG) method for single-domain generalization in object detection. Our method\nconsists of data augmentation and training strategy, which are called OA-Mix\nand OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level\ntransformation and object-aware mixing strategy. OA-Loss enables models to\nlearn domain-invariant representations for objects and backgrounds from the\noriginal and OA-Mixed images. Our proposed method outperforms state-of-the-art\nworks on standard benchmarks. Our code is available at\nhttps://github.com/WoojuLee24/OA-DG.\n","authors":["Wooju Lee","Dasol Hong","Hyungtae Lim","Hyun Myung"],"pdf_url":"https://arxiv.org/pdf/2312.12133v1.pdf","comment":"Accepted by AAAI-24. The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.00757v2","updated":"2023-12-19T13:10:23Z","published":"2023-10-01T18:27:59Z","title":"Mind the Gap: Federated Learning Broadens Domain Generalization in\n Diagnostic AI Models","summary":" Developing robust artificial intelligence (AI) models that generalize well to\nunseen datasets is challenging and usually requires large and variable\ndatasets, preferably from multiple institutions. In federated learning (FL), a\nmodel is trained collaboratively at numerous sites that hold local datasets\nwithout exchanging them. So far, the impact of training strategy, i.e., local\nversus collaborative, on the diagnostic on-domain and off-domain performance of\nAI models interpreting chest radiographs has not been assessed. Consequently,\nusing 610,000 chest radiographs from five institutions across the globe, we\nassessed diagnostic performance as a function of training strategy (i.e., local\nvs. collaborative), network architecture (i.e., convolutional vs.\ntransformer-based), generalization performance (i.e., on-domain vs.\noff-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia,\natelectasis, consolidation, pneumothorax, and no abnormality), dataset size\n(i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large\ndatasets not only showed minimal performance gains with FL but, in some\ninstances, even exhibited decreases. In contrast, smaller datasets revealed\nmarked improvements. Thus, on-domain performance was mainly driven by training\ndata size. However, off-domain performance leaned more on training diversity.\nWhen trained collaboratively across diverse external institutions, AI models\nconsistently surpassed models trained locally for off-domain tasks, emphasizing\nFL's potential in leveraging data diversity. In conclusion, FL can bolster\ndiagnostic privacy, reproducibility, and off-domain reliability of AI models\nand, potentially, optimize healthcare outcomes.\n","authors":["Soroosh Tayebi Arasteh","Christiane Kuhl","Marwin-Jonathan Saehn","Peter Isfort","Daniel Truhn","Sven Nebelung"],"pdf_url":"https://arxiv.org/pdf/2310.00757v2.pdf","comment":"Published in Nature Scientific Reports"},{"id":"http://arxiv.org/abs/2302.04977v3","updated":"2023-12-19T13:05:06Z","published":"2023-02-09T23:34:17Z","title":"Mithridates: Auditing and Boosting Backdoor Resistance of Machine\n Learning Pipelines","summary":" Machine learning (ML) models trained on data from potentially untrusted\nsources are vulnerable to poisoning. A small, maliciously crafted subset of the\ntraining inputs can cause the model to learn a \"backdoor\" task (e.g.,\nmisclassify inputs with a certain feature) in addition to its main task. Recent\nresearch proposed many hypothetical backdoor attacks whose efficacy heavily\ndepends on the configuration and training hyperparameters of the target model.\n Given the variety of potential backdoor attacks, ML engineers who are not\nsecurity experts have no way to measure how vulnerable their current training\npipelines are, nor do they have a practical way to compare training\nconfigurations so as to pick the more resistant ones. Deploying a defense\nrequires evaluating and choosing from among dozens of research papers and\nre-engineering the training pipeline.\n In this paper, we aim to provide ML engineers with pragmatic tools to audit\nthe backdoor resistance of their training pipelines and to compare different\ntraining configurations, to help choose one that best balances accuracy and\nsecurity.\n First, we propose a universal, attack-agnostic resistance metric based on the\nminimum number of training inputs that must be compromised before the model\nlearns any backdoor.\n Second, we design, implement, and evaluate Mithridates a multi-stage approach\nthat integrates backdoor resistance into the training-configuration search. ML\ndevelopers already rely on hyperparameter search to find configurations that\nmaximize the model's accuracy. Mithridates extends this standard tool to\nbalance accuracy and resistance without disruptive changes to the training\npipeline. We show that hyperparameters found by Mithridates increase resistance\nto multiple types of backdoor attacks by 3-5x with only a slight impact on\naccuracy. We also discuss extensions to AutoML and federated learning.\n","authors":["Eugene Bagdasaryan","Vitaly Shmatikov"],"pdf_url":"https://arxiv.org/pdf/2302.04977v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.05428v2","updated":"2023-12-19T13:00:21Z","published":"2023-01-25T03:21:42Z","title":"STERLING: Synergistic Representation Learning on Bipartite Graphs","summary":" A fundamental challenge of bipartite graph representation learning is how to\nextract informative node embeddings. Self-Supervised Learning (SSL) is a\npromising paradigm to address this challenge. Most recent bipartite graph SSL\nmethods are based on contrastive learning which learns embeddings by\ndiscriminating positive and negative node pairs. Contrastive learning usually\nrequires a large number of negative node pairs, which could lead to\ncomputational burden and semantic errors. In this paper, we introduce a novel\nsynergistic representation learning model (STERLING) to learn node embeddings\nwithout negative node pairs. STERLING preserves the unique local and global\nsynergies in bipartite graphs. The local synergies are captured by maximizing\nthe similarity of the inter-type and intra-type positive node pairs, and the\nglobal synergies are captured by maximizing the mutual information of\nco-clusters. Theoretical analysis demonstrates that STERLING could improve the\nconnectivity between different node types in the embedding space. Extensive\nempirical evaluation on various benchmark datasets and tasks demonstrates the\neffectiveness of STERLING for extracting node embeddings.\n","authors":["Baoyu Jing","Yuchen Yan","Kaize Ding","Chanyoung Park","Yada Zhu","Huan Liu","Hanghang Tong"],"pdf_url":"https://arxiv.org/pdf/2302.05428v2.pdf","comment":"Accepted by AAAI'2024"},{"id":"http://arxiv.org/abs/2304.05805v2","updated":"2023-12-19T12:59:42Z","published":"2023-04-12T12:29:58Z","title":"GDP nowcasting with artificial neural networks: How much does long-term\n memory matter?","summary":" In our study, we apply artificial neural networks (ANNs) to nowcast quarterly\nGDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare\nthe nowcasting performance of five different ANN architectures: the multilayer\nperceptron (MLP), the one-dimensional convolutional neural network (1D CNN),\nthe Elman recurrent neural network (RNN), the long short-term memory network\n(LSTM), and the gated recurrent unit (GRU). The empirical analysis presents the\nresults from two distinctively different evaluation periods. The first (2012:Q1\n-- 2019:Q4) is characterized by balanced economic growth, while the second\n(2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According\nto our results, longer input sequences result in more accurate nowcasts in\nperiods of balanced economic growth. However, this effect ceases above a\nrelatively low threshold value of around six quarters (eighteen months). During\nperiods of economic turbulence (e.g., during the COVID-19 recession), longer\ninput sequences do not help the models' predictive performance; instead, they\nseem to weaken their generalization capability. Combined results from the two\nevaluation periods indicate that architectural features enabling for long-term\nmemory do not result in more accurate nowcasts. On the other hand, the 1D CNN\nhas proved to be a highly suitable model for GDP nowcasting. The network has\nshown good nowcasting performance among the competitors during the first\nevaluation period and achieved the overall best accuracy during the second\nevaluation period. Consequently, first in the literature, we propose the\napplication of the 1D CNN for economic nowcasting.\n","authors":["Kristóf Németh","Dániel Hadházi"],"pdf_url":"https://arxiv.org/pdf/2304.05805v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2106.08901 by other authors"},{"id":"http://arxiv.org/abs/2312.12123v1","updated":"2023-12-19T12:56:56Z","published":"2023-12-19T12:56:56Z","title":"Probabilistic Prediction of Longitudinal Trajectory Considering Driving\n Heterogeneity with Interpretability","summary":" Automated vehicles are envisioned to navigate safely in complex mixed-traffic\nscenarios alongside human-driven vehicles. To promise a high degree of safety,\naccurately predicting the maneuvers of surrounding vehicles and their future\npositions is a critical task and attracts much attention. However, most\nexisting studies focused on reasoning about positional information based on\nobjective historical trajectories without fully considering the heterogeneity\nof driving behaviors. Therefore, this study proposes a trajectory prediction\nframework that combines Mixture Density Networks (MDN) and considers the\ndriving heterogeneity to provide probabilistic and personalized predictions.\nSpecifically, based on a certain length of historical trajectory data, the\nsituation-specific driving preferences of each driver are identified, where key\ndriving behavior feature vectors are extracted to characterize heterogeneity in\ndriving behavior among different drivers. With the inputs of the short-term\nhistorical trajectory data and key driving behavior feature vectors, a\nprobabilistic LSTMMD-DBV model combined with LSTM-based encoder-decoder\nnetworks and MDN layers is utilized to carry out personalized predictions.\nFinally, the SHapley Additive exPlanations (SHAP) method is employed to\ninterpret the trained model for predictions. The proposed framework is tested\nbased on a wide-range vehicle trajectory dataset. The results indicate that the\nproposed model can generate probabilistic future trajectories with remarkably\nimproved predictions compared to existing benchmark models. Moreover, the\nresults confirm that the additional input of driving behavior feature vectors\nrepresenting the heterogeneity of driving behavior could provide more\ninformation and thus contribute to improving the prediction accuracy.\n","authors":["Shuli Wang","Kun Gao","Lanfang Zhang","Yang Liu","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2312.12123v1.pdf","comment":"14 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.12115v1","updated":"2023-12-19T12:46:22Z","published":"2023-12-19T12:46:22Z","title":"Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor\n Selection","summary":" Machine learning techniques, such as deep learning and ensemble methods, are\nwidely used in various domains due to their ability to handle complex\nreal-world tasks. However, their black-box nature has raised multiple concerns\nabout the fairness, trustworthiness, and transparency of computer-assisted\ndecision-making. This has led to the emergence of local post-hoc explainability\nmethods, which offer explanations for individual decisions made by black-box\nalgorithms. Among these methods, Kernel SHAP is widely used due to its\nmodel-agnostic nature and its well-founded theoretical framework. Despite these\nstrengths, Kernel SHAP suffers from high instability: different executions of\nthe method with the same inputs can lead to significantly different\nexplanations, which diminishes the utility of post-hoc explainability. The\ncontribution of this paper is two-fold. On the one hand, we show that Kernel\nSHAP's instability is caused by its stochastic neighbor selection procedure,\nwhich we adapt to achieve full stability without compromising explanation\nfidelity. On the other hand, we show that by restricting the neighbors\ngeneration to perturbations of size 1 -- which we call the coalitions of Layer\n1 -- we obtain a novel feature-attribution method that is fully stable,\nefficient to compute, and still meaningful.\n","authors":["Gwladys Kelodjou","Laurence Rozé","Véronique Masson","Luis Galárraga","Romaric Gaudel","Maurice Tchuente","Alexandre Termier"],"pdf_url":"https://arxiv.org/pdf/2312.12115v1.pdf","comment":"To appear in AAAI-24"},{"id":"http://arxiv.org/abs/2303.16532v2","updated":"2023-12-19T12:43:34Z","published":"2023-03-29T08:39:36Z","title":"Futures Quantitative Investment with Heterogeneous Continual Graph\n Neural Network","summary":" This study aims to address the challenges of futures price prediction in\nhigh-frequency trading (HFT) by proposing a continuous learning factor\npredictor based on graph neural networks. The model integrates multi-factor\npricing theories with real-time market dynamics, effectively bypassing the\nlimitations of existing methods that lack financial theory guidance and ignore\nvarious trend signals and their interactions. We propose three heterogeneous\ntasks, including price moving average regression, price gap regression and\nchange-point detection to trace the short-, intermediate-, and long-term trend\nfactors present in the data. In addition, this study also considers the\ncross-sectional correlation characteristics of future contracts, where prices\nof different futures often show strong dynamic correlations. Each variable\n(future contract) depends not only on its historical values (temporal) but also\non the observation of other variables (cross-sectional). To capture these\ndynamic relationships more accurately, we resort to the spatio-temporal graph\nneural network (STGNN) to enhance the predictive power of the model. The model\nemploys a continuous learning strategy to simultaneously consider these tasks\n(factors). Additionally, due to the heterogeneity of the tasks, we propose to\ncalculate parameter importance with mutual information between original\nobservations and the extracted features to mitigate the catastrophic forgetting\n(CF) problem. Empirical tests on 49 commodity futures in China's futures market\ndemonstrate that the proposed model outperforms other state-of-the-art models\nin terms of prediction accuracy. Not only does this research promote the\nintegration of financial theory and deep learning, but it also provides a\nscientific basis for actual trading decisions.\n","authors":["Min Hu","Zhizhong Tan","Bin Liu","Guosheng Yin"],"pdf_url":"https://arxiv.org/pdf/2303.16532v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12113v1","updated":"2023-12-19T12:36:39Z","published":"2023-12-19T12:36:39Z","title":"Variational Mode Decomposition-Based Nonstationary Coherent Structure\n Analysis for Spatiotemporal Data","summary":" The modal analysis techniques face difficulties in handling nonstationary\nphenomena. This paper presents a variational mode decomposition-based\nnonstationary coherent structure (VMD-NCS) analysis that enables the extraction\nand analysis of coherent structures in case of nonstationary phenomena from\nhigh-dimensional spatiotemporal data. The VMD-NCS analysis decomposes the input\nspatiotemporal data into intrinsic coherent structures (ICSs) that represent\nnonstationary spatiotemporal patterns and exhibit coherence in both the spatial\nand temporal directions. Furthermore, unlike many conventional modal analysis\ntechniques, the proposed method accounts for the temporal changes in the\nspatial distribution with time. The performance of the VMD-NCS analysis was\nvalidated based on the transient growth phenomena in the flow around a\ncylinder. It was confirmed that the temporal changes in the spatial\ndistribution, depicting the transient growth of vortex shedding where\nfluctuations arising in the far-wake region gradually approach the near-wake\nregion, were represented as a single ICS. Further, in the analysis of the\nquasi-periodic flow field around a pitching airfoil, the temporal changes in\nthe spatial distribution and the amplitude of vortex shedding behind the\nairfoil, influenced by the pitching motion of the airfoil, were captured as a\nsingle ICS. Additionally, the impact of two parameters, adjusting the number of\nICSs ($K$) and the penalty factor related to the temporal coherence ($\\alpha$),\nwas investigated. The results revealed that $K$ has a significant impact on the\nVMD-NCS analysis results. In the case of a relatively high $K$, the VMD-NCS\nanalysis tends to extract more periodic spatiotemporal patterns resembling the\nresults of dynamic mode decomposition, whereas in the case of a small $K$, the\nanalysis tends to extract more nonstationary spatiotemporal patterns.\n","authors":["Yuya Ohmichi"],"pdf_url":"https://arxiv.org/pdf/2312.12113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12112v1","updated":"2023-12-19T12:34:46Z","published":"2023-12-19T12:34:46Z","title":"Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation\n in ultra low-data regimes","summary":" Machine Learning (ML) in low-data settings remains an underappreciated yet\ncrucial problem. This challenge is pronounced in low-to-middle income countries\nwhere access to large datasets is often limited or even absent. Hence, data\naugmentation methods to increase the sample size of datasets needed for ML are\nkey to unlocking the transformative potential of ML in data-deprived regions\nand domains. Unfortunately, the limited training set constrains traditional\ntabular synthetic data generators in their ability to generate a large and\ndiverse augmented dataset needed for ML tasks. To address this technical\nchallenge, we introduce CLLM, which leverages the prior knowledge of Large\nLanguage Models (LLMs) for data augmentation in the low-data regime. While\ndiverse, not all the data generated by LLMs will help increase utility for a\ndownstream task, as for any generative model. Consequently, we introduce a\nprincipled curation process, leveraging learning dynamics, coupled with\nconfidence and uncertainty metrics, to obtain a high-quality dataset.\nEmpirically, on multiple real-world datasets, we demonstrate the superior\nperformance of LLMs in the low-data regime compared to conventional generators.\nWe further show our curation mechanism improves the downstream performance for\nall generators, including LLMs. Additionally, we provide insights and\nunderstanding into the LLM generation and curation mechanism, shedding light on\nthe features that enable them to output high-quality augmented datasets. CLLM\npaves the way for wider usage of ML in data scarce domains and regions, by\nallying the strengths of LLMs with a robust data-centric approach.\n","authors":["Nabeel Seedat","Nicolas Huynh","Boris van Breugel","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2312.12112v1.pdf","comment":"*Seedat & Huynh contributed equally"},{"id":"http://arxiv.org/abs/2310.18313v2","updated":"2023-12-19T12:27:58Z","published":"2023-10-27T17:59:51Z","title":"FP8-LM: Training FP8 Large Language Models","summary":" In this paper, we explore FP8 low-bit data formats for efficient training of\nlarge language models (LLMs). Our key insight is that most variables, such as\ngradients and optimizer states, in LLM training can employ low-precision data\nformats without compromising model accuracy and requiring no changes to\nhyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision\nframework for training LLMs. This framework offers three levels of FP8\nutilization to streamline mixed-precision and distributed parallel training for\nLLMs. It gradually incorporates 8-bit gradients, optimizer states, and\ndistributed learning in an incremental manner. Experiment results show that,\nduring the training of GPT-175B model on H100 GPU platform, our FP8\nmixed-precision training framework not only achieved a remarkable 39% reduction\nin real memory usage but also ran 75% faster than the widely adopted BF16\nframework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer\nEngine by 37%. This largely reduces the training costs for large foundation\nmodels. Furthermore, our FP8 mixed-precision training methodology is generic.\nIt can be seamlessly applied to other tasks such as LLM instruction tuning and\nreinforcement learning with human feedback, offering savings in fine-tuning\nexpenses. Our FP8 low-precision training framework is open-sourced at\n{https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.\n","authors":["Houwen Peng","Kan Wu","Yixuan Wei","Guoshuai Zhao","Yuxiang Yang","Ze Liu","Yifan Xiong","Ziyue Yang","Bolin Ni","Jingcheng Hu","Ruihang Li","Miaosen Zhang","Chen Li","Jia Ning","Ruizhe Wang","Zheng Zhang","Shuguang Liu","Joe Chau","Han Hu","Peng Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.18313v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12102v1","updated":"2023-12-19T12:26:57Z","published":"2023-12-19T12:26:57Z","title":"I-CEE: Tailoring Explanations of Image Classifications Models to User\n Expertise","summary":" Effectively explaining decisions of black-box machine learning models is\ncritical to responsible deployment of AI systems that rely on them. Recognizing\ntheir importance, the field of explainable AI (XAI) provides several techniques\nto generate these explanations. Yet, there is relatively little emphasis on the\nuser (the explainee) in this growing body of work and most XAI techniques\ngenerate \"one-size-fits-all\" explanations. To bridge this gap and achieve a\nstep closer towards human-centered XAI, we present I-CEE, a framework that\nprovides Image Classification Explanations tailored to User Expertise. Informed\nby existing work, I-CEE explains the decisions of image classification models\nby providing the user with an informative subset of training data (i.e.,\nexample images), corresponding local explanations, and model decisions.\nHowever, unlike prior work, I-CEE models the informativeness of the example\nimages to depend on user expertise, resulting in different examples for\ndifferent users. We posit that by tailoring the example set to user expertise,\nI-CEE can better facilitate users' understanding and simulatability of the\nmodel. To evaluate our approach, we conduct detailed experiments in both\nsimulation and with human participants (N = 100) on multiple datasets.\nExperiments with simulated users show that I-CEE improves users' ability to\naccurately predict the model's decisions (simulatability) compared to\nbaselines, providing promising preliminary results. Experiments with human\nparticipants demonstrate that our method significantly improves user\nsimulatability accuracy, highlighting the importance of human-centered XAI\n","authors":["Yao Rong","Peizhu Qian","Vaibhav Unhelkar","Enkelejda Kasneci"],"pdf_url":"https://arxiv.org/pdf/2312.12102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.06453v2","updated":"2023-12-19T12:13:25Z","published":"2023-09-12T08:16:58Z","title":"Narrowing the Gap between Supervised and Unsupervised Sentence\n Representation Learning with Large Language Model","summary":" Sentence Representation Learning (SRL) is a fundamental task in Natural\nLanguage Processing (NLP), with the Contrastive Learning of Sentence Embeddings\n(CSE) being the mainstream technique due to its superior performance. An\nintriguing phenomenon in CSE is the significant performance gap between\nsupervised and unsupervised methods, with their only difference lying in the\ntraining data. Previous works attribute this performance gap to differences in\ntwo representation properties (alignment and uniformity). However, since\nalignment and uniformity only measure the results, they fail to answer \"What\naspects of the training data contribute to the performance gap?\" and \"How can\nthe performance gap be narrowed?\", In this paper, we conduct empirical\nexperiments to answer these \"What\" and \"How\" questions. We first answer the\n\"What\" question by thoroughly comparing the behavior of supervised and\nunsupervised CSE during their respective training processes. From the\ncomparison, we identify the similarity pattern as a key factor to the\nperformance gap, and introduce a metric, called Relative Fitting Difficulty\n(RFD), to measure the complexity of the similarity pattern. Then, based on the\ninsights gained from the \"What\" question, we tackle the \"How\" question by\nincreasing the pattern complexity of the training data. We achieve this by\nleveraging the In-Context Learning (ICL) capability of the Large Language Model\n(LLM) to generate data that simulates complex patterns. By utilizing the\nhierarchical patterns in the LLM-generated data, we effectively narrow the gap\nbetween supervised and unsupervised CSE. We release our codes and appendix at\nhttps://github.com/BDBC-KG-NLP/NGCSE.\n","authors":["Mingxin Li","Richong Zhang","Zhijie Nie","Yongyi Mao"],"pdf_url":"https://arxiv.org/pdf/2309.06453v2.pdf","comment":"Accepted at AAAI24"},{"id":"http://arxiv.org/abs/2303.16737v2","updated":"2023-12-19T11:55:14Z","published":"2023-03-29T14:41:03Z","title":"Multi-Agent Reinforcement Learning with Action Masking for UAV-enabled\n Mobile Communications","summary":" Unmanned Aerial Vehicles (UAVs) are increasingly used as aerial base stations\nto provide ad hoc communications infrastructure. Building upon prior research\nefforts which consider either static nodes, 2D trajectories or single UAV\nsystems, this paper focuses on the use of multiple UAVs for providing wireless\ncommunication to mobile users in the absence of terrestrial communications\ninfrastructure. In particular, we jointly optimize UAV 3D trajectory and NOMA\npower allocation to maximize system throughput. Firstly, a weighted\nK-means-based clustering algorithm establishes UAV-user associations at regular\nintervals. The efficacy of training a novel Shared Deep Q-Network (SDQN) with\naction masking is then explored. Unlike training each UAV separately using DQN,\nthe SDQN reduces training time by using the experiences of multiple UAVs\ninstead of a single agent. We also show that SDQN can be used to train a\nmulti-agent system with differing action spaces. Simulation results confirm\nthat: 1) training a shared DQN outperforms a conventional DQN in terms of\nmaximum system throughput (+20%) and training time (-10%); 2) it can converge\nfor agents with different action spaces, yielding a 9% increase in throughput\ncompared to mutual learning algorithms; and 3) combining NOMA with an SDQN\narchitecture enables the network to achieve a better sum rate compared with\nexisting baseline schemes.\n","authors":["Danish Rizvi","David Boyle"],"pdf_url":"https://arxiv.org/pdf/2303.16737v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12068v1","updated":"2023-12-19T11:36:03Z","published":"2023-12-19T11:36:03Z","title":"PICNN: A Pathway towards Interpretable Convolutional Neural Networks","summary":" Convolutional Neural Networks (CNNs) have exhibited great performance in\ndiscriminative feature learning for complex visual tasks. Besides\ndiscrimination power, interpretability is another important yet under-explored\nproperty for CNNs. One difficulty in the CNN interpretability is that filters\nand image classes are entangled. In this paper, we introduce a novel pathway to\nalleviate the entanglement between filters and image classes. The proposed\npathway groups the filters in a late conv-layer of CNN into class-specific\nclusters. Clusters and classes are in a one-to-one relationship. Specifically,\nwe use the Bernoulli sampling to generate the filter-cluster assignment matrix\nfrom a learnable filter-class correspondence matrix. To enable end-to-end\noptimization, we develop a novel reparameterization trick for handling the\nnon-differentiable Bernoulli sampling. We evaluate the effectiveness of our\nmethod on ten widely used network architectures (including nine CNNs and a ViT)\nand five benchmark datasets. Experimental results have demonstrated that our\nmethod PICNN (the combination of standard CNNs with our proposed pathway)\nexhibits greater interpretability than standard CNNs while achieving higher or\ncomparable discrimination power.\n","authors":["Wengang Guo","Jiayi Yang","Huilin Yin","Qijun Chen","Wei Ye"],"pdf_url":"https://arxiv.org/pdf/2312.12068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12067v1","updated":"2023-12-19T11:34:10Z","published":"2023-12-19T11:34:10Z","title":"Optimistic Policy Gradient in Multi-Player Markov Games with a Single\n Controller: Convergence Beyond the Minty Property","summary":" Policy gradient methods enjoy strong practical performance in numerous tasks\nin reinforcement learning. Their theoretical understanding in multiagent\nsettings, however, remains limited, especially beyond two-player competitive\nand potential Markov games. In this paper, we develop a new framework to\ncharacterize optimistic policy gradient methods in multi-player Markov games\nwith a single controller. Specifically, under the further assumption that the\ngame exhibits an equilibrium collapse, in that the marginals of coarse\ncorrelated equilibria (CCE) induce Nash equilibria (NE), we show convergence to\nstationary $\\epsilon$-NE in $O(1/\\epsilon^2)$ iterations, where $O(\\cdot)$\nsuppresses polynomial factors in the natural parameters of the game. Such an\nequilibrium collapse is well-known to manifest itself in two-player zero-sum\nMarkov games, but also occurs even in a class of multi-player Markov games with\nseparable interactions, as established by recent work. As a result, we bypass\nknown complexity barriers for computing stationary NE when either of our\nassumptions fails. Our approach relies on a natural generalization of the\nclassical Minty property that we introduce, which we anticipate to have further\napplications beyond Markov games.\n","authors":["Ioannis Anagnostides","Ioannis Panageas","Gabriele Farina","Tuomas Sandholm"],"pdf_url":"https://arxiv.org/pdf/2312.12067v1.pdf","comment":"To appear at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12065v1","updated":"2023-12-19T11:33:18Z","published":"2023-12-19T11:33:18Z","title":"PPO-Clip Attains Global Optimality: Towards Deeper Understandings of\n Clipping","summary":" Proximal Policy Optimization algorithm employing a clipped surrogate\nobjective (PPO-Clip) is a prominent exemplar of the policy optimization\nmethods. However, despite its remarkable empirical success, PPO-Clip lacks\ntheoretical substantiation to date. In this paper, we contribute to the field\nby establishing the first global convergence results of a PPO-Clip variant in\nboth tabular and neural function approximation settings. Our findings highlight\nthe $O(1/\\sqrt{T})$ min-iterate convergence rate specifically in the context of\nneural function approximation. We tackle the inherent challenges in analyzing\nPPO-Clip through three central concepts: (i) We introduce a generalized version\nof the PPO-Clip objective, illuminated by its connection with the hinge loss.\n(ii) Employing entropic mirror descent, we establish asymptotic convergence for\ntabular PPO-Clip with direct policy parameterization. (iii) Inspired by the\ntabular analysis, we streamline convergence analysis by introducing a two-step\npolicy improvement approach. This decouples policy search from complex neural\npolicy parameterization using a regression-based update scheme. Furthermore, we\ngain deeper insights into the efficacy of PPO-Clip by interpreting these\ngeneralized objectives. Our theoretical findings also mark the first\ncharacterization of the influence of the clipping mechanism on PPO-Clip\nconvergence. Importantly, the clipping range affects only the pre-constant of\nthe convergence rate.\n","authors":["Nai-Chieh Huang","Ping-Chun Hsieh","Kuo-Hao Ho","I-Chen Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12050v1","updated":"2023-12-19T11:14:37Z","published":"2023-12-19T11:14:37Z","title":"Extension of the Dip-test Repertoire -- Efficient and Differentiable\n p-value Calculation for Clustering","summary":" Over the last decade, the Dip-test of unimodality has gained increasing\ninterest in the data mining community as it is a parameter-free statistical\ntest that reliably rates the modality in one-dimensional samples. It returns a\nso called Dip-value and a corresponding probability for the sample's\nunimodality (Dip-p-value). These two values share a sigmoidal relationship.\nHowever, the specific transformation is dependent on the sample size. Many\nDip-based clustering algorithms use bootstrapped look-up tables translating\nDip- to Dip-p-values for a certain limited amount of sample sizes. We propose a\nspecifically designed sigmoid function as a substitute for these\nstate-of-the-art look-up tables. This accelerates computation and provides an\napproximation of the Dip- to Dip-p-value transformation for every single sample\nsize. Further, it is differentiable and can therefore easily be integrated in\nlearning schemes using gradient descent. We showcase this by exploiting our\nfunction in a novel subspace clustering algorithm called Dip'n'Sub. We\nhighlight in extensive experiments the various benefits of our proposal.\n","authors":["Lena G. M. Bauer","Collin Leiber","Christian Böhm","Claudia Plant"],"pdf_url":"https://arxiv.org/pdf/2312.12050v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12049v1","updated":"2023-12-19T11:11:03Z","published":"2023-12-19T11:11:03Z","title":"EncryIP: A Practical Encryption-Based Framework for Model Intellectual\n Property Protection","summary":" In the rapidly growing digital economy, protecting intellectual property (IP)\nassociated with digital products has become increasingly important. Within this\ncontext, machine learning (ML) models, being highly valuable digital assets,\nhave gained significant attention for IP protection. This paper introduces a\npractical encryption-based framework called \\textit{EncryIP}, which seamlessly\nintegrates a public-key encryption scheme into the model learning process. This\napproach enables the protected model to generate randomized and confused\nlabels, ensuring that only individuals with accurate secret keys, signifying\nauthorized users, can decrypt and reveal authentic labels. Importantly, the\nproposed framework not only facilitates the protected model to multiple\nauthorized users without requiring repetitive training of the original ML model\nwith IP protection methods but also maintains the model's performance without\ncompromising its accuracy. Compared to existing methods like watermark-based,\ntrigger-based, and passport-based approaches, \\textit{EncryIP} demonstrates\nsuperior effectiveness in both training protected models and efficiently\ndetecting the unauthorized spread of ML models.\n","authors":["Xin Mu","Yu Wang","Zhengan Huang","Junzuo Lai","Yehong Zhang","Hui Wang","Yue Yu"],"pdf_url":"https://arxiv.org/pdf/2312.12049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.16058v2","updated":"2023-12-19T11:00:18Z","published":"2022-10-28T11:11:04Z","title":"Goal Exploration Augmentation via Pre-trained Skills for Sparse-Reward\n Long-Horizon Goal-Conditioned Reinforcement Learning","summary":" Reinforcement learning (RL) often struggles to accomplish a sparse-reward\nlong-horizon task in a complex environment. Goal-conditioned reinforcement\nlearning (GCRL) has been employed to tackle this difficult problem via a\ncurriculum of easy-to-reach sub-goals. In GCRL, exploring novel sub-goals is\nessential for the agent to ultimately find the pathway to the desired goal. How\nto explore novel sub-goals efficiently is one of the most challenging issues in\nGCRL. Several goal exploration methods have been proposed to address this issue\nbut still struggle to find the desired goals efficiently. In this paper, we\npropose a novel learning objective by optimizing the entropy of both achieved\nand new goals to be explored for more efficient goal exploration in sub-goal\nselection based GCRL. To optimize this objective, we first explore and exploit\nthe frequently occurring goal-transition patterns mined in the environments\nsimilar to the current task to compose skills via skill learning. Then, the\npretrained skills are applied in goal exploration. Evaluation on a variety of\nspare-reward long-horizon benchmark tasks suggests that incorporating our\nmethod into several state-of-the-art GCRL baselines significantly boosts their\nexploration efficiency while improving or maintaining their performance. The\nsource code is available at: https://github.com/GEAPS/GEAPS.\n","authors":["Lisheng Wu","Ke Chen"],"pdf_url":"https://arxiv.org/pdf/2210.16058v2.pdf","comment":"Accepted for publication in Machine Learning (Springer): 35 pages, 15\n figures"},{"id":"http://arxiv.org/abs/2312.12044v1","updated":"2023-12-19T10:57:12Z","published":"2023-12-19T10:57:12Z","title":"XLand-MiniGrid: Scalable Meta-Reinforcement Learning Environments in JAX","summary":" We present XLand-MiniGrid, a suite of tools and grid-world environments for\nmeta-reinforcement learning research inspired by the diversity and depth of\nXLand and the simplicity and minimalism of MiniGrid. XLand-Minigrid is written\nin JAX, designed to be highly scalable, and can potentially run on GPU or TPU\naccelerators, democratizing large-scale experimentation with limited resources.\nTo demonstrate the generality of our library, we have implemented some\nwell-known single-task environments as well as new meta-learning environments\ncapable of generating $10^8$ distinct tasks. We have empirically shown that the\nproposed environments can scale up to $2^{13}$ parallel instances on the GPU,\nreaching tens of millions of steps per second.\n","authors":["Alexander Nikulin","Vladislav Kurenkov","Ilya Zisman","Artem Agarkov","Viacheslav Sinii","Sergey Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2312.12044v1.pdf","comment":"NeurIPS 2023, Workshop, Source code:\n https://github.com/corl-team/xland-minigrid"},{"id":"http://arxiv.org/abs/2306.01843v3","updated":"2023-12-19T10:43:16Z","published":"2023-06-02T18:03:03Z","title":"Lifting Architectural Constraints of Injective Flows","summary":" Normalizing Flows explicitly maximize a full-dimensional likelihood on the\ntraining data. However, real data is typically only supported on a\nlower-dimensional manifold leading the model to expend significant compute on\nmodeling noise. Injective Flows fix this by jointly learning a manifold and the\ndistribution on it. So far, they have been limited by restrictive architectures\nand/or high computational cost. We lift both constraints by a new efficient\nestimator for the maximum likelihood loss, compatible with free-form bottleneck\narchitectures. We further show that naively learning both the data manifold and\nthe distribution on it can lead to divergent solutions, and use this insight to\nmotivate a stable maximum likelihood training objective. We perform extensive\nexperiments on toy, tabular and image data, demonstrating the competitive\nperformance of the resulting model.\n","authors":["Peter Sorrenson","Felix Draxler","Armand Rousselot","Sander Hummerich","Lea Zimmermann","Ullrich Köthe"],"pdf_url":"https://arxiv.org/pdf/2306.01843v3.pdf","comment":"Resubmission of previous work: title and abstract have been changed\n and new content has been added"},{"id":"http://arxiv.org/abs/2312.09783v3","updated":"2023-12-19T10:36:41Z","published":"2023-12-15T13:36:54Z","title":"Keep the Faith: Faithful Explanations in Convolutional Neural Networks\n for Case-Based Reasoning","summary":" Explaining predictions of black-box neural networks is crucial when applied\nto decision-critical tasks. Thus, attribution maps are commonly used to\nidentify important image regions, despite prior work showing that humans prefer\nexplanations based on similar examples. To this end, ProtoPNet learns a set of\nclass-representative feature vectors (prototypes) for case-based reasoning.\nDuring inference, similarities of latent features to prototypes are linearly\nclassified to form predictions and attribution maps are provided to explain the\nsimilarity. In this work, we evaluate whether architectures for case-based\nreasoning fulfill established axioms required for faithful explanations using\nthe example of ProtoPNet. We show that such architectures allow the extraction\nof faithful explanations. However, we prove that the attribution maps used to\nexplain the similarities violate the axioms. We propose a new procedure to\nextract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually,\nthese explanations are Shapley values, calculated on the similarity scores of\neach prototype. They allow to faithfully answer which prototypes are present in\nan unseen image and quantify each pixel's contribution to that presence,\nthereby complying with all axioms. The theoretical violations of ProtoPNet\nmanifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs,\nRSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50,\nResNeXt50). Our experiments show a qualitative difference between the\nexplanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the\nexplanations with the Area Over the Perturbation Curve, on which ProtoPFaith\noutperforms ProtoPNet on all experiments by a factor $>10^3$.\n","authors":["Tom Nuno Wolf","Fabian Bongratz","Anne-Marie Rickmann","Sebastian Pölsterl","Christian Wachinger"],"pdf_url":"https://arxiv.org/pdf/2312.09783v3.pdf","comment":"To be published in proceedings of AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2312.11315v2","updated":"2023-12-19T10:31:08Z","published":"2023-12-18T16:10:18Z","title":"CaRe-CNN: Cascading Refinement CNN for Myocardial Infarct Segmentation\n with Microvascular Obstructions","summary":" Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely\nestablished to assess the viability of myocardial tissue of patients after\nacute myocardial infarction (MI). We propose the Cascading Refinement CNN\n(CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that\nexploits the hierarchical structure of such labeled cardiac data. Throughout\nthe three stages of the cascade, the label definition changes and CaRe-CNN\nlearns to gradually refine its intermediate predictions accordingly.\nFurthermore, to obtain more consistent qualitative predictions, we propose a\nseries of post-processing steps that take anatomical constraints into account.\nOur CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked\nsecond out of 18 participating teams. CaRe-CNN showed great improvements most\nnotably when segmenting the difficult but clinically most relevant myocardial\ninfarct tissue (MIT) as well as microvascular obstructions (MVO). When\ncomputing the average scores over all labels, our method obtained the best\nscore in eight out of ten metrics. Thus, accurate cardiac segmentation after\nacute MI via our CaRe-CNN allows generating patient-specific models of the\nheart serving as an important step towards personalized medicine.\n","authors":["Franz Thaler","Matthias A. F. Gsell","Gernot Plank","Martin Urschler"],"pdf_url":"https://arxiv.org/pdf/2312.11315v2.pdf","comment":"Accepted at VISIGRAPP 2024, 12 pages"},{"id":"http://arxiv.org/abs/2312.12028v1","updated":"2023-12-19T10:29:29Z","published":"2023-12-19T10:29:29Z","title":"EyePreserve: Identity-Preserving Iris Synthesis","summary":" Synthesis of same-identity biometric iris images, both for existing and\nnon-existing identities while preserving the identity across a wide range of\npupil sizes, is complex due to intricate iris muscle constriction mechanism,\nrequiring a precise model of iris non-linear texture deformations to be\nembedded into the synthesis pipeline. This paper presents the first method of\nfully data-driven, identity-preserving, pupil size-varying s ynthesis of iris\nimages. This approach is capable of synthesizing images of irises with\ndifferent pupil sizes representing non-existing identities as well as\nnon-linearly deforming the texture of iris images of existing subjects given\nthe segmentation mask of the target iris image. Iris recognition experiments\nsuggest that the proposed deformation model not only preserves the identity\nwhen changing the pupil size but offers better similarity between same-identity\niris samples with significant differences in pupil size, compared to\nstate-of-the-art linear and non-linear (bio-mechanical-based) iris deformation\nmodels. Two immediate applications of the proposed approach are: (a) synthesis\nof, or enhancement of the existing biometric datasets for iris recognition,\nmimicking those acquired with iris sensors, and (b) helping forensic human\nexperts in examining iris image pairs with significant differences in pupil\ndilation. Source codes and weights of the models are made available with the\npaper.\n","authors":["Siamul Karim Khan","Patrick Tinsley","Mahsa Mitcheff","Patrick Flynn","Kevin W. Bowyer","Adam Czajka"],"pdf_url":"https://arxiv.org/pdf/2312.12028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12022v1","updated":"2023-12-19T10:18:57Z","published":"2023-12-19T10:18:57Z","title":"LightGCNet: A Lightweight Geometric Constructive Neural Network for\n Data-Driven Soft sensors","summary":" Data-driven soft sensors provide a potentially cost-effective and more\naccurate modeling approach to measure difficult-to-measure indices in\nindustrial processes compared to mechanistic approaches. Artificial\nintelligence (AI) techniques, such as deep learning, have become a popular soft\nsensors modeling approach in the area of machine learning and big data.\nHowever, soft sensors models based deep learning potentially lead to complex\nmodel structures and excessive training time. In addition, industrial processes\noften rely on distributed control systems (DCS) characterized by resource\nconstraints. Herein, guided by spatial geometric, a lightweight geometric\nconstructive neural network, namely LightGCNet, is proposed, which utilizes\ncompact angle constraint to assign the hidden parameters from dynamic\nintervals. At the same time, a node pool strategy and spatial geometric\nrelationships are used to visualize and optimize the process of assigning\nhidden parameters, enhancing interpretability. In addition, the universal\napproximation property of LightGCNet is proved by spatial geometric analysis.\nTwo versions algorithmic implementations of LightGCNet are presented in this\narticle. Simulation results concerning both benchmark datasets and the ore\ngrinding process indicate remarkable merits of LightGCNet in terms of small\nnetwork size, fast learning speed, and sound generalization.\n","authors":["Jing Nan","Yan Qin","Wei Dai","Chau Yuen"],"pdf_url":"https://arxiv.org/pdf/2312.12022v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2307.00185"},{"id":"http://arxiv.org/abs/2310.05161v4","updated":"2023-12-19T10:13:33Z","published":"2023-10-08T13:36:05Z","title":"Recurrent Neural Language Models as Probabilistic Finite-state Automata","summary":" Studying language models (LMs) in terms of well-understood formalisms allows\nus to precisely characterize their abilities and limitations. Previous work has\ninvestigated the representational capacity of recurrent neural network (RNN)\nLMs in terms of their capacity to recognize unweighted formal languages.\nHowever, LMs do not describe unweighted formal languages -- rather, they define\n\\emph{probability distributions} over strings. In this work, we study what\nclasses of such probability distributions RNN LMs can represent, which allows\nus to make more direct statements about their capabilities. We show that simple\nRNNs are equivalent to a subclass of probabilistic finite-state automata, and\ncan thus model a strict subset of probability distributions expressible by\nfinite-state models. Furthermore, we study the space complexity of representing\nfinite-state LMs with RNNs. We show that, to represent an arbitrary\ndeterministic finite-state LM with $N$ states over an alphabet $\\alphabet$, an\nRNN requires $\\Omega\\left(N |\\Sigma|\\right)$ neurons. These results present a\nfirst step towards characterizing the classes of distributions RNN LMs can\nrepresent and thus help us understand their capabilities and limitations.\n","authors":["Anej Svete","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.05161v4.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2312.12009v1","updated":"2023-12-19T09:58:54Z","published":"2023-12-19T09:58:54Z","title":"Active Preference Inference using Language Models and Probabilistic\n Reasoning","summary":" Actively inferring user preferences, for example by asking good questions, is\nimportant for any human-facing decision-making system. Active inference allows\nsuch systems to adapt and personalize themselves to nuanced individual\npreferences. To enable this ability for instruction-tuned large language models\n(LLMs), one may prompt them to ask users questions to infer their preferences,\ntransforming the language models into more robust, interactive systems.\nHowever, out of the box, these models are not efficient at extracting\npreferences: the questions they generate are not informative, requiring a high\nnumber of user interactions and impeding the usability of the downstream\nsystem. In this work, we introduce an inference-time algorithm that helps LLMs\nquickly infer preferences by using more informative questions. Our algorithm\nuses a probabilistic model whose conditional distributions are defined by\nprompting an LLM, and returns questions that optimize expected entropy and\nexpected model change. Results in a simplified interactive web shopping setting\nwith real product items show that an LLM equipped with our entropy reduction\nalgorithm outperforms baselines with the same underlying LLM on task\nperformance while using fewer user interactions.\n","authors":["Top Piriyakulkij","Volodymyr Kuleshov","Kevin Ellis"],"pdf_url":"https://arxiv.org/pdf/2312.12009v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12003v1","updated":"2023-12-19T09:51:02Z","published":"2023-12-19T09:51:02Z","title":"Modelling and characterization of fine Particulate Matter dynamics in\n Bujumbura using low cost sensors","summary":" Air pollution is a result of multiple sources including both natural and\nanthropogenic activities. The rapid urbanization of the cities such as\nBujumbura economic capital of Burundi, is one of these factors. The very first\ncharacterization of the spatio-temporal variability of PM2.5 in Bujumbura and\nthe forecasting of PM2.5 concentration have been conducted in this paper using\ndata collected during a year, from august 2022 to august 2023, by low cost\nsensors installed in Bujumbura city. For each commune, an hourly, daily and\nseasonal analysis were carried out and the results showed that the mass\nconcentrations of PM2.5 in the three municipalities differ from one commune to\nanother. The average hourly and annual PM2.5 concentrations exceed the World\nHealth Organization standards. The range is between 28.3 and 35.0 microgram/m3\n. In order to make prediction of PM2.5 concentration, an investigation of RNN\nwith Long Short Term Memory (LSTM) has been undertaken.\n","authors":["Egide Ndamuzi","Rachel Akimana","Paterne Gahungu","Elie Bimenyimana"],"pdf_url":"https://arxiv.org/pdf/2312.12003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.01190v3","updated":"2023-12-19T09:45:36Z","published":"2023-02-02T16:16:25Z","title":"On the Efficacy of Differentially Private Few-shot Image Classification","summary":" There has been significant recent progress in training differentially private\n(DP) models which achieve accuracy that approaches the best non-private models.\nThese DP models are typically pretrained on large public datasets and then\nfine-tuned on private downstream datasets that are relatively large and similar\nin distribution to the pretraining data. However, in many applications\nincluding personalization and federated learning, it is crucial to perform well\n(i) in the few-shot setting, as obtaining large amounts of labeled data may be\nproblematic; and (ii) on datasets from a wide variety of domains for use in\nvarious specialist settings. To understand under which conditions few-shot DP\ncan be effective, we perform an exhaustive set of experiments that reveals how\nthe accuracy and vulnerability to attack of few-shot DP image classification\nmodels are affected as the number of shots per class, privacy level, model\narchitecture, downstream dataset, and subset of learnable parameters in the\nmodel vary. We show that to achieve DP accuracy on par with non-private models,\nthe shots per class must be increased as the privacy level increases. We also\nshow that learning parameter-efficient FiLM adapters under DP is competitive\nwith learning just the final classifier layer or learning all of the network\nparameters. Finally, we evaluate DP federated learning systems and establish\nstate-of-the-art performance on the challenging FLAIR benchmark.\n","authors":["Marlon Tobaben","Aliaksandra Shysheya","John Bronskill","Andrew Paverd","Shruti Tople","Santiago Zanella-Beguelin","Richard E Turner","Antti Honkela"],"pdf_url":"https://arxiv.org/pdf/2302.01190v3.pdf","comment":"49 pages, 24 figures; published in TMLR 12/2023\n https://openreview.net/forum?id=hFsr59Imzm"},{"id":"http://arxiv.org/abs/2305.16901v2","updated":"2023-12-19T09:41:25Z","published":"2023-05-26T13:14:05Z","title":"Generalizing Adam to Manifolds for Efficiently Training Transformers","summary":" One of the primary reasons behind the success of neural networks has been the\nemergence of an array of new, highly-successful optimizers, perhaps most\nimportantly the Adam optimizer. It is wiedely used for training neural\nnetworks, yet notoriously hard to interpret. Lacking a clear physical\nintuition, Adam is difficult to generalize to manifolds. Some attempts have\nbeen made to directly apply parts of the Adam algorithm to manifolds or to find\nan underlying structure, but a full generalization has remained elusive. In\nthis work a new approach is presented that leverages the special structure of\nthe manifolds which are relevant for optimization of neural networks, such as\nthe Stiefel manifold, the symplectic Stiefel manifold, the Grassmann manifold\nand the symplectic Grassmann manifold: all of these are homogeneous spaces and\nas such admit a global tangent space representation. This global tangent space\nrepresentation is used to perform all of the steps in the Adam optimizer. The\nresulting algorithm is then applied to train a transformer for which\northogonality constraints are enforced up to machine precision and we observe\nsignificant speed-ups in the training process. Optimization of neural networks\nwhere they weights do not lie on a manifold is identified as a special case of\nthe presented framkework. This allows for a flexible implementation in which\nthe learning rate is adapted simultaneously for all parameters, irrespective of\nwhether they are an element of a general manifold or a vector space.\n","authors":["Benedikt Brantner"],"pdf_url":"https://arxiv.org/pdf/2305.16901v2.pdf","comment":"19 pages, 4 figures, was presented at Enumath2023"},{"id":"http://arxiv.org/abs/2210.07780v3","updated":"2023-12-19T09:31:06Z","published":"2022-10-14T13:09:11Z","title":"Federated Best Arm Identification with Heterogeneous Clients","summary":" We study best arm identification in a federated multi-armed bandit setting\nwith a central server and multiple clients, when each client has access to a\n{\\em subset} of arms and each arm yields independent Gaussian observations. The\ngoal is to identify the best arm of each client subject to an upper bound on\nthe error probability; here, the best arm is one that has the largest {\\em\naverage} value of the means averaged across all clients having access to the\narm. Our interest is in the asymptotics as the error probability vanishes. We\nprovide an asymptotic lower bound on the growth rate of the expected stopping\ntime of any algorithm. Furthermore, we show that for any algorithm whose upper\nbound on the expected stopping time matches with the lower bound up to a\nmultiplicative constant ({\\em almost-optimal} algorithm), the ratio of any two\nconsecutive communication time instants must be {\\em bounded}, a result that is\nof independent interest. We thereby infer that an algorithm can communicate no\nmore sparsely than at exponential time instants in order to be almost-optimal.\nFor the class of almost-optimal algorithms, we present the first-of-its-kind\nasymptotic lower bound on the expected number of {\\em communication rounds}\nuntil stoppage. We propose a novel algorithm that communicates at exponential\ntime instants, and demonstrate that it is asymptotically almost-optimal.\n","authors":["Zhirui Chen","P. N. Karthik","Vincent Y. F. Tan","Yeow Meng Chee"],"pdf_url":"https://arxiv.org/pdf/2210.07780v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09010v4","updated":"2023-12-19T09:18:30Z","published":"2022-12-18T04:44:38Z","title":"Risk-Sensitive Reinforcement Learning with Exponential Criteria","summary":" While reinforcement learning has shown experimental success in a number of\napplications, it is known to be sensitive to noise and perturbations in the\nparameters of the system, leading to high variance in the total reward amongst\ndifferent episodes in slightly different environments. To introduce robustness,\nas well as sample efficiency, risk-sensitive reinforcement learning methods are\nbeing thoroughly studied. In this work, we provide a definition of robust\nreinforcement learning policies and formulate a risk-sensitive reinforcement\nlearning problem to approximate them, by solving an optimization problem with\nrespect to a modified objective based on exponential criteria. In particular,\nwe study a model-free risk-sensitive variation of the widely-used Monte Carlo\nPolicy Gradient algorithm and introduce a novel risk-sensitive online\nActor-Critic algorithm based on solving a multiplicative Bellman equation using\nstochastic approximation updates. Analytical results suggest that the use of\nexponential criteria generalizes commonly used ad-hoc regularization\napproaches, improves sample efficiency, and introduces robustness with respect\nto perturbations in the model parameters and the environment. The\nimplementation, performance, and robustness properties of the proposed methods\nare evaluated in simulated experiments.\n","authors":["Erfaun Noorani","Christos Mavridis","John Baras"],"pdf_url":"https://arxiv.org/pdf/2212.09010v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11976v1","updated":"2023-12-19T09:18:12Z","published":"2023-12-19T09:18:12Z","title":"When Model Meets New Normals: Test-time Adaptation for Unsupervised\n Time-series Anomaly Detection","summary":" Time-series anomaly detection deals with the problem of detecting anomalous\ntimesteps by learning normality from the sequence of observations. However, the\nconcept of normality evolves over time, leading to a \"new normal problem\",\nwhere the distribution of normality can be changed due to the distribution\nshifts between training and test data. This paper highlights the prevalence of\nthe new normal problem in unsupervised time-series anomaly detection studies.\nTo tackle this issue, we propose a simple yet effective test-time adaptation\nstrategy based on trend estimation and a self-supervised approach to learning\nnew normalities during inference. Extensive experiments on real-world\nbenchmarks demonstrate that incorporating the proposed strategy into the\nanomaly detector consistently improves the model's performance compared to the\nbaselines, leading to robustness to the distribution shifts.\n","authors":["Dongmin Kim","Sunghyun Park","Jaegul Choo"],"pdf_url":"https://arxiv.org/pdf/2312.11976v1.pdf","comment":"Accepted to AAAI 2024, 17 pages, https://github.com/carrtesy/M2N2"},{"id":"http://arxiv.org/abs/2312.11973v1","updated":"2023-12-19T09:11:49Z","published":"2023-12-19T09:11:49Z","title":"Continual Learning: Forget-free Winning Subnetworks for Video\n Representations","summary":" Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which\nhighlights the presence of competitive subnetworks within dense networks for\ncontinual learning tasks, we introduce Winning Subnetworks (WSN). This approach\nutilizes reused weights in dense networks to enhance learning in Task\nIncremental Learning (TIL) scenarios. To mitigate overfitting in Few-Shot Class\nIncremental Learning (FSCIL), we have developed WSN variants referred to as the\nSoft subnetwork (SoftNet). Furthermore, addressing WSN's limitation of sparse\nreused weights in Video Incremental Learning (VIL), we propose the Fourier\nSubneural Operator (FSO). The FSO, operating in Fourier space, adaptively and\ncompactly encodes videos, discovering reusable subnetworks with diverse\nbandwidths. We have applied FSO's Fourier representations to various continual\nlearning contexts, including VIL, TIL, and FSCIL. Our extensive experiments\nacross these scenarios demonstrate FSO's remarkable efficacy in continual\nlearning, significantly enhancing task performance at various convolutional\nrepresentational levels: it boosts performance in the higher layers for TIL and\nFSCIL and the lower layers for VIL.\n","authors":["Haeyong Kang","Jaehong Yoon","Sung Ju Hwang","Chang D. Yoo"],"pdf_url":"https://arxiv.org/pdf/2312.11973v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2303.14962,\n arXiv:2306.11305"},{"id":"http://arxiv.org/abs/2207.08012v5","updated":"2023-12-19T09:05:55Z","published":"2022-07-16T20:37:46Z","title":"Meta-Referential Games to Learn Compositional Learning Behaviours","summary":" Human beings use compositionality to generalise from past experiences to\nnovel experiences. We assume a separation of our experiences into fundamental\natomic components that can be recombined in novel ways to support our ability\nto engage with novel experiences. We frame this as the ability to learn to\ngeneralise compositionally, and we will refer to behaviours making use of this\nability as compositional learning behaviours (CLBs). A central problem to\nlearning CLBs is the resolution of a binding problem (BP). While it is another\nfeat of intelligence that human beings perform with ease, it is not the case\nfor state-of-the-art artificial agents. Thus, in order to build artificial\nagents able to collaborate with human beings, we propose to develop a novel\nbenchmark to investigate agents' abilities to exhibit CLBs by solving a\ndomain-agnostic version of the BP. We take inspiration from the language\nemergence and grounding framework of referential games and propose a\nmeta-learning extension of referential games, entitled Meta-Referential Games,\nand use this framework to build our benchmark, the Symbolic Behaviour Benchmark\n(S2B). We provide baseline results and error analysis showing that our\nbenchmark is a compelling challenge that we hope will spur the research\ncommunity towards developing more capable artificial agents.\n","authors":["Kevin Denamganaï","Sondess Missaoui","James Alfred Walker"],"pdf_url":"https://arxiv.org/pdf/2207.08012v5.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2312.11969v1","updated":"2023-12-19T09:04:26Z","published":"2023-12-19T09:04:26Z","title":"GroupMixNorm Layer for Learning Fair Models","summary":" Recent research has identified discriminatory behavior of automated\nprediction algorithms towards groups identified on specific protected\nattributes (e.g., gender, ethnicity, age group, etc.). When deployed in\nreal-world scenarios, such techniques may demonstrate biased predictions\nresulting in unfair outcomes. Recent literature has witnessed algorithms for\nmitigating such biased behavior mostly by adding convex surrogates of fairness\nmetrics such as demographic parity or equalized odds in the loss function,\nwhich are often not easy to estimate. This research proposes a novel\nin-processing based GroupMixNorm layer for mitigating bias from deep learning\nmodels. The GroupMixNorm layer probabilistically mixes group-level feature\nstatistics of samples across different groups based on the protected attribute.\nThe proposed method improves upon several fairness metrics with minimal impact\non overall accuracy. Analysis on benchmark tabular and image datasets\ndemonstrates the efficacy of the proposed method in achieving state-of-the-art\nperformance. Further, the experimental analysis also suggests the robustness of\nthe GroupMixNorm layer against new protected attributes during inference and\nits utility in eliminating bias from a pre-trained network.\n","authors":["Anubha Pandey","Aditi Rai","Maneet Singh","Deepak Bhatt","Tanmoy Bhowmik"],"pdf_url":"https://arxiv.org/pdf/2312.11969v1.pdf","comment":"12 pages, 6 figures, Pacific-Asia Conference on Knowledge Discovery\n and Data Mining (PAKDD) 2023"},{"id":"http://arxiv.org/abs/2210.15657v3","updated":"2023-12-19T08:59:26Z","published":"2022-10-25T10:20:27Z","title":"Detecting fake accounts through Generative Adversarial Network in online\n social media","summary":" Online social media is integral to human life, facilitating messaging,\ninformation sharing, and confidential communication while preserving privacy.\nPlatforms like Twitter, Instagram, and Facebook exemplify this phenomenon.\nHowever, users face challenges due to network anomalies, often stemming from\nmalicious activities such as identity theft for financial gain or harm. This\npaper proposes a novel method using user similarity measures and the Generative\nAdversarial Network (GAN) algorithm to identify fake user accounts in the\nTwitter dataset. Despite the problem's complexity, the method achieves an AUC\nrate of 80\\% in classifying and detecting fake accounts. Notably, the study\nbuilds on previous research, highlighting advancements and insights into the\nevolving landscape of anomaly detection in online social networks.\n","authors":["Jinus Bordbar","Mohammadreza Mohammadrezaie","Saman Ardalan","Mohammad Ebrahim Shiri"],"pdf_url":"https://arxiv.org/pdf/2210.15657v3.pdf","comment":"need more investigation on the paper"},{"id":"http://arxiv.org/abs/2212.01071v4","updated":"2023-12-19T08:58:50Z","published":"2022-12-02T10:22:18Z","title":"Fake detection in imbalance dataset by Semi-supervised learning with GAN","summary":" As social media continues to grow rapidly, the prevalence of harassment on\nthese platforms has also increased. This has piqued the interest of researchers\nin the field of fake detection. Social media data, often forms complex graphs\nwith numerous nodes, posing several challenges. These challenges and\nlimitations include dealing with a significant amount of irrelevant features in\nmatrices and addressing issues such as high data dispersion and an imbalanced\nclass distribution within the dataset. To overcome these challenges and\nlimitations, researchers have employed auto-encoders and a combination of\nsemi-supervised learning with a GAN algorithm, referred to as SGAN. Our\nproposed method utilizes auto-encoders for feature extraction and incorporates\nSGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN\ncompensates for the limited availability of labeled data, making efficient use\nof the limited number of labeled instances. Multiple evaluation metrics were\nemployed, including the Confusion Matrix and the ROC curve. The dataset was\ndivided into training and testing sets, with 100 labeled samples for training\nand 1,000 samples for testing. The novelty of our research lies in applying\nSGAN to address the issue of imbalanced datasets in fake account detection. By\noptimizing the use of a smaller number of labeled instances and reducing the\nneed for extensive computational power, our method offers a more efficient\nsolution. Additionally, our study contributes to the field by achieving an 81%\naccuracy in detecting fake accounts using only 100 labeled samples. This\ndemonstrates the potential of SGAN as a powerful tool for handling minority\nclasses and addressing big data challenges in fake account detection.\n","authors":["Jinus Bordbar","Saman Ardalan","Mohammadreza Mohammadrezaie","Zahra Ghasemi"],"pdf_url":"https://arxiv.org/pdf/2212.01071v4.pdf","comment":"need more investigation on results"},{"id":"http://arxiv.org/abs/2312.11952v1","updated":"2023-12-19T08:53:00Z","published":"2023-12-19T08:53:00Z","title":"Automatic Parameter Selection for Non-Redundant Clustering","summary":" High-dimensional datasets often contain multiple meaningful clusterings in\ndifferent subspaces. For example, objects can be clustered either by color,\nweight, or size, revealing different interpretations of the given dataset. A\nvariety of approaches are able to identify such non-redundant clusterings.\nHowever, most of these methods require the user to specify the expected number\nof subspaces and clusters for each subspace. Stating these values is a\nnon-trivial problem and usually requires detailed knowledge of the input\ndataset. In this paper, we propose a framework that utilizes the Minimum\nDescription Length Principle (MDL) to detect the number of subspaces and\nclusters per subspace automatically. We describe an efficient procedure that\ngreedily searches the parameter space by splitting and merging subspaces and\nclusters within subspaces. Additionally, an encoding strategy is introduced\nthat allows us to detect outliers in each subspace. Extensive experiments show\nthat our approach is highly competitive to state-of-the-art methods.\n","authors":["Collin Leiber","Dominik Mautz","Claudia Plant","Christian Böhm"],"pdf_url":"https://arxiv.org/pdf/2312.11952v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.01242v2","updated":"2023-12-19T08:52:02Z","published":"2023-02-02T17:24:43Z","title":"Neuro-Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and\n Concept Rehearsal","summary":" We introduce Neuro-Symbolic Continual Learning, where a model has to solve a\nsequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to\nhigh-level concepts and compute predictions by reasoning consistently with\nprior knowledge. Our key observation is that neuro-symbolic tasks, although\ndifferent, often share concepts whose semantics remains stable over time.\nTraditional approaches fall short: existing continual strategies ignore\nknowledge altogether, while stock neuro-symbolic architectures suffer from\ncatastrophic forgetting. We show that leveraging prior knowledge by combining\nneuro-symbolic architectures with continual strategies does help avoid\ncatastrophic forgetting, but also that doing so can yield models affected by\nreasoning shortcuts. These undermine the semantics of the acquired concepts,\neven when detailed prior knowledge is provided upfront and inference is exact,\nand in turn continual performance. To overcome these issues, we introduce COOL,\na COncept-level cOntinual Learning strategy tailored for neuro-symbolic\ncontinual problems that acquires high-quality concepts and remembers them over\ntime. Our experiments on three novel benchmarks highlights how COOL attains\nsustained high performance on neuro-symbolic continual learning tasks in which\nother strategies fail.\n","authors":["Emanuele Marconato","Gianpaolo Bontempo","Elisa Ficarra","Simone Calderara","Andrea Passerini","Stefano Teso"],"pdf_url":"https://arxiv.org/pdf/2302.01242v2.pdf","comment":"40th International Conference on Machine Learning (ICML 2023)"},{"id":"http://arxiv.org/abs/2010.10258v3","updated":"2023-12-19T08:45:50Z","published":"2020-10-19T03:01:33Z","title":"Hierarchical Autoregressive Modeling for Neural Video Compression","summary":" Recent work by Marino et al. (2020) showed improved performance in sequential\ndensity estimation by combining masked autoregressive flows with hierarchical\nlatent variable models. We draw a connection between such autoregressive\ngenerative models and the task of lossy video compression. Specifically, we\nview recent neural video compression methods (Lu et al., 2019; Yang et al.,\n2020b; Agustssonet al., 2020) as instances of a generalized stochastic temporal\nautoregressive transform, and propose avenues for enhancement based on this\ninsight. Comprehensive evaluations on large-scale video data show improved\nrate-distortion performance over both state-of-the-art neural and conventional\nvideo compression methods.\n","authors":["Ruihan Yang","Yibo Yang","Joseph Marino","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2010.10258v3.pdf","comment":"Published as a conference paper at ICLR 2021"},{"id":"http://arxiv.org/abs/2308.12681v2","updated":"2023-12-19T08:43:57Z","published":"2023-08-24T09:40:37Z","title":"LR-XFL: Logical Reasoning-based Explainable Federated Learning","summary":" Federated learning (FL) is an emerging approach for training machine learning\nmodels collaboratively while preserving data privacy. The need for privacy\nprotection makes it difficult for FL models to achieve global transparency and\nexplainability. To address this limitation, we incorporate logic-based\nexplanations into FL by proposing the Logical Reasoning-based eXplainable\nFederated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local\nlogic rules based on their local data and send them, along with model updates,\nto the FL server. The FL server connects the local logic rules through a proper\nlogical connector that is derived based on properties of client data, without\nrequiring access to the raw data. In addition, the server also aggregates the\nlocal model updates with weight values determined by the quality of the\nclients' local data as reflected by their uploaded logic rules. The results\nshow that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and\n5.41% in terms of classification accuracy, rule accuracy and rule fidelity,\nrespectively. The explicit rule evaluation and expression under LR-XFL enable\nhuman experts to validate and correct the rules on the server side, hence\nimproving the global FL model's robustness to errors. It has the potential to\nenhance the transparency of FL models for areas like healthcare and finance\nwhere both data privacy and explainability are important.\n","authors":["Yanci Zhang","Han Yu"],"pdf_url":"https://arxiv.org/pdf/2308.12681v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11939v1","updated":"2023-12-19T08:38:03Z","published":"2023-12-19T08:38:03Z","title":"Time-Series Contrastive Learning against False Negatives and Class\n Imbalance","summary":" As an exemplary self-supervised approach for representation learning,\ntime-series contrastive learning has exhibited remarkable advancements in\ncontemporary research. While recent contrastive learning strategies have\nfocused on how to construct appropriate positives and negatives, in this study,\nwe conduct theoretical analysis and find they have overlooked the fundamental\nissues: false negatives and class imbalance inherent in the InfoNCE loss-based\nframework. Therefore, we introduce a straightforward modification grounded in\nthe SimCLR framework, universally adaptable to models engaged in the instance\ndiscrimination task. By constructing instance graphs to facilitate interactive\nlearning among instances, we emulate supervised contrastive learning via the\nmultiple-instances discrimination task, mitigating the harmful impact of false\nnegatives. Moreover, leveraging the graph structure and few-labeled data, we\nperform semi-supervised consistency classification and enhance the\nrepresentative ability of minority classes. We compared our method with the\nmost popular time-series contrastive learning methods on four real-world\ntime-series datasets and demonstrated our significant advantages in overall\nperformance.\n","authors":["Xiyuan Jin","Jing Wang","Lei Liu","Youfang Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11939v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09844v2","updated":"2023-12-19T08:27:44Z","published":"2023-12-15T14:49:41Z","title":"Small Dataset, Big Gains: Enhancing Reinforcement Learning by Offline\n Pre-Training with Model Based Augmentation","summary":" Offline reinforcement learning leverages pre-collected datasets of\ntransitions to train policies. It can serve as effective initialization for\nonline algorithms, enhancing sample efficiency and speeding up convergence.\nHowever, when such datasets are limited in size and quality, offline\npre-training can produce sub-optimal policies and lead to degraded online\nreinforcement learning performance. In this paper we propose a model-based data\naugmentation strategy to maximize the benefits of offline reinforcement\nlearning pre-training and reduce the scale of data needed to be effective. Our\napproach leverages a world model of the environment trained on the offline\ndataset to augment states during offline pre-training. We evaluate our approach\non a variety of MuJoCo robotic tasks and our results show it can jump-start\nonline fine-tuning and substantially reduce - in some cases by an order of\nmagnitude - the required number of environment interactions.\n","authors":["Girolamo Macaluso","Alessandro Sestini","Andrew D. Bagdanov"],"pdf_url":"https://arxiv.org/pdf/2312.09844v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11934v1","updated":"2023-12-19T08:20:19Z","published":"2023-12-19T08:20:19Z","title":"Identification of Causal Structure with Latent Variables Based on Higher\n Order Cumulants","summary":" Causal discovery with latent variables is a crucial but challenging task.\nDespite the emergence of numerous methods aimed at addressing this challenge,\nthey are not fully identified to the structure that two observed variables are\ninfluenced by one latent variable and there might be a directed edge in\nbetween. Interestingly, we notice that this structure can be identified through\nthe utilization of higher-order cumulants. By leveraging the higher-order\ncumulants of non-Gaussian data, we provide an analytical solution for\nestimating the causal coefficients or their ratios. With the estimated (ratios\nof) causal coefficients, we propose a novel approach to identify the existence\nof a causal edge between two observed variables subject to latent variable\ninfluence. In case when such a causal edge exits, we introduce an asymmetry\ncriterion to determine the causal direction. The experimental results\ndemonstrate the effectiveness of our proposed method.\n","authors":["Wei Chen","Zhiyi Huang","Ruichu Cai","Zhifeng Hao","Kun Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.11934v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11933v1","updated":"2023-12-19T08:20:09Z","published":"2023-12-19T08:20:09Z","title":"Dynamic Frequency Domain Graph Convolutional Network for Traffic\n Forecasting","summary":" Complex spatial dependencies in transportation networks make traffic\nprediction extremely challenging. Much existing work is devoted to learning\ndynamic graph structures among sensors, and the strategy of mining spatial\ndependencies from traffic data, known as data-driven, tends to be an intuitive\nand effective approach. However, Time-Shift of traffic patterns and noise\ninduced by random factors hinder data-driven spatial dependence modeling. In\nthis paper, we propose a novel dynamic frequency domain graph convolution\nnetwork (DFDGCN) to capture spatial dependencies. Specifically, we mitigate the\neffects of time-shift by Fourier transform, and introduce the identity\nembedding of sensors and time embedding when capturing data for graph learning\nsince traffic data with noise is not entirely reliable. The graph is combined\nwith static predefined and self-adaptive graphs during graph convolution to\npredict future traffic data through classical causal convolutions. Extensive\nexperiments on four real-world datasets demonstrate that our model is effective\nand outperforms the baselines.\n","authors":["Yujie Li","Zezhi Shao","Yongjun Xu","Qiang Qiu","Zhaogang Cao","Fei Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11929v1","updated":"2023-12-19T08:15:22Z","published":"2023-12-19T08:15:22Z","title":"Transformer Network for Multi-Person Tracking and Re-Identification in\n Unconstrained Environment","summary":" Multi-object tracking (MOT) has profound applications in a variety of fields,\nincluding surveillance, sports analytics, self-driving, and cooperative\nrobotics. Despite considerable advancements, existing MOT methodologies tend to\nfalter when faced with non-uniform movements, occlusions, and\nappearance-reappearance scenarios of the objects. Recognizing this inadequacy,\nwe put forward an integrated MOT method that not only marries object detection\nand identity linkage within a singular, end-to-end trainable framework but also\nequips the model with the ability to maintain object identity links over long\nperiods of time. Our proposed model, named STMMOT, is built around four key\nmodules: 1) candidate proposal generation, which generates object proposals via\na vision-transformer encoder-decoder architecture that detects the object from\neach frame in the video; 2) scale variant pyramid, a progressive pyramid\nstructure to learn the self-scale and cross-scale similarities in multi-scale\nfeature maps; 3) spatio-temporal memory encoder, extracting the essential\ninformation from the memory associated with each object under tracking; and 4)\nspatio-temporal memory decoder, simultaneously resolving the tasks of object\ndetection and identity association for MOT. Our system leverages a robust\nspatio-temporal memory module that retains extensive historical observations\nand effectively encodes them using an attention-based aggregator. The\nuniqueness of STMMOT lies in representing objects as dynamic query embeddings\nthat are updated continuously, which enables the prediction of object states\nwith attention mechanisms and eradicates the need for post-processing.\n","authors":["Hamza Mukhtar","Muhammad Usman Ghani Khan"],"pdf_url":"https://arxiv.org/pdf/2312.11929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10276v2","updated":"2023-12-19T08:12:51Z","published":"2023-12-16T00:50:17Z","title":"Asymmetric Norms to Approximate the Minimum Action Distance","summary":" This paper presents a state representation for reward-free Markov decision\nprocesses. The idea is to learn, in a self-supervised manner, an embedding\nspace where distances between pairs of embedded states correspond to the\nminimum number of actions needed to transition between them. Unlike previous\nmethods, our approach incorporates an asymmetric norm parametrization, enabling\naccurate approximations of minimum action distances in environments with\ninherent asymmetry. We show how this representation can be leveraged to learn\ngoal-conditioned policies, providing a notion of similarity between states and\ngoals and a useful heuristic distance to guide planning. To validate our\napproach, we conduct empirical experiments on both symmetric and asymmetric\nenvironments. Our results show that our asymmetric norm parametrization\nperforms comparably to symmetric norms in symmetric environments and surpasses\nsymmetric norms in asymmetric environments.\n","authors":["Lorenzo Steccanella","Anders Jonsson"],"pdf_url":"https://arxiv.org/pdf/2312.10276v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11927v1","updated":"2023-12-19T08:09:36Z","published":"2023-12-19T08:09:36Z","title":"Empowering Dual-Level Graph Self-Supervised Pretraining with Motif\n Discovery","summary":" While self-supervised graph pretraining techniques have shown promising\nresults in various domains, their application still experiences challenges of\nlimited topology learning, human knowledge dependency, and incompetent\nmulti-level interactions. To address these issues, we propose a novel solution,\nDual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which\nintroduces a unique dual-level pretraining structure that orchestrates\nnode-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM\nautonomously uncovers significant graph motifs through an edge pooling module,\naligning learned motif similarities with graph kernel-based similarities. A\ncross-matching task enables sophisticated node-motif interactions and novel\nrepresentation learning. Extensive experiments on 15 datasets validate DGPM's\neffectiveness and generalizability, outperforming state-of-the-art methods in\nunsupervised representation learning and transfer learning settings. The\nautonomously discovered motifs demonstrate the potential of DGPM to enhance\nrobustness and interpretability.\n","authors":["Pengwei Yan","Kaisong Song","Zhuoren Jiang","Yangyang Kang","Tianqianjin Lin","Changlong Sun","Xiaozhong Liu"],"pdf_url":"https://arxiv.org/pdf/2312.11927v1.pdf","comment":"14 pages, 6 figures, accepted by AAAI'24"},{"id":"http://arxiv.org/abs/2312.11926v1","updated":"2023-12-19T08:07:41Z","published":"2023-12-19T08:07:41Z","title":"Big Learning Expectation Maximization","summary":" Mixture models serve as one fundamental tool with versatile applications.\nHowever, their training techniques, like the popular Expectation Maximization\n(EM) algorithm, are notoriously sensitive to parameter initialization and often\nsuffer from bad local optima that could be arbitrarily worse than the optimal.\nTo address the long-lasting bad-local-optima challenge, we draw inspiration\nfrom the recent ground-breaking foundation models and propose to leverage their\nunderlying big learning principle to upgrade the EM. Specifically, we present\nthe Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs\njoint, marginal, and orthogonally transformed marginal matchings between data\nand model distributions. Through simulated experiments, we empirically show\nthat the BigLearn-EM is capable of delivering the optimal with high\nprobability; comparisons on benchmark clustering datasets further demonstrate\nits effectiveness and advantages over existing techniques. The code is\navailable at\nhttps://github.com/YulaiCong/Big-Learning-Expectation-Maximization.\n","authors":["Yulai Cong","Sijia Li"],"pdf_url":"https://arxiv.org/pdf/2312.11926v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11918v1","updated":"2023-12-19T07:56:25Z","published":"2023-12-19T07:56:25Z","title":"A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on\n NVIDIA Hopper Architecture using the CUTLASS Library","summary":" We provide an optimized implementation of the forward pass of\nFlashAttention-2, a popular memory-aware scaled dot-product attention\nalgorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture\nand written using the open-source CUTLASS library. In doing so, we explain the\nchallenges and techniques involved in fusing online-softmax with back-to-back\nGEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and\nWarpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and\ntransforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations,\nand choosing optimal tile sizes for the Q, K and V attention matrices while\nbalancing the register pressure and shared memory utilization. In head-to-head\nbenchmarks on a single H100 PCIe GPU for some common choices of\nhyperparameters, we observe 20-50% higher FLOPs/s over a version of\nFlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.\n","authors":["Ganesh Bikshandi","Jay Shah"],"pdf_url":"https://arxiv.org/pdf/2312.11918v1.pdf","comment":"13 pages, comments welcome"},{"id":"http://arxiv.org/abs/2303.10343v2","updated":"2023-12-19T07:44:31Z","published":"2023-03-18T06:13:30Z","title":"Supervision Interpolation via LossMix: Generalizing Mixup for Object\n Detection and Beyond","summary":" The success of data mixing augmentations in image classification tasks has\nbeen well-received. However, these techniques cannot be readily applied to\nobject detection due to challenges such as spatial misalignment,\nforeground/background distinction, and plurality of instances. To tackle these\nissues, we first introduce a novel conceptual framework called Supervision\nInterpolation (SI), which offers a fresh perspective on interpolation-based\naugmentations by relaxing and generalizing Mixup. Based on SI, we propose\nLossMix, a simple yet versatile and effective regularization that enhances the\nperformance and robustness of object detectors and more. Our key insight is\nthat we can effectively regularize the training on mixed data by interpolating\ntheir loss errors instead of ground truth labels. Empirical results on the\nPASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently\noutperform state-of-the-art methods widely adopted for detection. Furthermore,\nby jointly leveraging LossMix with unsupervised domain adaptation, we\nsuccessfully improve existing approaches and set a new state of the art for\ncross-domain object detection.\n","authors":["Thanh Vu","Baochen Sun","Bodi Yuan","Alex Ngai","Yueqi Li","Jan-Michael Frahm"],"pdf_url":"https://arxiv.org/pdf/2303.10343v2.pdf","comment":"AAAI-24 Camera Ready Version, with supplementary material, 15 pages"},{"id":"http://arxiv.org/abs/2309.11518v2","updated":"2023-12-19T07:40:45Z","published":"2023-09-19T09:17:07Z","title":"Ad-load Balancing via Off-policy Learning in a Content Marketplace","summary":" Ad-load balancing is a critical challenge in online advertising systems,\nparticularly in the context of social media platforms, where the goal is to\nmaximize user engagement and revenue while maintaining a satisfactory user\nexperience. This requires the optimization of conflicting objectives, such as\nuser satisfaction and ads revenue. Traditional approaches to ad-load balancing\nrely on static allocation policies, which fail to adapt to changing user\npreferences and contextual factors. In this paper, we present an approach that\nleverages off-policy learning and evaluation from logged bandit feedback. We\nstart by presenting a motivating analysis of the ad-load balancing problem,\nhighlighting the conflicting objectives between user satisfaction and ads\nrevenue. We emphasize the nuances that arise due to user heterogeneity and the\ndependence on the user's position within a session. Based on this analysis, we\ndefine the problem as determining the optimal ad-load for a particular feed\nfetch. To tackle this problem, we propose an off-policy learning framework that\nleverages unbiased estimators such as Inverse Propensity Scoring (IPS) and\nDoubly Robust (DR) to learn and estimate the policy values using offline\ncollected stochastic data. We present insights from online A/B experiments\ndeployed at scale across over 80 million users generating over 200 million\nsessions, where we find statistically significant improvements in both user\nsatisfaction metrics and ads revenue for the platform.\n","authors":["Hitesh Sagtani","Madan Jhawar","Rishabh Mehrotra","Olivier Jeunen"],"pdf_url":"https://arxiv.org/pdf/2309.11518v2.pdf","comment":"Early version presented at the CONSEQUENCES '23 workshop at RecSys\n '23, final version appearing at WSDM '24"},{"id":"http://arxiv.org/abs/2303.06999v3","updated":"2023-12-19T07:30:25Z","published":"2023-03-13T10:54:52Z","title":"Identifying Label Errors in Object Detection Datasets by Loss Inspection","summary":" Labeling datasets for supervised object detection is a dull and\ntime-consuming task. Errors can be easily introduced during annotation and\noverlooked during review, yielding inaccurate benchmarks and performance\ndegradation of deep neural networks trained on noisy labels. In this work, we\nfor the first time introduce a benchmark for label error detection methods on\nobject detection datasets as well as a label error detection method and a\nnumber of baselines. We simulate four different types of randomly introduced\nlabel errors on train and test sets of well-labeled object detection datasets.\nFor our label error detection method we assume a two-stage object detector to\nbe given and consider the sum of both stages' classification and regression\nlosses. The losses are computed with respect to the predictions and the noisy\nlabels including simulated label errors, aiming at detecting the latter. We\ncompare our method to three baselines: a naive one without deep learning, the\nobject detector's score and the entropy of the classification softmax\ndistribution. We outperform all baselines and demonstrate that among the\nconsidered methods, ours is the only one that detects label errors of all four\ntypes efficiently. Furthermore, we detect real label errors a) on commonly used\ntest datasets in object detection and b) on a proprietary dataset. In both\ncases we achieve low false positives rates, i.e., we detect label errors with a\nprecision for a) of up to 71.5% and for b) with 97%.\n","authors":["Marius Schubert","Tobias Riedlinger","Karsten Kahl","Daniel Kröll","Sebastian Schoenen","Siniša Šegvić","Matthias Rottmann"],"pdf_url":"https://arxiv.org/pdf/2303.06999v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11905v1","updated":"2023-12-19T07:23:49Z","published":"2023-12-19T07:23:49Z","title":"Convergence Visualizer of Decentralized Federated Distillation with\n Reduced Communication Costs","summary":" Federated learning (FL) achieves collaborative learning without the need for\ndata sharing, thus preventing privacy leakage. To extend FL into a fully\ndecentralized algorithm, researchers have applied distributed optimization\nalgorithms to FL by considering machine learning (ML) tasks as parameter\noptimization problems. Conversely, the consensus-based multi-hop federated\ndistillation (CMFD) proposed in the authors' previous work makes neural network\n(NN) models get close with others in a function space rather than in a\nparameter space. Hence, this study solves two unresolved challenges of CMFD:\n(1) communication cost reduction and (2) visualization of model convergence.\nBased on a proposed dynamic communication cost reduction method (DCCR), the\namount of data transferred in a network is reduced; however, with a slight\ndegradation in the prediction accuracy. In addition, a technique for\nvisualizing the distance between the NN models in a function space is also\nproposed. The technique applies a dimensionality reduction technique by\napproximating infinite-dimensional functions as numerical vectors to visualize\nthe trajectory of how the models change by the distributed learning algorithm.\n","authors":["Akihito Taya","Yuuki Nishiyama","Kaoru Sezaki"],"pdf_url":"https://arxiv.org/pdf/2312.11905v1.pdf","comment":"(c) 2023 IEEE. Personal use of this material is permitted. Permission\n from IEEE must be obtained for all other uses, in any current or future\n media, including reprinting/republishing this material for advertising or\n promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2311.15570v2","updated":"2023-12-19T07:12:21Z","published":"2023-11-27T06:38:07Z","title":"UFDA: Universal Federated Domain Adaptation with Practical Assumptions","summary":" Conventional Federated Domain Adaptation (FDA) approaches usually demand an\nabundance of assumptions, which makes them significantly less feasible for\nreal-world situations and introduces security hazards. This paper relaxes the\nassumptions from previous FDAs and studies a more practical scenario named\nUniversal Federated Domain Adaptation (UFDA). It only requires the black-box\nmodel and the label set information of each source domain, while the label sets\nof different source domains could be inconsistent, and the target-domain label\nset is totally blind. Towards a more effective solution for our newly proposed\nUFDA scenario, we propose a corresponding methodology called Hot-Learning with\nContrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain\nshifts and category gaps problems by using one-hot outputs from the black-box\nmodels of various source domains. Moreover, to better distinguish the shared\nand unknown classes, we further present a cluster-level strategy named\nMutual-Voting Decision (MVD) to extract robust consensus knowledge across peer\nclasses from both source and target domains. Extensive experiments on three\nbenchmark datasets demonstrate that our method achieves comparable performance\nfor our UFDA scenario with much fewer assumptions, compared to previous\nmethodologies with comprehensive additional assumptions.\n","authors":["Xinhui Liu","Zhenghao Chen","Luping Zhou","Dong Xu","Wei Xi","Gairui Bai","Yihan Zhao","Jizhong Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.15570v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.11903v1","updated":"2023-12-19T07:06:32Z","published":"2023-12-19T07:06:32Z","title":"Sign Language Conversation Interpretation Using Wearable Sensors and\n Machine Learning","summary":" The count of people suffering from various levels of hearing loss reached\n1.57 billion in 2019. This huge number tends to suffer on many personal and\nprofessional levels and strictly needs to be included with the rest of society\nhealthily. This paper presents a proof of concept of an automatic sign language\nrecognition system based on data obtained using a wearable device of 3 flex\nsensors. The system is designed to interpret a selected set of American Sign\nLanguage (ASL) dynamic words by collecting data in sequences of the performed\nsigns and using machine learning methods. The built models achieved\nhigh-quality performances, such as Random Forest with 99% accuracy, Support\nVector Machine (SVM) with 99%, and two K-Nearest Neighbor (KNN) models with\n98%. This indicates many possible paths toward the development of a full-scale\nsystem.\n","authors":["Basma Kalandar","Ziemowit Dworakowski"],"pdf_url":"https://arxiv.org/pdf/2312.11903v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11898v1","updated":"2023-12-19T06:47:22Z","published":"2023-12-19T06:47:22Z","title":"Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution\n Network Using Attention-GCN-LSTM","summary":" Accurately predicting line loss rates is vital for effective line loss\nmanagement in distribution networks, especially over short-term multi-horizons\nranging from one hour to one week. In this study, we propose\nAttention-GCN-LSTM, a novel method that combines Graph Convolutional Networks\n(GCN), Long Short-Term Memory (LSTM), and a three-level attention mechanism to\naddress this challenge. By capturing spatial and temporal dependencies, our\nmodel enables accurate forecasting of line loss rates across multiple horizons.\nThrough comprehensive evaluation using real-world data from 10KV feeders, our\nAttention-GCN-LSTM model consistently outperforms existing algorithms,\nexhibiting superior performance in terms of prediction accuracy and\nmulti-horizon forecasting. This model holds significant promise for enhancing\nline loss management in distribution networks.\n","authors":["Jie Liu","Yijia Cao","Yong Li","Yixiu Guo","Wei Deng"],"pdf_url":"https://arxiv.org/pdf/2312.11898v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02651v2","updated":"2023-12-19T06:44:03Z","published":"2023-10-04T08:19:04Z","title":"Hire When You Need to: Gradual Participant Recruitment for Auction-based\n Federated Learning","summary":" The success of Federated Learning (FL) depends on the quantity and quality of\nthe data owners (DOs) as well as their motivation to join FL model training.\nReputation-based FL participant selection methods have been proposed. However,\nthey still face the challenges of the cold start problem and potential\nselection bias towards highly reputable DOs. Such a bias can result in lower\nreputation DOs being prematurely excluded from future FL training rounds,\nthereby reducing the diversity of training data and the generalizability of the\nresulting models. To address these challenges, we propose the Gradual\nParticipant Selection scheme for Auction-based Federated Learning (GPS-AFL).\nUnlike existing AFL incentive mechanisms which generally assume that all DOs\nrequired for an FL task must be selected in one go, GPS-AFL gradually selects\nthe required DOs over multiple rounds of training as more information is\nrevealed through repeated interactions. It is designed to strike a balance\nbetween cost saving and performance enhancement, while mitigating the drawbacks\nof selection bias in reputation-based FL. Extensive experiments based on\nreal-world datasets demonstrate the significant advantages of GPS-AFL, which\nreduces costs by 33.65% and improved total utility by 2.91%, on average\ncompared to the best-performing state-of-the-art approach.\n","authors":["Xavier Tan","Han Yu"],"pdf_url":"https://arxiv.org/pdf/2310.02651v2.pdf","comment":"9 Pages, 3 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.15516v3","updated":"2023-12-19T06:39:27Z","published":"2023-10-24T04:50:32Z","title":"Graph Attention-based Deep Reinforcement Learning for solving the\n Chinese Postman Problem with Load-dependent costs","summary":" Recently, Deep reinforcement learning (DRL) models have shown promising\nresults in solving routing problems. However, most DRL solvers are commonly\nproposed to solve node routing problems, such as the Traveling Salesman Problem\n(TSP). Meanwhile, there has been limited research on applying neural methods to\narc routing problems, such as the Chinese Postman Problem (CPP), since they\noften feature irregular and complex solution spaces compared to TSP. To fill\nthese gaps, this paper proposes a novel DRL framework to address the CPP with\nload-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc\nrouting problem with load constraints. The novelty of our method is two-fold.\nFirst, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential\nmodel. Subsequently, we introduce an autoregressive model based on DRL, namely\nArc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge\neffectively. Such a framework allows the DRL model to work efficiently and\nscalably to arc routing problems. Furthermore, we propose a new bio-inspired\nmeta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC.\nExtensive experiments show that Arc-DRL outperforms existing meta-heuristic\nmethods such as Iterative Local Search (ILS) and Variable Neighborhood Search\n(VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for\nCPP-LC regarding both solution quality and running time; while the EA gives the\nbest solution quality with much more running time. We release our C++\nimplementations for metaheuristics such as EA, ILS and VNS along with the code\nfor data generation and our generated data at\nhttps://github.com/HySonLab/Chinese_Postman_Problem\n","authors":["Cong Dao Tran","Truong Son Hy"],"pdf_url":"https://arxiv.org/pdf/2310.15516v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11894v1","updated":"2023-12-19T06:38:18Z","published":"2023-12-19T06:38:18Z","title":"3D-LFM: Lifting Foundation Model","summary":" The lifting of 3D structure and camera from 2D landmarks is at the\ncornerstone of the entire discipline of computer vision. Traditional methods\nhave been confined to specific rigid objects, such as those in\nPerspective-n-Point (PnP) problems, but deep learning has expanded our\ncapability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL)\nwith resilience to noise, occlusions, and perspective distortions. All these\ntechniques, however, have been limited by the fundamental need to establish\ncorrespondences across the 3D training data -- significantly limiting their\nutility to applications where one has an abundance of \"in-correspondence\" 3D\ndata. Our approach harnesses the inherent permutation equivariance of\ntransformers to manage varying number of points per 3D data instance,\nwithstands occlusions, and generalizes to unseen categories. We demonstrate\nstate of the art performance across 2D-3D lifting task benchmarks. Since our\napproach can be trained across such a broad class of structures we refer to it\nsimply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.\n","authors":["Mosam Dabhi","Laszlo A. Jeni","Simon Lucey"],"pdf_url":"https://arxiv.org/pdf/2312.11894v1.pdf","comment":"Project page is available at https://3dlfm.github.io"},{"id":"http://arxiv.org/abs/2312.11026v2","updated":"2023-12-19T06:32:32Z","published":"2023-12-18T08:59:31Z","title":"MISA: Unveiling the Vulnerabilities in Split Federated Learning","summary":" \\textit{Federated learning} (FL) and \\textit{split learning} (SL) are\nprevailing distributed paradigms in recent years. They both enable shared\nglobal model training while keeping data localized on users' devices. The\nformer excels in parallel execution capabilities, while the latter enjoys low\ndependence on edge computing resources and strong privacy protection.\n\\textit{Split federated learning} (SFL) combines the strengths of both FL and\nSL, making it one of the most popular distributed architectures. Furthermore, a\nrecent study has claimed that SFL exhibits robustness against poisoning\nattacks, with a fivefold improvement compared to FL in terms of robustness.\n In this paper, we present a novel poisoning attack known as MISA. It poisons\nboth the top and bottom models, causing a \\textbf{\\underline{misa}}lignment in\nthe global model, ultimately leading to a drastic accuracy collapse. This\nattack unveils the vulnerabilities in SFL, challenging the conventional belief\nthat SFL is robust against poisoning attacks. Extensive experiments demonstrate\nthat our proposed MISA poses a significant threat to the availability of SFL,\nunderscoring the imperative for academia and industry to accord this matter due\nattention.\n","authors":["Wei Wan","Yuxuan Ning","Shengshan Hu","Lulu Xue","Minghui Li","Leo Yu Zhang","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2312.11026v2.pdf","comment":"This paper has been accepted by the IEEE International Conference on\n Acoustics, Speech, and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2312.11891v1","updated":"2023-12-19T06:28:32Z","published":"2023-12-19T06:28:32Z","title":"Hierarchical and Incremental Structural Entropy Minimization for\n Unsupervised Social Event Detection","summary":" As a trending approach for social event detection, graph neural network\n(GNN)-based methods enable a fusion of natural language semantics and the\ncomplex social network structural information, thus showing SOTA performance.\nHowever, GNN-based methods can miss useful message correlations. Moreover, they\nrequire manual labeling for training and predetermining the number of events\nfor prediction. In this work, we address social event detection via graph\nstructural entropy (SE) minimization. While keeping the merits of the GNN-based\nmethods, the proposed framework, HISEvent, constructs more informative message\ngraphs, is unsupervised, and does not require the number of events given a\npriori. Specifically, we incrementally explore the graph neighborhoods using\n1-dimensional (1D) SE minimization to supplement the existing message graph\nwith edges between semantically related messages. We then detect events from\nthe message graph by hierarchically minimizing 2-dimensional (2D) SE. Our\nproposed 1D and 2D SE minimization algorithms are customized for social event\ndetection and effectively tackle the efficiency problem of the existing SE\nminimization algorithms. Extensive experiments show that HISEvent consistently\noutperforms GNN-based methods and achieves the new SOTA for social event\ndetection under both closed- and open-set settings while being efficient and\nrobust.\n","authors":["Yuwei Cao","Hao Peng","Zhengtao Yu","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2312.11891v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11882v1","updated":"2023-12-19T06:16:13Z","published":"2023-12-19T06:16:13Z","title":"ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for\n Accelerating Language Models Inference","summary":" Early Exiting is one of the most popular methods to achieve efficient\ninference. Current early exiting methods adopt the (weighted) sum of the cross\nentropy loss of all internal classifiers during training, imposing all these\nclassifiers to predict all instances correctly. However, during inference, as\nlong as one internal classifier predicts an instance correctly, it can\naccelerate without losing accuracy. Thus, there is a notable gap between\ntraining and inference. We propose ConsistentEE, an early exiting method that\nis consistent in training and inference. ConsistentEE formulates the early\nexiting process as a reinforcement learning problem. A policy network is added\nto decide whether an instance should exit or continue. The training objective\nof ConsistentEE only require each instance to be predicted correctly by one\ninternal classifier. Additionally, we introduce the concept Memorize Layer to\nmeasure the hardness of an instance. We incorporate memorized layer into reward\nfunction design, which allows ``easy'' instances to focus more on acceleration\nwhile ``hard'' instances to focus more on accuracy. Experimental results show\nthat our method outperforms other baselines on various natural language\nunderstanding and generation tasks.\n","authors":["Ziqian Zeng","Yihuai Hong","Hongliang Dai","Huiping Zhuang","Cen Chen"],"pdf_url":"https://arxiv.org/pdf/2312.11882v1.pdf","comment":"Accepted in AAAI24"},{"id":"http://arxiv.org/abs/2312.11880v1","updated":"2023-12-19T06:13:58Z","published":"2023-12-19T06:13:58Z","title":"Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case\n Study on Urban Areas","summary":" Urban environments are characterized by complex structures and diverse\nfeatures, making accurate segmentation of point cloud data a challenging task.\nThis paper presents a comprehensive study on the application of RandLA-Net, a\nstate-of-the-art neural network architecture, for the 3D segmentation of\nlarge-scale point cloud data in urban areas. The study focuses on three major\nChinese cities, namely Chengdu, Jiaoda, and Shenzhen, leveraging their unique\ncharacteristics to enhance segmentation performance.\n To address the limited availability of labeled data for these specific urban\nareas, we employed transfer learning techniques. We transferred the learned\nweights from the Sensat Urban and Toronto 3D datasets to initialize our\nRandLA-Net model. Additionally, we performed class remapping to adapt the model\nto the target urban areas, ensuring accurate segmentation results.\n The experimental results demonstrate the effectiveness of the proposed\napproach achieving over 80\\% F1 score for each areas in 3D point cloud\nsegmentation. The transfer learning strategy proves to be crucial in overcoming\ndata scarcity issues, providing a robust solution for urban point cloud\nanalysis. The findings contribute to the advancement of point cloud\nsegmentation methods, especially in the context of rapidly evolving Chinese\nurban areas.\n","authors":["Alperen Enes Bayar","Ufuk Uyan","Elif Toprak","Cao Yuheng","Tang Juncheng","Ahmet Alp Kindiroglu"],"pdf_url":"https://arxiv.org/pdf/2312.11880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11875v1","updated":"2023-12-19T06:06:30Z","published":"2023-12-19T06:06:30Z","title":"Sparse is Enough in Fine-tuning Pre-trained Large Language Model","summary":" With the prevalence of pre-training-fine-tuning paradigm, how to efficiently\nadapt the pre-trained model to the downstream tasks has been an intriguing\nissue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for\nlow-cost adaptation, including Adapters, Bia-only, and the recently widely used\nLow-Rank Adaptation. Although these methods have demonstrated their\neffectiveness to some extent and have been widely applied, the underlying\nprinciples are still unclear. In this paper, we reveal the transition of loss\nlandscape in the downstream domain from random initialization to pre-trained\ninitialization, that is, from low-amplitude oscillation to high-amplitude\noscillation. The parameter gradients exhibit a property akin to sparsity, where\na small fraction of components dominate the total gradient norm, for instance,\n1% of the components account for 99% of the gradient. This property ensures\nthat the pre-trained model can easily find a flat minimizer which guarantees\nthe model's ability to generalize even with a low number of trainable\nparameters. Based on this, we propose a gradient-based sparse fine-tuning\nalgorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its\neffectiveness on a range of tasks including the GLUE Benchmark and\nInstruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.\n","authors":["Weixi Song","Zuchao Li","Lefei Zhang","Hai Zhao","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2312.11875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.10985v2","updated":"2023-12-19T05:51:09Z","published":"2021-12-21T05:07:54Z","title":"Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding","summary":" Drawing on theoretical insights, we advocate an error-based thresholding\n(EBT) mechanism for learned ISTA (LISTA), which utilizes a function of the\nlayer-wise reconstruction error to suggest a specific threshold for each\nobservation in the shrinkage function of each layer. We show that the proposed\nEBT mechanism well disentangles the learnable parameters in the shrinkage\nfunctions from the reconstruction errors, endowing the obtained models with\nimproved adaptivity to possible data variations. With rigorous analyses, we\nfurther show that the proposed EBT also leads to a faster convergence on the\nbasis of LISTA or its variants, in addition to its higher adaptivity. Extensive\nexperimental results confirm our theoretical analyses and verify the\neffectiveness of our methods.\n","authors":["Ziang Li","Kailun Wu","Yiwen Guo","Changshui Zhang"],"pdf_url":"https://arxiv.org/pdf/2112.10985v2.pdf","comment":"Accepted in ICASSP2024"},{"id":"http://arxiv.org/abs/2312.09131v2","updated":"2023-12-19T05:40:27Z","published":"2023-12-14T17:01:58Z","title":"Physics-Informed Neural Network Lyapunov Functions: PDE\n Characterization, Learning, and Verification","summary":" We provide a systematic investigation of using physics-informed neural\nnetworks to compute Lyapunov functions. We encode Lyapunov conditions as a\npartial differential equation (PDE) and use this for training neural network\nLyapunov functions. We analyze the analytical properties of the solutions to\nthe Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov\nequation in training neural Lyapunov functions can lead to approximate regions\nof attraction close to the true domain of attraction. We also examine\napproximation errors and the convergence of neural approximations to the unique\nsolution of Zubov's equation. We then provide sufficient conditions for the\nlearned neural Lyapunov functions that can be readily verified by\nsatisfiability modulo theories (SMT) solvers, enabling formal verification of\nboth local stability analysis and region-of-attraction estimates in the large.\nThrough a number of nonlinear examples, ranging from low to high dimensions, we\ndemonstrate that the proposed framework can outperform traditional\nsums-of-squares (SOS) Lyapunov functions obtained using semidefinite\nprogramming (SDP).\n","authors":["Jun Liu","Yiming Meng","Maxwell Fitzsimmons","Ruikun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.09131v2.pdf","comment":"The current version has been submitted for publication"},{"id":"http://arxiv.org/abs/2311.05144v2","updated":"2023-12-19T05:21:31Z","published":"2023-11-09T04:49:41Z","title":"Counter-Empirical Attacking based on Adversarial Reinforcement Learning\n for Time-Relevant Scoring System","summary":" Scoring systems are commonly seen for platforms in the era of big data. From\ncredit scoring systems in financial services to membership scores in E-commerce\nshopping platforms, platform managers use such systems to guide users towards\nthe encouraged activity pattern, and manage resources more effectively and more\nefficiently thereby. To establish such scoring systems, several \"empirical\ncriteria\" are firstly determined, followed by dedicated top-down design for\neach factor of the score, which usually requires enormous effort to adjust and\ntune the scoring function in the new application scenario. What's worse, many\nfresh projects usually have no ground-truth or any experience to evaluate a\nreasonable scoring system, making the designing even harder. To reduce the\neffort of manual adjustment of the scoring function in every new scoring\nsystem, we innovatively study the scoring system from the preset empirical\ncriteria without any ground truth, and propose a novel framework to improve the\nsystem from scratch. In this paper, we propose a \"counter-empirical attacking\"\nmechanism that can generate \"attacking\" behavior traces and try to break the\nempirical rules of the scoring system. Then an adversarial \"enhancer\" is\napplied to evaluate the scoring system and find the improvement strategy. By\ntraining the adversarial learning problem, a proper scoring function can be\nlearned to be robust to the attacking activity traces that are trying to\nviolate the empirical criteria. Extensive experiments have been conducted on\ntwo scoring systems including a shared computing resource platform and a\nfinancial credit system. The experimental results have validated the\neffectiveness of our proposed framework.\n","authors":["Xiangguo Sun","Hong Cheng","Hang Dong","Bo Qiao","Si Qin","Qingwei Lin"],"pdf_url":"https://arxiv.org/pdf/2311.05144v2.pdf","comment":"Accepted by TKDE"},{"id":"http://arxiv.org/abs/2312.11863v1","updated":"2023-12-19T05:17:27Z","published":"2023-12-19T05:17:27Z","title":"Neural Network Approximation for Pessimistic Offline Reinforcement\n Learning","summary":" Deep reinforcement learning (RL) has shown remarkable success in specific\noffline decision-making scenarios, yet its theoretical guarantees are still\nunder development. Existing works on offline RL theory primarily emphasize a\nfew trivial settings, such as linear MDP or general function approximation with\nstrong assumptions and independent data, which lack guidance for practical use.\nThe coupling of deep learning and Bellman residuals makes this problem\nchallenging, in addition to the difficulty of data dependence. In this paper,\nwe establish a non-asymptotic estimation error of pessimistic offline RL using\ngeneral neural network approximation with $\\mathcal{C}$-mixing data regarding\nthe structure of networks, the dimension of datasets, and the concentrability\nof data coverage, under mild assumptions. Our result shows that the estimation\nerror consists of two parts: the first converges to zero at a desired rate on\nthe sample size with partially controllable concentrability, and the second\nbecomes negligible if the residual constraint is tight. This result\ndemonstrates the explicit efficiency of deep adversarial offline RL frameworks.\nWe utilize the empirical process tool for $\\mathcal{C}$-mixing sequences and\nthe neural network approximation theory for the H\\\"{o}lder class to achieve\nthis. We also develop methods to bound the Bellman estimation error caused by\nfunction approximation with empirical Bellman constraint perturbations.\nAdditionally, we present a result that lessens the curse of dimensionality\nusing data with low intrinsic dimensionality and function classes with low\ncomplexity. Our estimation provides valuable insights into the development of\ndeep offline RL and guidance for algorithm model design.\n","authors":["Di Wu","Yuling Jiao","Li Shen","Haizhao Yang","Xiliang Lu"],"pdf_url":"https://arxiv.org/pdf/2312.11863v1.pdf","comment":"Full version of the paper accepted to the 38th Annual AAAI Conference\n on Artificial Intelligence (AAAI 2024)"},{"id":"http://arxiv.org/abs/2312.11862v1","updated":"2023-12-19T05:14:31Z","published":"2023-12-19T05:14:31Z","title":"Topo-MLP : A Simplicial Network Without Message Passing","summary":" Due to their ability to model meaningful higher order relations among a set\nof entities, higher order network models have emerged recently as a powerful\nalternative for graph-based network models which are only capable of modeling\nbinary relationships. Message passing paradigm is still dominantly used to\nlearn representations even for higher order network models. While powerful,\nmessage passing can have disadvantages during inference, particularly when the\nhigher order connectivity information is missing or corrupted. To overcome such\nlimitations, we propose Topo-MLP, a purely MLP-based simplicial neural network\nalgorithm to learn the representation of elements in a simplicial complex\nwithout explicitly relying on message passing. Our framework utilizes a novel\nHigher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates\nthe simplicial structure into representation learning. Our proposed model's\nsimplicity makes it faster during inference. Moreover, we show that our model\nis robust when faced with missing or corrupted connectivity structure.\n","authors":["Karthikeyan Natesan Ramamurthy","Aldo Guzmán-Sáenz","Mustafa Hajij"],"pdf_url":"https://arxiv.org/pdf/2312.11862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11861v1","updated":"2023-12-19T05:13:16Z","published":"2023-12-19T05:13:16Z","title":"MG-Skip: Random Multi-Gossip Skipping Method for Nonsmooth Distributed\n Optimization","summary":" Distributed optimization methods with probabilistic local updates have\nrecently gained attention for their provable ability to communication\nacceleration. Nevertheless, this capability is effective only when the loss\nfunction is smooth and the network is sufficiently well-connected. In this\npaper, we propose the first linear convergent method MG-Skip with probabilistic\nlocal updates for nonsmooth distributed optimization. Without any extra\ncondition for the network connectivity, MG-Skip allows for the multiple-round\ngossip communication to be skipped in most iterations, while its iteration\ncomplexity is $\\mathcal{O}\\left(\\kappa \\log \\frac{1}{\\epsilon}\\right)$ and\ncommunication complexity is only\n$\\mathcal{O}\\left(\\sqrt{\\frac{\\kappa}{(1-\\rho)}} \\log\n\\frac{1}{\\epsilon}\\right)$, where $\\kappa$ is the condition number of the loss\nfunction and $\\rho$ reflects the connectivity of the network topology. To the\nbest of our knowledge, MG-Skip achieves the best communication complexity when\nthe loss function has the smooth (strongly convex)+nonsmooth (convex) composite\nform.\n","authors":["Luyao Guo","Luqing Wang","Xinli Shi","Jinde Cao"],"pdf_url":"https://arxiv.org/pdf/2312.11861v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11858v1","updated":"2023-12-19T04:58:37Z","published":"2023-12-19T04:58:37Z","title":"SimCalib: Graph Neural Network Calibration based on Similarity between\n Nodes","summary":" Graph neural networks (GNNs) have exhibited impressive performance in\nmodeling graph data as exemplified in various applications. Recently, the GNN\ncalibration problem has attracted increasing attention, especially in\ncost-sensitive scenarios. Previous work has gained empirical insights on the\nissue, and devised effective approaches for it, but theoretical supports still\nfall short. In this work, we shed light on the relationship between GNN\ncalibration and nodewise similarity via theoretical analysis. A novel\ncalibration framework, named SimCalib, is accordingly proposed to consider\nsimilarity between nodes at global and local levels. At the global level, the\nMahalanobis distance between the current node and class prototypes is\nintegrated to implicitly consider similarity between the current node and all\nnodes in the same class. At the local level, the similarity of node\nrepresentation movement dynamics, quantified by nodewise homophily and relative\ndegree, is considered. Informed about the application of nodewise movement\npatterns in analyzing nodewise behavior on the over-smoothing problem, we\nempirically present a possible relationship between over-smoothing and GNN\ncalibration problem. Experimentally, we discover a correlation between nodewise\nsimilarity and model calibration improvement, in alignment with our theoretical\nresults. Additionally, we conduct extensive experiments investigating different\ndesign factors and demonstrate the effectiveness of our proposed SimCalib\nframework for GNN calibration by achieving state-of-the-art performance on 14\nout of 16 benchmarks.\n","authors":["Boshi Tang","Zhiyong Wu","Xixin Wu","Qiaochu Huang","Jun Chen","Shun Lei","Helen Meng"],"pdf_url":"https://arxiv.org/pdf/2312.11858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13976v3","updated":"2023-12-19T04:44:35Z","published":"2023-08-27T00:31:04Z","title":"Label Denoising through Cross-Model Agreement","summary":" Learning from corrupted labels is very common in real-world machine-learning\napplications. Memorizing such noisy labels could affect the learning of the\nmodel, leading to sub-optimal performances. In this work, we propose a novel\nframework to learn robust machine-learning models from noisy labels. Through an\nempirical study, we find that different models make relatively similar\npredictions on clean examples, while the predictions on noisy examples vary\nmuch more across different models. Motivated by this observation, we propose\n\\em denoising with cross-model agreement \\em (DeCA) which aims to minimize the\nKL-divergence between the true label distributions parameterized by two machine\nlearning models while maximizing the likelihood of data observation. We employ\nthe proposed DeCA on both the binary label scenario and the multiple label\nscenario. For the binary label scenario, we select implicit feedback\nrecommendation as the downstream task and conduct experiments with four\nstate-of-the-art recommendation models on four datasets. For the multiple-label\nscenario, the downstream application is image classification on two benchmark\ndatasets. Experimental results demonstrate that the proposed methods\nsignificantly improve the model performance compared with normal training and\nother denoising methods on both binary and multiple-label scenarios.\n","authors":["Yu Wang","Xin Xin","Zaiqiao Meng","Joemon Jose","Fuli Feng"],"pdf_url":"https://arxiv.org/pdf/2308.13976v3.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2105.09605"},{"id":"http://arxiv.org/abs/2312.09323v3","updated":"2023-12-19T04:31:21Z","published":"2023-12-07T19:58:37Z","title":"Perspectives on the State and Future of Deep Learning -- 2023","summary":" The goal of this series is to chronicle opinions and issues in the field of\nmachine learning as they stand today and as they change over time. The plan is\nto host this survey periodically until the AI singularity\npaperclip-frenzy-driven doomsday, keeping an updated list of topical questions\nand interviewing new community members for each edition. In this issue, we\nprobed people's opinions on interpretable AI, the value of benchmarking in\nmodern NLP, the state of progress towards understanding deep learning, and the\nfuture of academia.\n","authors":["Micah Goldblum","Anima Anandkumar","Richard Baraniuk","Tom Goldstein","Kyunghyun Cho","Zachary C Lipton","Melanie Mitchell","Preetum Nakkiran","Max Welling","Andrew Gordon Wilson"],"pdf_url":"https://arxiv.org/pdf/2312.09323v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11846v1","updated":"2023-12-19T04:26:12Z","published":"2023-12-19T04:26:12Z","title":"Initializing Services in Interactive ML Systems for Diverse Users","summary":" This paper studies ML systems that interactively learn from users across\nmultiple subpopulations with heterogeneous data distributions. The primary\nobjective is to provide specialized services for different user groups while\nalso predicting user preferences. Once the users select a service based on how\nwell the service anticipated their preference, the services subsequently adapt\nand refine themselves based on the user data they accumulate, resulting in an\niterative, alternating minimization process between users and services\n(learning dynamics). Employing such tailored approaches has two main\nchallenges: (i) Unknown user preferences: Typically, data on user preferences\nare unavailable without interaction, and uniform data collection across a large\nand diverse user base can be prohibitively expensive. (ii) Suboptimal Local\nSolutions: The total loss (sum of loss functions across all users and all\nservices) landscape is not convex even if the individual losses on a single\nservice are convex, making it likely for the learning dynamics to get stuck in\nlocal minima. The final outcome of the aforementioned learning dynamics is thus\nstrongly influenced by the initial set of services offered to users, and is not\nguaranteed to be close to the globally optimal outcome. In this work, we\npropose a randomized algorithm to adaptively select very few users to collect\npreference data from, while simultaneously initializing a set of services. We\nprove that under mild assumptions on the loss functions, the expected total\nloss achieved by the algorithm right after initialization is within a factor of\nthe globally optimal total loss with complete user preference data, and this\nfactor scales only logarithmically in the number of services. Our theory is\ncomplemented by experiments on real as well as semi-synthetic datasets.\n","authors":["Avinandan Bose","Mihaela Curmei","Daniel L. Jiang","Jamie Morgenstern","Sarah Dean","Lillian J. Ratliff","Maryam Fazel"],"pdf_url":"https://arxiv.org/pdf/2312.11846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11835v1","updated":"2023-12-19T04:03:47Z","published":"2023-12-19T04:03:47Z","title":"Provably Convergent Federated Trilevel Learning","summary":" Trilevel learning, also called trilevel optimization (TLO), has been\nrecognized as a powerful modelling tool for hierarchical decision process and\nwidely applied in many machine learning applications, such as robust neural\narchitecture search, hyperparameter optimization, and domain adaptation.\nTackling TLO problems has presented a great challenge due to their nested\ndecision-making structure. In addition, existing works on TLO face the\nfollowing key challenges: 1) they all focus on the non-distributed setting,\nwhich may lead to privacy breach; 2) they do not offer any non-asymptotic\nconvergence analysis which characterizes how fast an algorithm converges. To\naddress the aforementioned challenges, this paper proposes an asynchronous\nfederated trilevel optimization method to solve TLO problems. The proposed\nmethod utilizes $\\mu$-cuts to construct a hyper-polyhedral approximation for\nthe TLO problem and solve it in an asynchronous manner. We demonstrate that the\nproposed $\\mu$-cuts are applicable to not only convex functions but also a wide\nrange of non-convex functions that meet the $\\mu$-weakly convex assumption.\nFurthermore, we theoretically analyze the non-asymptotic convergence rate for\nthe proposed method by showing its iteration complexity to obtain\n$\\epsilon$-stationary point is upper bounded by\n$\\mathcal{O}(\\frac{1}{\\epsilon^2})$. Extensive experiments on real-world\ndatasets have been conducted to elucidate the superiority of the proposed\nmethod, e.g., it has a faster convergence rate with a maximum acceleration of\napproximately 80$\\%$.\n","authors":["Yang Jiao","Kai Yang","Tiancheng Wu","Chengtao Jian","Jianwei Huang"],"pdf_url":"https://arxiv.org/pdf/2312.11835v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11834v1","updated":"2023-12-19T04:02:50Z","published":"2023-12-19T04:02:50Z","title":"Multi-agent reinforcement learning using echo-state network and its\n application to pedestrian dynamics","summary":" In recent years, simulations of pedestrians using the multi-agent\nreinforcement learning (MARL) have been studied. This study considered the\nroads on a grid-world environment, and implemented pedestrians as MARL agents\nusing an echo-state network and the least squares policy iteration method.\nUnder this environment, the ability of these agents to learn to move forward by\navoiding other agents was investigated. Specifically, we considered two types\nof tasks: the choice between a narrow direct route and a broad detour, and the\nbidirectional pedestrian flow in a corridor. The simulations results indicated\nthat the learning was successful when the density of the agents was not that\nhigh.\n","authors":["Hisato Komatsu"],"pdf_url":"https://arxiv.org/pdf/2312.11834v1.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2004.08697v7","updated":"2023-12-19T03:58:16Z","published":"2020-04-18T20:09:34Z","title":"CausalVAE: Structured Causal Disentanglement in Variational Autoencoder","summary":" Learning disentanglement aims at finding a low dimensional representation\nwhich consists of multiple explanatory and generative factors of the\nobservational data. The framework of variational autoencoder (VAE) is commonly\nused to disentangle independent factors from observations. However, in real\nscenarios, factors with semantics are not necessarily independent. Instead,\nthere might be an underlying causal structure which renders these factors\ndependent. We thus propose a new VAE based framework named CausalVAE, which\nincludes a Causal Layer to transform independent exogenous factors into causal\nendogenous ones that correspond to causally related concepts in data. We\nfurther analyze the model identifiabitily, showing that the proposed model\nlearned from observations recovers the true one up to a certain degree by\nproviding supervision signals (e.g. feature labels). Experiments are conducted\non various datasets, including synthetic and real word benchmark CelebA.\nResults show that the causal representations learned by CausalVAE are\nsemantically interpretable, and their causal relationship as a Directed Acyclic\nGraph (DAG) is identified with good accuracy. Furthermore, we demonstrate that\nthe proposed CausalVAE model is able to generate counterfactual data through\n\"do-operation\" to the causal factors.\n","authors":["Mengyue Yang","Furui Liu","Zhitang Chen","Xinwei Shen","Jianye Hao","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2004.08697v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11832v1","updated":"2023-12-19T03:48:39Z","published":"2023-12-19T03:48:39Z","title":"The Validity of a Machine Learning-Based Video Game in the Objective\n Screening of Attention Deficit Hyperactivity Disorder in Children Aged 5 to\n 12 Years","summary":" Objective: Early identification of ADHD is necessary to provide the\nopportunity for timely treatment. However, screening the symptoms of ADHD on a\nlarge scale is not easy. This study aimed to validate a video game (FishFinder)\nfor the screening of ADHD using objective measurement of the core symptoms of\nthis disorder. Method: The FishFinder measures attention and impulsivity\nthrough in-game performance and evaluates the child's hyperactivity using\nsmartphone motion sensors. This game was tested on 26 children with ADHD and 26\nhealthy children aged 5 to 12 years. A Support Vector Machine was employed to\ndetect children with ADHD. results: This system showed 92.3% accuracy, 90%\nsensitivity, and 93.7% specificity using a combination of in-game and movement\nfeatures. Conclusions: The FishFinder demonstrated a strong ability to identify\nADHD in children. So, this game can be used as an affordable, accessible, and\nenjoyable method for the objective screening of ADHD.\n","authors":["Zeinab Zakani","Hadi Moradi","Sogand Ghasemzadeh","Maryam Riazi","Fatemeh Mortazavi"],"pdf_url":"https://arxiv.org/pdf/2312.11832v1.pdf","comment":"30 pages, 4 figures, 11 tables"},{"id":"http://arxiv.org/abs/2312.11831v1","updated":"2023-12-19T03:45:27Z","published":"2023-12-19T03:45:27Z","title":"Locally-Minimal Probabilistic Explanations","summary":" Formal abductive explanations offer crucial guarantees of rigor and so are of\ninterest in high-stakes uses of machine learning (ML). One drawback of\nabductive explanations is explanation size, justified by the cognitive limits\nof human decision-makers. Probabilistic abductive explanations (PAXps) address\nthis limitation, but their theoretical and practical complexity makes their\nexact computation most often unrealistic. This paper proposes novel efficient\nalgorithms for the computation of locally-minimal PXAps, which offer\nhigh-quality approximations of PXAps in practice. The experimental results\ndemonstrate the practical efficiency of the proposed algorithms.\n","authors":["Yacine Izza","Kuldeep S. Meel","Joao Marques-Silva"],"pdf_url":"https://arxiv.org/pdf/2312.11831v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.11651v2","updated":"2023-12-19T03:35:36Z","published":"2023-09-20T21:32:58Z","title":"Drift Control of High-Dimensional RBM: A Computational Method Based on\n Neural Networks","summary":" Motivated by applications in queueing theory, we consider a stochastic\ncontrol problem whose state space is the $d$-dimensional positive orthant. The\ncontrolled process $Z$ evolves as a reflected Brownian motion whose covariance\nmatrix is exogenously specified, as are its directions of reflection from the\northant's boundary surfaces. A system manager chooses a drift vector\n$\\theta(t)$ at each time $t$ based on the history of $Z$, and the cost rate at\ntime $t$ depends on both $Z(t)$ and $\\theta(t)$. In our initial problem\nformulation, the objective is to minimize expected discounted cost over an\ninfinite planning horizon, after which we treat the corresponding ergodic\ncontrol problem. Extending earlier work by Han et al. (Proceedings of the\nNational Academy of Sciences, 2018, 8505-8510), we develop and illustrate a\nsimulation-based computational method that relies heavily on deep neural\nnetwork technology. For test problems studied thus far, our method is accurate\nto within a fraction of one percent, and is computationally feasible in\ndimensions up to at least $d=30$.\n","authors":["Baris Ata","J. Michael Harrison","Nian Si"],"pdf_url":"https://arxiv.org/pdf/2309.11651v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.09532v3","updated":"2023-12-19T03:35:13Z","published":"2023-02-19T10:34:08Z","title":"Pseudo Contrastive Learning for Graph-based Semi-supervised Learning","summary":" Pseudo Labeling is a technique used to improve the performance of\nsemi-supervised Graph Neural Networks (GNNs) by generating additional\npseudo-labels based on confident predictions. However, the quality of generated\npseudo-labels has been a longstanding concern due to the sensitivity of the\nclassification objective with respect to the given labels. To avoid the\nuntrustworthy classification supervision indicating ``a node belongs to a\nspecific class,'' we favor the fault-tolerant contrasting supervision\ndemonstrating ``two nodes do not belong to the same class.'' Thus, the problem\nof generating high-quality pseudo-labels is then transformed into a relaxed\nversion, i.e., identifying reliable negative pairs. To achieve this, we propose\na general framework for GNNs, termed Pseudo Contrastive Learning (PCL). It\nseparates two nodes whose positive and negative pseudo-labels target the same\nclass. To incorporate topological knowledge into learning, we devise a\ntopologically weighted contrastive loss that spends more effort separating\nnegative pairs with smaller topological distances. Experimentally, we apply PCL\nto various GNNs, which consistently outperform their counterparts using other\npopular general techniques on five real-world graphs.\n","authors":["Weigang Lu","Ziyu Guan","Wei Zhao","Yaming Yang","Yuanhai Lv","Lining Xing","Baosheng Yu","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2302.09532v3.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2312.11822v1","updated":"2023-12-19T03:27:38Z","published":"2023-12-19T03:27:38Z","title":"Classification of complex local environments in systems of particle\n shapes through shape-symmetry encoded data augmentation","summary":" Detecting and analyzing the local environment is crucial for investigating\nthe dynamical processes of crystal nucleation and shape colloidal particle\nself-assembly. Recent developments in machine learning provide a promising\navenue for better order parameters in complex systems that are challenging to\nstudy using traditional approaches. However, the application of machine\nlearning to self-assembly on systems of particle shapes is still underexplored.\nTo address this gap, we propose a simple, physics-agnostic, yet powerful\napproach that involves training a multilayer perceptron (MLP) as a local\nenvironment classifier for systems of particle shapes, using input features\nsuch as particle distances and orientations. Our MLP classifier is trained in a\nsupervised manner with a shape symmetry-encoded data augmentation technique\nwithout the need for any conventional roto-translations invariant symmetry\nfunctions. We evaluate the performance of our classifiers on four different\nscenarios involving self-assembly of cubic structures, 2-dimensional and\n3-dimensional patchy particle shape systems, hexagonal bipyramids with varying\naspect ratios, and truncated shapes with different degrees of truncation. The\nproposed training process and data augmentation technique are both\nstraightforward and flexible, enabling easy application of the classifier to\nother processes involving particle orientations. Our work thus presents a\nvaluable tool for investigating self-assembly processes on systems of particle\nshapes, with potential applications in structure identification of any\nparticle-based or molecular system where orientations can be defined.\n","authors":[" Shih-Kuang"," Lee","Sun-Ting Tsai","Sharon Glotzer"],"pdf_url":"https://arxiv.org/pdf/2312.11822v1.pdf","comment":"14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2312.11819v1","updated":"2023-12-19T03:24:55Z","published":"2023-12-19T03:24:55Z","title":"An Adaptive Placement and Parallelism Framework for Accelerating RLHF\n Training","summary":" Recently, ChatGPT or InstructGPT like large language models (LLM) has made a\nsignificant impact in the AI world. These models are incredibly versatile,\ncapable of performing language tasks on par or even exceeding the capabilities\nof human experts. Many works have attempted to reproduce the complex\nInstructGPT's RLHF (Reinforcement Learning with Human Feedback) training\npipeline. However, the mainstream distributed RLHF training methods typically\nadopt a fixed model placement strategy, referred to as the Flattening strategy.\nThis strategy treats all four models involved in RLHF as a single entity and\nplaces them on all devices, regardless of their differences. Unfortunately,\nthis strategy exacerbates the generation bottlenecks in the RLHF training and\ndegrades the overall training efficiency. To address these issues, we propose\nan adaptive model placement framework that offers two flexible model placement\nstrategies. These strategies allow for the agile allocation of models across\ndevices in a fine-grained manner. The Interleaving strategy helps reduce memory\nredundancy and communication costs during RLHF training. On the other hand, the\nSeparation strategy improves the throughput of model training by separating the\ntraining and generation stages of the RLHF pipeline. Notably, this framework\nseamlessly integrates with other mainstream techniques for acceleration and\nenables automatic hyperparameter search. Extensive experiments have\ndemonstrated that our Interleaving and Separation strategies can achieve\nnotable improvements up to 11x, compared to the current state-of-the-art (SOTA)\napproaches. These experiments encompassed a wide range of training scenarios,\ninvolving models of varying sizes and devices of different scales. The results\nhighlight the effectiveness and superiority of our approaches in accelerating\nthe training of distributed RLHF.\n","authors":["Youshao Xiao","Weichang Wu","Zhenglei Zhou","Fagui Mao","Shangchun Zhao","Lin Ju","Lei Liang","Xiaolu Zhang","Jun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.11819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12670v1","updated":"2023-12-19T23:56:49Z","published":"2023-12-19T23:56:49Z","title":"On the Role of Server Momentum in Federated Learning","summary":" Federated Averaging (FedAvg) is known to experience convergence issues when\nencountering significant clients system heterogeneity and data heterogeneity.\nServer momentum has been proposed as an effective mitigation. However, existing\nserver momentum works are restrictive in the momentum formulation, do not\nproperly schedule hyperparameters and focus only on system homogeneous\nsettings, which leaves the role of server momentum still an under-explored\nproblem. In this paper, we propose a general framework for server momentum,\nthat (a) covers a large class of momentum schemes that are unexplored in\nfederated learning (FL), (b) enables a popular stagewise hyperparameter\nscheduler, (c) allows heterogeneous and asynchronous local computing. We\nprovide rigorous convergence analysis for the proposed framework. To our best\nknowledge, this is the first work that thoroughly analyzes the performances of\nserver momentum with a hyperparameter scheduler and system heterogeneity.\nExtensive experiments validate the effectiveness of our proposed framework.\n","authors":["Jianhui Sun","Xidong Wu","Heng Huang","Aidong Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12670v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2305.12554v2","updated":"2023-12-19T23:52:51Z","published":"2023-05-21T19:31:56Z","title":"Towards Consistent Stochastic Human Motion Prediction via Motion\n Diffusion","summary":" Stochastic Human Motion Prediction (HMP) aims to predict multiple possible\nupcoming pose sequences based on past human motion trajectories. Although\nprevious approaches have shown impressive performance, they face several\nissues, including complex training processes and a tendency to generate\npredictions that are often inconsistent with the provided history, and\nsometimes even becoming entirely unreasonable. To overcome these issues, we\npropose DiffMotion, an end-to-end diffusion-based stochastic HMP framework.\nDiffMotion's motion predictor is composed of two modules, including (1) a\nTransformer-based network for initial motion reconstruction from corrupted\nmotion, and (2) a Graph Convolutional Network (GCN) to refine the generated\nmotion considering past observations. Our method, facilitated by this novel\nTransformer-GCN module design and a proposed variance scheduler, excels in\npredicting accurate, realistic, and consistent motions, while maintaining an\nappropriate level of diversity. Our results on benchmark datasets show that\nDiffMotion significantly outperforms previous methods in terms of both accuracy\nand fidelity, while demonstrating superior robustness.\n","authors":["Jiarui Sun","Girish Chowdhary"],"pdf_url":"https://arxiv.org/pdf/2305.12554v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12668v1","updated":"2023-12-19T23:48:43Z","published":"2023-12-19T23:48:43Z","title":"Convolutional Channel-wise Competitive Learning for the Forward-Forward\n Algorithm","summary":" The Forward-Forward (FF) Algorithm has been recently proposed to alleviate\nthe issues of backpropagation (BP) commonly used to train deep neural networks.\nHowever, its current formulation exhibits limitations such as the generation of\nnegative data, slower convergence, and inadequate performance on complex tasks.\nIn this paper, we take the main ideas of FF and improve them by leveraging\nchannel-wise competitive learning in the context of convolutional neural\nnetworks for image classification tasks. A layer-wise loss function is\nintroduced that promotes competitive learning and eliminates the need for\nnegative data construction. To enhance both the learning of compositional\nfeatures and feature space partitioning, a channel-wise feature separator and\nextractor block is proposed that complements the competitive learning process.\nOur method outperforms recent FF-based models on image classification tasks,\nachieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST,\nFashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the\nperformance gap between FF learning and BP methods, indicating the potential of\nour proposed approach to learn useful representations in a layer-wise modular\nfashion, enabling more efficient and flexible learning.\n","authors":["Andreas Papachristodoulou","Christos Kyrkou","Stelios Timotheou","Theocharis Theocharides"],"pdf_url":"https://arxiv.org/pdf/2312.12668v1.pdf","comment":"To be published in AAAI 2024, 11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.08174v2","updated":"2023-12-19T23:46:35Z","published":"2023-12-13T14:34:12Z","title":"Double Machine Learning for Static Panel Models with Fixed Effects","summary":" Machine Learning (ML) algorithms are powerful data-driven tools for\napproximating highdimensional or non-linear nuisance functions which are useful\nin practice because the true functional form of the predictors is ex-ante\nunknown. In this paper, we develop estimators of policy interventions from\npanel data which allow for non-linear effects of the confounding regressors,\nand investigate the performance of these estimators using three well-known ML\nalgorithms, specifically, LASSO, classification and regression trees, and\nrandom forests. We use Double Machine Learning (DML) (Chernozhukov et al.,\n2018) for the estimation of causal effects of homogeneous treatments with\nunobserved individual heterogeneity (fixed effects) and no unobserved\nconfounding by extending Robinson (1988)'s partially linear regression model.\nWe develop three alternative approaches for handling unobserved individual\nheterogeneity based on extending the within-group estimator, first-difference\nestimator, and correlated random effect estimator (Mundlak, 1978) for\nnon-linear models. Using Monte Carlo simulations, we find that conventional\nleast squares estimators can perform well even if the data generating process\nis nonlinear, but there are substantial performance gains in terms of bias\nreduction under a process where the true effect of the regressors is non-linear\nand discontinuous. However, for the same scenarios, we also find - despite\nextensive hyperparameter tuning - inference to be problematic for both\ntree-based learners because these lead to highly non-normal estimator\ndistributions and the estimator variance being severely under-estimated. This\ncontradicts the performance of trees in other circumstances and requires\nfurther investigation. Finally, we provide an illustrative example of DML for\nobservational panel data showing the impact of the introduction of the national\nminimum wage in the UK.\n","authors":["Paul Clarke","Annalivia Polselli"],"pdf_url":"https://arxiv.org/pdf/2312.08174v2.pdf","comment":"20 pages, 5 tables, 5 figure, 2 appendices"},{"id":"http://arxiv.org/abs/2312.12667v1","updated":"2023-12-19T23:42:20Z","published":"2023-12-19T23:42:20Z","title":"Discovering Malicious Signatures in Software from Structural\n Interactions","summary":" Malware represents a significant security concern in today's digital\nlandscape, as it can destroy or disable operating systems, steal sensitive user\ninformation, and occupy valuable disk space. However, current malware detection\nmethods, such as static-based and dynamic-based approaches, struggle to\nidentify newly developed (``zero-day\") malware and are limited by customized\nvirtual machine (VM) environments. To overcome these limitations, we propose a\nnovel malware detection approach that leverages deep learning, mathematical\ntechniques, and network science. Our approach focuses on static and dynamic\nanalysis and utilizes the Low-Level Virtual Machine (LLVM) to profile\napplications within a complex network. The generated network topologies are\ninput into the GraphSAGE architecture to efficiently distinguish between benign\nand malicious software applications, with the operation names denoted as node\nfeatures. Importantly, the GraphSAGE models analyze the network's topological\ngeometry to make predictions, enabling them to detect state-of-the-art malware\nand prevent potential damage during execution in a VM. To evaluate our\napproach, we conduct a study on a dataset comprising source code from 24,376\napplications, specifically written in C/C++, sourced directly from\nwidely-recognized malware and various types of benign software. The results\nshow a high detection performance with an Area Under the Receiver Operating\nCharacteristic Curve (AUROC) of 99.85%. Our approach marks a substantial\nimprovement in malware detection, providing a notably more accurate and\nefficient solution when compared to current state-of-the-art malware detection\nmethods.\n","authors":["Chenzhong Yin","Hantang Zhang","Mingxi Cheng","Xiongye Xiao","Xinghe Chen","Xin Ren","Paul Bogdan"],"pdf_url":"https://arxiv.org/pdf/2312.12667v1.pdf","comment":"ICASSP 2024, Accepted"},{"id":"http://arxiv.org/abs/2209.04587v5","updated":"2023-12-19T23:40:27Z","published":"2022-09-10T04:01:23Z","title":"Multipoint-BAX: A New Approach for Efficiently Tuning Particle\n Accelerator Emittance via Virtual Objectives","summary":" Although beam emittance is critical for the performance of high-brightness\naccelerators, optimization is often time limited as emittance calculations,\ncommonly done via quadrupole scans, are typically slow. Such calculations are a\ntype of $\\textit{multipoint query}$, i.e. each query requires multiple\nsecondary measurements. Traditional black-box optimizers such as Bayesian\noptimization are slow and inefficient when dealing with such objectives as they\nmust acquire the full series of measurements, but return only the emittance,\nwith each query. We propose a new information-theoretic algorithm,\nMultipoint-BAX, for black-box optimization on multipoint queries, which queries\nand models individual beam-size measurements using techniques from Bayesian\nAlgorithm Execution (BAX). Our method avoids the slow multipoint query on the\naccelerator by acquiring points through a $\\textit{virtual objective}$, i.e.\ncalculating the emittance objective from a fast learned model rather than\ndirectly from the accelerator. We use Multipoint-BAX to minimize emittance at\nthe Linac Coherent Light Source (LCLS) and the Facility for Advanced\nAccelerator Experimental Tests II (FACET-II). In simulation, our method is\n20$\\times$ faster and more robust to noise compared to existing methods. In\nlive tests, it matched the hand-tuned emittance at FACET-II and achieved a 24%\nlower emittance than hand-tuning at LCLS. Our method represents a conceptual\nshift for optimizing multipoint queries, and we anticipate that it can be\nreadily adapted to similar problems in particle accelerators and other\nscientific instruments.\n","authors":["Sara A. Miskovich","Willie Neiswanger","William Colocho","Claudio Emma","Jacqueline Garrahan","Timothy Maxwell","Christopher Mayes","Stefano Ermon","Auralee Edelen","Daniel Ratner"],"pdf_url":"https://arxiv.org/pdf/2209.04587v5.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2312.12436v1","updated":"2023-12-19T18:59:22Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Haojia Lin","Zihan Wang","Timin Gao","Yongdong Luo","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Xiawu Zheng","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Xing Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2312.12436v1.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2301.11145v2","updated":"2023-12-19T17:09:04Z","published":"2023-01-26T14:52:30Z","title":"Learning from Mistakes: Self-Regularizing Hierarchical Representations\n in Point Cloud Semantic Segmentation","summary":" Recent advances in autonomous robotic technologies have highlighted the\ngrowing need for precise environmental analysis. LiDAR semantic segmentation\nhas gained attention to accomplish fine-grained scene understanding by acting\ndirectly on raw content provided by sensors. Recent solutions showed how\ndifferent learning techniques can be used to improve the performance of the\nmodel, without any architectural or dataset change. Following this trend, we\npresent a coarse-to-fine setup that LEArns from classification mistaKes (LEAK)\nderived from a standard model. First, classes are clustered into macro groups\naccording to mutual prediction errors; then, the learning process is\nregularized by: (1) aligning class-conditional prototypical feature\nrepresentation for both fine and coarse classes, (2) weighting instances with a\nper-class fairness index. Our LEAK approach is very general and can be\nseamlessly applied on top of any segmentation architecture; indeed,\nexperimental results showed that it enables state-of-the-art performances on\ndifferent architectures, datasets and tasks, while ensuring more balanced\nclass-wise results and faster convergence.\n","authors":["Elena Camuffo","Umberto Michieli","Simone Milani"],"pdf_url":"https://arxiv.org/pdf/2301.11145v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10493v2","updated":"2023-12-19T15:55:23Z","published":"2023-12-16T16:14:50Z","title":"Debiasing Multimodal Sarcasm Detection with Contrastive Learning","summary":" Despite commendable achievements made by existing work, prevailing multimodal\nsarcasm detection studies rely more on textual content over visual information.\nIt unavoidably induces spurious correlations between textual words and labels,\nthereby significantly hindering the models' generalization capability. To\naddress this problem, we define the task of out-of-distribution (OOD)\nmultimodal sarcasm detection, which aims to evaluate models' generalizability\nwhen the word distribution is different in training and testing settings.\nMoreover, we propose a novel debiasing multimodal sarcasm detection framework\nwith contrastive learning, which aims to mitigate the harmful effect of biased\ntextual factors for robust OOD generalization. In particular, we first design\ncounterfactual data augmentation to construct the positive samples with\ndissimilar word biases and negative samples with similar word biases.\nSubsequently, we devise an adapted debiasing contrastive learning mechanism to\nempower the model to learn robust task-relevant features and alleviate the\nadverse effect of biased words. Extensive experiments show the superiority of\nthe proposed framework.\n","authors":["Mengzhao Jia","Can Xie","Liqiang Jing"],"pdf_url":"https://arxiv.org/pdf/2312.10493v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12174v1","updated":"2023-12-19T14:05:15Z","published":"2023-12-19T14:05:15Z","title":"Low-Consumption Partial Transcoding by HEVC","summary":" A transcoding scheme for the High Efficiency Video Coding (HEVC) is proposed\nthat allows any partial frame modification to be followed by a partial\nre-compression of only the modified areas, while guaranteeing identical\nreconstruction of non-modified areas. To this end, first, syntax elements of\nall Coding Units (CU) in the frame are parsed and decoded according to their\nscan order. Then CUs that are collocated with a replaced area are re-encoded\nwith new content to generate a partial set of new syntax elements. In order to\navoid spatial propagation of the decoding mismatch due to the new content, CUs\non the border of the replaced area are losslessly coded such that\nreconstruction of immediately neighboring CUs in the scan order are protected\nfrom the modification. The proposed method has been implemented on top of the\nHEVC test Model (HM) in All-Intra (AI) coding configuration and experiments\nshow that, depending on the test parameters, it can offer both a bitrate saving\n(up to 4% in terms of BD-BR) and a transcoding acceleration (up to 83%)\ncompared to a full transcoding scheme.\n","authors":["Mohsen Abdoli","Félix Henry","Gordon Clare"],"pdf_url":"https://arxiv.org/pdf/2312.12174v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12150v1","updated":"2023-12-19T13:33:54Z","published":"2023-12-19T13:33:54Z","title":"Comparative Study of Hardware and Software Power Measurements in Video\n Compression","summary":" The environmental impact of video streaming services has been discussed as\npart of the strategies towards sustainable information and communication\ntechnologies. A first step towards that is the energy profiling and assessment\nof energy consumption of existing video technologies. This paper presents a\ncomprehensive study of power measurement techniques in video compression,\ncomparing the use of hardware and software power meters. An experimental\nmethodology to ensure reliability of measurements is introduced. Key findings\ndemonstrate the high correlation of hardware and software based energy\nmeasurements for two video codecs across different spatial and temporal\nresolutions at a lower computational overhead.\n","authors":["Angeliki Katsenou","Xinyi Wang","Daniel Schien","David Bull"],"pdf_url":"https://arxiv.org/pdf/2312.12150v1.pdf","comment":"5 pages"},{"id":"http://arxiv.org/abs/2312.06171v2","updated":"2023-12-19T12:36:47Z","published":"2023-12-11T07:20:42Z","title":"Jointly Explicit and Implicit Cross-Modal Interaction Network for\n Anterior Chamber Inflammation Diagnosis","summary":" Uveitis demands the precise diagnosis of anterior chamber inflammation (ACI)\nfor optimal treatment. However, current diagnostic methods only rely on a\nlimited single-modal disease perspective, which leads to poor performance. In\nthis paper, we investigate a promising yet challenging way to fuse multimodal\ndata for ACI diagnosis. Notably, existing fusion paradigms focus on empowering\nimplicit modality interactions (i.e., self-attention and its variants), but\nneglect to inject explicit modality interactions, especially from clinical\nknowledge and imaging property. To this end, we propose a jointly Explicit and\nimplicit Cross-Modal Interaction Network (EiCI-Net) for Anterior Chamber\nInflammation Diagnosis that uses anterior segment optical coherence tomography\n(AS-OCT) images, slit-lamp images, and clinical data jointly. Specifically, we\nfirst develop CNN-Based Encoders and Tabular Processing Module (TPM) to extract\nefficient feature representations in different modalities. Then, we devise an\nExplicit Cross-Modal Interaction Module (ECIM) to generate attention maps as a\nkind of explicit clinical knowledge based on the tabular feature maps, then\nintegrated them into the slit-lamp feature maps, allowing the CNN-Based Encoder\nto focus on more effective informativeness of the slit-lamp images. After that,\nthe Implicit Cross-Modal Interaction Module (ICIM), a transformer-based\nnetwork, further implicitly enhances modality interactions. Finally, we\nconstruct a considerable real-world dataset from our collaborative hospital and\nconduct sufficient experiments to demonstrate the superior performance of our\nproposed EiCI-Net compared with the state-of-the-art classification methods in\nvarious metrics.\n","authors":["Qian Shao","Ye Dai","Haochao Ying","Kan Xu","Jinhong Wang","Wei Chi","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2312.06171v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07594v2","updated":"2023-12-19T03:44:25Z","published":"2023-11-10T09:51:24Z","title":"How to Bridge the Gap between Modalities: A Comprehensive Survey on\n Multimodal Large Language Model","summary":" This review paper explores Multimodal Large Language Models (MLLMs), which\nintegrate Large Language Models (LLMs) like GPT-4 to handle multimodal data\nsuch as text and vision. MLLMs demonstrate capabilities like generating image\nnarratives and answering image-based questions, bridging the gap towards\nreal-world human-computer interactions and hinting at a potential pathway to\nartificial general intelligence. However, MLLMs still face challenges in\nprocessing the semantic gap in multimodality, which may lead to erroneous\ngeneration, posing potential risks to society. Choosing the appropriate\nmodality alignment method is crucial, as improper methods might require more\nparameters with limited performance improvement. This paper aims to explore\nmodality alignment methods for LLMs and their existing capabilities.\nImplementing modality alignment allows LLMs to address environmental issues and\nenhance accessibility. The study surveys existing modal alignment methods in\nMLLMs into four groups: (1) Multimodal Converters that change data into\nsomething LLMs can understand; (2) Multimodal Perceivers to improve how LLMs\nperceive different types of data; (3) Tools Assistance for changing data into\none common format, usually text; and (4) Data-Driven methods that teach LLMs to\nunderstand specific types of data in a dataset. This field is still in a phase\nof exploration and experimentation, and we will organize and update various\nexisting research methods for multimodal information alignment.\n","authors":["Shezheng Song","Xiaopeng Li","Shasha Li","Shan Zhao","Jie Yu","Jun Ma","Xiaoguang Mao","Weimin Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.07594v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11793v1","updated":"2023-12-19T02:09:38Z","published":"2023-12-19T02:09:38Z","title":"An effective image copy-move forgery detection using entropy image","summary":" Image forensics has become increasingly important in our daily lives. As a\nfundamental type of forgeries, Copy-Move Forgery Detection (CMFD) has received\nsignificant attention in the academic community. Keypoint-based algorithms,\nparticularly those based on SIFT, have achieved good results in CMFD. However,\nthe most of keypoint detection algorithms often fail to generate sufficient\nmatches when tampered patches are present in smooth areas. To tackle this\nproblem, we introduce entropy images to determine the coordinates and scales of\nkeypoints, resulting significantly increasing the number of keypoints.\nFurthermore, we develop an entropy level clustering algorithm to avoid\nincreased matching complexity caused by non-ideal distribution of grayscale\nvalues in keypoints. Experimental results demonstrate that our algorithm\nachieves a good balance between performance and time efficiency.\n","authors":["Zhaowei Lu","Li Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.11793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06958v2","updated":"2023-12-19T20:34:57Z","published":"2023-10-10T19:21:41Z","title":"Comparing the robustness of modern no-reference image- and video-quality\n metrics to adversarial attacks","summary":" Nowadays neural-network-based image- and video-quality metrics show better\nperformance compared to traditional methods. However, they also became more\nvulnerable to adversarial attacks that increase metrics' scores without\nimproving visual quality. The existing benchmarks of quality metrics compare\ntheir performance in terms of correlation with subjective quality and\ncalculation time. However, the adversarial robustness of image-quality metrics\nis also an area worth researching. In this paper, we analyse modern metrics'\nrobustness to different adversarial attacks. We adopted adversarial attacks\nfrom computer vision tasks and compared attacks' efficiency against 15\nno-reference image/video-quality metrics. Some metrics showed high resistance\nto adversarial attacks which makes their usage in benchmarks safer than\nvulnerable metrics. The benchmark accepts new metrics submissions for\nresearchers who want to make their metrics more robust to attacks or to find\nsuch metrics for their needs. Try our benchmark using pip install\nrobustness-benchmark.\n","authors":["Anastasia Antsiferova","Khaled Abud","Aleksandr Gushchin","Ekaterina Shumitskaya","Sergey Lavrushkin","Dmitriy Vatolin"],"pdf_url":"https://arxiv.org/pdf/2310.06958v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12490v1","updated":"2023-12-19T17:55:16Z","published":"2023-12-19T17:55:16Z","title":"InstructVideo: Instructing Video Diffusion Models with Human Feedback","summary":" Diffusion models have emerged as the de facto paradigm for video generation.\nHowever, their reliance on web-scale data of varied quality often yields\nresults that are visually unappealing and misaligned with the textual prompts.\nTo tackle this problem, we propose InstructVideo to instruct text-to-video\ndiffusion models with human feedback by reward fine-tuning. InstructVideo has\ntwo key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by\ngenerating through the full DDIM sampling chain, we recast reward fine-tuning\nas editing. By leveraging the diffusion process to corrupt a sampled video,\nInstructVideo requires only partial inference of the DDIM sampling chain,\nreducing fine-tuning cost while improving fine-tuning efficiency. 2) To\nmitigate the absence of a dedicated video reward model for human preferences,\nwe repurpose established image reward models, e.g., HPSv2. To this end, we\npropose Segmental Video Reward, a mechanism to provide reward signals based on\nsegmental sparse sampling, and Temporally Attenuated Reward, a method that\nmitigates temporal modeling degradation during fine-tuning. Extensive\nexperiments, both qualitative and quantitative, validate the practicality and\nefficacy of using image reward models in InstructVideo, significantly enhancing\nthe visual quality of generated videos without compromising generalization\ncapabilities. Code and models will be made publicly available.\n","authors":["Hangjie Yuan","Shiwei Zhang","Xiang Wang","Yujie Wei","Tao Feng","Yining Pan","Yingya Zhang","Ziwei Liu","Samuel Albanie","Dong Ni"],"pdf_url":"https://arxiv.org/pdf/2312.12490v1.pdf","comment":"Project page: https://instructvideo.github.io/"}]},"2023-12-20T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2312.13264v1","updated":"2023-12-20T18:41:44Z","published":"2023-12-20T18:41:44Z","title":"dIR -- Discrete Information Retrieval: Conversational Search over\n Unstructured (and Structured) Data with Large Language Models","summary":" Data is stored in both structured and unstructured form. Querying both, to\npower natural language conversations, is a challenge. This paper introduces\ndIR, Discrete Information Retrieval, providing a unified interface to query\nboth free text and structured knowledge. Specifically, a Large Language Model\n(LLM) transforms text into expressive representation. After the text is\nextracted into columnar form, it can then be queried via a text-to-SQL Semantic\nParser, with an LLM converting natural language into SQL. Where desired, such\nconversation may be effected by a multi-step reasoning conversational agent. We\nvalidate our approach via a proprietary question/answer data set, concluding\nthat dIR makes a whole new class of queries on free text possible when compared\nto traditionally fine-tuned dense-embedding-model-based Information Retrieval\n(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR\ncan succeed where no other method stands a chance.\n","authors":["Pablo M. Rodriguez Bertorello","Jean Rodmond Junior Laguerre"],"pdf_url":"https://arxiv.org/pdf/2312.13264v1.pdf","comment":"8 pages, 5 figures, Association for Computational Linguistics"},{"id":"http://arxiv.org/abs/2312.12037v2","updated":"2023-12-20T17:42:18Z","published":"2023-12-19T10:46:13Z","title":"Founder-GPT: Self-play to evaluate the Founder-Idea fit","summary":" This research introduces an innovative evaluation method for the\n\"founder-idea\" fit in early-stage startups, utilizing advanced large language\nmodel techniques to assess founders' profiles against their startup ideas to\nenhance decision-making. Embeddings, self-play, tree-of-thought, and\ncritique-based refinement techniques show early promising results that each\nidea's success patterns are unique and they should be evaluated based on the\ncontext of the founder's background.\n","authors":["Sichao Xiong","Yigit Ihlamur"],"pdf_url":"https://arxiv.org/pdf/2312.12037v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.03327v7","updated":"2023-12-20T17:41:51Z","published":"2022-01-10T13:04:39Z","title":"Latency Adjustable Transformer Encoder for Language Understanding","summary":" Adjusting the latency, power, and accuracy of natural language understanding\nmodels is a desirable objective of an efficient architecture. This paper\nproposes an efficient Transformer architecture that adjusts the inference\ncomputational cost adaptively with a desired inference latency speedup. In\nfine-tuning phase, the proposed method detects less important hidden sequence\nelements (word-vectors) and eliminates them in each encoder layer using a\nproposed Attention Context Contribution (ACC) metric. After the fine-tuning\nphase, with the novel offline-tuning property, the inference latency of the\nmodel can be adjusted in a wide range of inference speedup selections without\nany further training. The proposed method is applied to the BERT-base and GPT-2\nmodels for evaluation. Extensive experiments show that most of the word-vectors\nin higher Transformer layers have less contribution to the subsequent layers;\nhence, they can be eliminated to improve the inference latency. Experimental\nresults on extensive sentiment analysis, classification, text generation tasks\nand regression benchmarks like GLUE showed that the method is effective in\nvarious datasets with minimal impact on global context. The proposed method\nmathematically and experimentally improves the inference latency of BERT-base\nand GPT-2 by up to 4.8 and 3.72 times with less than 0.75% accuracy drop and\npassable perplexity on average. The suggested approach posits that in Large\nLanguage Models (LLMs), although the complete network is necessary for\ntraining, it can be truncated during the fine-tuning phase.\n","authors":["Sajjad Kachuee","Mohammad Sharifkhani"],"pdf_url":"https://arxiv.org/pdf/2201.03327v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13219v1","updated":"2023-12-20T17:38:04Z","published":"2023-12-20T17:38:04Z","title":"Interactive Visual Task Learning for Robots","summary":" We present a framework for robots to learn novel visual concepts and tasks\nvia in-situ linguistic interactions with human users. Previous approaches have\neither used large pre-trained visual models to infer novel objects zero-shot,\nor added novel concepts along with their attributes and representations to a\nconcept hierarchy. We extend the approaches that focus on learning visual\nconcept hierarchies by enabling them to learn novel concepts and solve unseen\nrobotics tasks with them. To enable a visual concept learner to solve robotics\ntasks one-shot, we developed two distinct techniques. Firstly, we propose a\nnovel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which\naugments information of a novel concept to its parent nodes within a concept\nhierarchy. This information propagation allows all concepts in a hierarchy to\nupdate as novel concepts are taught in a continual learning setting. Secondly,\nwe represent a visual task as a scene graph with language annotations, allowing\nus to create novel permutations of a demonstrated task zero-shot in-situ. We\npresent two sets of results. Firstly, we compare Hi-Viscont with the baseline\nmodel (FALCON) on visual question answering(VQA) in three domains. While being\ncomparable to the baseline model on leaf level concepts, Hi-Viscont achieves an\nimprovement of over 9% on non-leaf concepts on average. We compare our model's\nperformance against the baseline FALCON model. Our framework achieves 33%\nimprovements in success rate metric, and 19% improvements in the object level\naccuracy compared to the baseline model. With both of these results we\ndemonstrate the ability of our model to learn tasks and concepts in a continual\nlearning setting on the robot.\n","authors":["Weiwei Gu","Anant Sah","Nakul Gopalan"],"pdf_url":"https://arxiv.org/pdf/2312.13219v1.pdf","comment":"In Proceedings of The 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2312.13211v1","updated":"2023-12-20T17:27:25Z","published":"2023-12-20T17:27:25Z","title":"DSFormer: Effective Compression of Text-Transformers by Dense-Sparse\n Weight Factorization","summary":" With the tremendous success of large transformer models in natural language\nunderstanding, down-sizing them for cost-effective deployments has become\ncritical. Recent studies have explored the low-rank weight factorization\ntechniques which are efficient to train, and apply out-of-the-box to any\ntransformer architecture. Unfortunately, the low-rank assumption tends to be\nover-restrictive and hinders the expressiveness of the compressed model. This\npaper proposes, DSFormer, a simple alternative factorization scheme which\nexpresses a target weight matrix as the product of a small dense and a\nsemi-structured sparse matrix. The resulting approximation is more faithful to\nthe weight distribution in transformers and therefore achieves a stronger\nefficiency-accuracy trade-off. Another concern with existing factorizers is\ntheir dependence on a task-unaware initialization step which degrades the\naccuracy of the resulting model. DSFormer addresses this issue through a novel\nStraight-Through Factorizer (STF) algorithm that jointly learns all the weight\nfactorizations to directly maximize the final task accuracy. Extensive\nexperiments on multiple natural language understanding benchmarks demonstrate\nthat DSFormer obtains up to 40% better compression than the state-of-the-art\nlow-rank factorizers, leading semi-structured sparsity baselines and popular\nknowledge distillation approaches. Our approach is also orthogonal to\nmainstream compressors and offers up to 50% additional compression when added\nto popular distilled, layer-shared and quantized transformers. We empirically\nevaluate the benefits of STF over conventional optimization practices.\n","authors":["Rahul Chand","Yashoteja Prabhu","Pratyush Kumar"],"pdf_url":"https://arxiv.org/pdf/2312.13211v1.pdf","comment":"9 page main paper. 1 page appendix"},{"id":"http://arxiv.org/abs/2312.13208v1","updated":"2023-12-20T17:25:23Z","published":"2023-12-20T17:25:23Z","title":"LlaMaVAE: Guiding Large Language Model Generation via Continuous Latent\n Sentence Spaces","summary":" Deep generative neural networks, such as Variational AutoEncoders (VAEs),\noffer an opportunity to better understand and control language models from the\nperspective of sentence-level latent spaces. To combine the controllability of\nVAE latent spaces with the state-of-the-art performance of recent large\nlanguage models (LLMs), we present in this work LlaMaVAE, which combines\nexpressive encoder and decoder models (sentenceT5 and LlaMA) with a VAE\narchitecture, aiming to provide better text generation control to LLMs. In\naddition, to conditionally guide the VAE generation, we investigate a new\napproach based on flow-based invertible neural networks (INNs) named Invertible\nCVAE. Experimental results reveal that LlaMaVAE can outperform the previous\nstate-of-the-art VAE language model, Optimus, across various tasks, including\nlanguage modelling, semantic textual similarity and definition modelling.\nQualitative analysis on interpolation and traversal experiments also indicates\nan increased degree of semantic clustering and geometric consistency, which\nenables better generation control.\n","authors":["Yingji Zhang","Danilo S. Carvalho","Ian Pratt-Hartmann","André Freitas"],"pdf_url":"https://arxiv.org/pdf/2312.13208v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03087v2","updated":"2023-12-20T17:24:33Z","published":"2022-10-06T17:46:00Z","title":"Iterative Vision-and-Language Navigation","summary":" We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for\nevaluating language-guided agents navigating in a persistent environment over\ntime. Existing Vision-and-Language Navigation (VLN) benchmarks erase the\nagent's memory at the beginning of every episode, testing the ability to\nperform cold-start navigation with no prior information. However, deployed\nrobots occupy the same environment for long periods of time. The IVLN paradigm\naddresses this disparity by training and evaluating VLN agents that maintain\nmemory across tours of scenes that consist of up to 100 ordered\ninstruction-following Room-to-Room (R2R) episodes, each defined by an\nindividual language instruction and a target path. We present discrete and\ncontinuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours\neach in 80 indoor scenes. We find that extending the implicit memory of\nhigh-performing transformer VLN agents is not sufficient for IVLN, but agents\nthat build maps can benefit from environment persistence, motivating a renewed\nfocus on map-building agents in VLN.\n","authors":["Jacob Krantz","Shurjo Banerjee","Wang Zhu","Jason Corso","Peter Anderson","Stefan Lee","Jesse Thomason"],"pdf_url":"https://arxiv.org/pdf/2210.03087v2.pdf","comment":"Accepted by CVPR 2023"},{"id":"http://arxiv.org/abs/2305.16307v3","updated":"2023-12-20T17:08:28Z","published":"2023-05-25T17:57:43Z","title":"IndicTrans2: Towards High-Quality and Accessible Machine Translation\n Models for all 22 Scheduled Indian Languages","summary":" India has a rich linguistic landscape with languages from 4 major language\nfamilies spoken by over a billion people. 22 of these languages are listed in\nthe Constitution of India (referred to as scheduled languages) are the focus of\nthis work. Given the linguistic diversity, high-quality and accessible Machine\nTranslation (MT) systems are essential in a country like India. Prior to this\nwork, there was (i) no parallel training data spanning all 22 languages, (ii)\nno robust benchmarks covering all these languages and containing content\nrelevant to India, and (iii) no existing translation models which support all\nthe 22 scheduled languages of India. In this work, we aim to address this gap\nby focusing on the missing pieces required for enabling wide, easy, and open\naccess to good machine translation systems for all 22 scheduled Indian\nlanguages. We identify four key areas of improvement: curating and creating\nlarger training datasets, creating diverse and high-quality benchmarks,\ntraining multilingual models, and releasing models with open access. Our first\ncontribution is the release of the Bharat Parallel Corpus Collection (BPCC),\nthe largest publicly available parallel corpora for Indic languages. BPCC\ncontains a total of 230M bitext pairs, of which a total of 126M were newly\nadded, including 644K manually translated sentence pairs created as part of\nthis work. Our second contribution is the release of the first n-way parallel\nbenchmark covering all 22 Indian languages, featuring diverse domains,\nIndian-origin content, and source-original test sets. Next, we present\nIndicTrans2, the first model to support all 22 languages, surpassing existing\nmodels on multiple existing and new benchmarks created as a part of this work.\nLastly, to promote accessibility and collaboration, we release our models and\nassociated data with permissive licenses at\nhttps://github.com/AI4Bharat/IndicTrans2.\n","authors":["Jay Gala","Pranjal A. Chitale","Raghavan AK","Varun Gumma","Sumanth Doddapaneni","Aswanth Kumar","Janki Nawale","Anupama Sujatha","Ratish Puduppully","Vivek Raghavan","Pratyush Kumar","Mitesh M. Khapra","Raj Dabre","Anoop Kunchukuttan"],"pdf_url":"https://arxiv.org/pdf/2305.16307v3.pdf","comment":"Accepted at TMLR"},{"id":"http://arxiv.org/abs/2312.13193v1","updated":"2023-12-20T17:05:46Z","published":"2023-12-20T17:05:46Z","title":"HCDIR: End-to-end Hate Context Detection, and Intensity Reduction model\n for online comments","summary":" Warning: This paper contains examples of the language that some people may\nfind offensive.\n Detecting and reducing hateful, abusive, offensive comments is a critical and\nchallenging task on social media. Moreover, few studies aim to mitigate the\nintensity of hate speech. While studies have shown that context-level semantics\nare crucial for detecting hateful comments, most of this research focuses on\nEnglish due to the ample datasets available. In contrast, low-resource\nlanguages, like Indian languages, remain under-researched because of limited\ndatasets. Contrary to hate speech detection, hate intensity reduction remains\nunexplored in high-resource and low-resource languages. In this paper, we\npropose a novel end-to-end model, HCDIR, for Hate Context Detection, and Hate\nIntensity Reduction in social media posts. First, we fine-tuned several\npre-trained language models to detect hateful comments to ascertain the\nbest-performing hateful comments detection model. Then, we identified the\ncontextual hateful words. Identification of such hateful words is justified\nthrough the state-of-the-art explainable learning model, i.e., Integrated\nGradient (IG). Lastly, the Masked Language Modeling (MLM) model has been\nemployed to capture domain-specific nuances to reduce hate intensity. We masked\nthe 50\\% hateful words of the comments identified as hateful and predicted the\nalternative words for these masked terms to generate convincing sentences. An\noptimal replacement for the original hate comments from the feasible sentences\nis preferred. Extensive experiments have been conducted on several recent\ndatasets using automatic metric-based evaluation (BERTScore) and thorough human\nevaluation. To enhance the faithfulness in human evaluation, we arranged a\ngroup of three human annotators with varied expertise.\n","authors":["Neeraj Kumar Singh","Koyel Ghosh","Joy Mahapatra","Utpal Garain","Apurbalal Senapati"],"pdf_url":"https://arxiv.org/pdf/2312.13193v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11517v2","updated":"2023-12-20T16:43:54Z","published":"2023-12-12T19:34:23Z","title":"Unlocking Musculoskeletal Disorder Risk Factors: NLP-Based\n Classification and Mode-Based Ranking","summary":" This research delves into the intricate landscape of Musculoskeletal Disorder\n(MSD) risk factors, employing a novel fusion of Natural Language Processing\n(NLP) techniques and mode-based ranking methodologies. The primary objective is\nto advance the comprehension of MSD risk factors, their classification, and\ntheir relative severity, facilitating more targeted preventive and management\ninterventions. The study utilizes eight diverse models, integrating pre-trained\ntransformers, cosine similarity, and various distance metrics to classify risk\nfactors into personal, biomechanical, workplace, psychological, and\norganizational classes. Key findings reveal that the BERT model with cosine\nsimilarity attains an overall accuracy of 28%, while the sentence transformer,\ncoupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a\nflawless accuracy score of 100%. In tandem with the classification efforts, the\nresearch employs a mode-based ranking approach on survey data to discern the\nseverity hierarchy of MSD risk factors. Intriguingly, the rankings align\nprecisely with the previous literature, reaffirming the consistency and\nreliability of the approach. ``Working posture\" emerges as the most severe risk\nfactor, emphasizing the critical role of proper posture in preventing MSDs. The\ncollective perceptions of survey participants underscore the significance of\nfactors like \"Job insecurity,\" \"Effort reward imbalance,\" and \"Poor employee\nfacility\" in contributing to MSD risks. The convergence of rankings provides\nactionable insights for organizations aiming to reduce the prevalence of MSDs.\nThe study concludes with implications for targeted interventions,\nrecommendations for improving workplace conditions, and avenues for future\nresearch.\n","authors":["Md Abrar Jahin","Subrata Talapatra"],"pdf_url":"https://arxiv.org/pdf/2312.11517v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13179v1","updated":"2023-12-20T16:40:33Z","published":"2023-12-20T16:40:33Z","title":"Contextual Code Switching for Machine Translation using Language Models","summary":" Large language models (LLMs) have exerted a considerable impact on diverse\nlanguage-related tasks in recent years. Their demonstrated state-of-the-art\nperformance is achieved through methodologies such as zero-shot or few-shot\nprompting. These models undergo training on extensive datasets that encompass\nsegments of the Internet and subsequently undergo fine-tuning tailored to\nspecific tasks. Notably, they exhibit proficiency in tasks such as translation,\nsummarization, question answering, and creative writing, even in the absence of\nexplicit training for those particular tasks. While they have shown substantial\nimprovement in the multilingual tasks their performance in the code switching,\nespecially for machine translation remains relatively uncharted. In this paper,\nwe present an extensive study on the code switching task specifically for the\nmachine translation task comparing multiple LLMs. Our results indicate that\ndespite the LLMs having promising results in the certain tasks, the models with\nrelatively lesser complexity outperform the multilingual large language models\nin the machine translation task. We posit that the efficacy of multilingual\nlarge language models in contextual code switching is constrained by their\ntraining methodologies. In contrast, relatively smaller models, when trained\nand fine-tuned on bespoke datasets, may yield superior results in comparison to\nthe majority of multilingual models.\n","authors":["Arshad Kaji","Manan Shah"],"pdf_url":"https://arxiv.org/pdf/2312.13179v1.pdf","comment":"4 pages, 1 figure, 2 tables"},{"id":"http://arxiv.org/abs/2311.12420v2","updated":"2023-12-20T15:48:15Z","published":"2023-11-21T08:20:39Z","title":"How Far Have We Gone in Vulnerability Detection Using Large Language\n Models","summary":" As software becomes increasingly complex and prone to vulnerabilities,\nautomated vulnerability detection is critically important, yet challenging.\nGiven the significant successes of large language models (LLMs) in various\ntasks, there is growing anticipation of their efficacy in vulnerability\ndetection. However, a quantitative understanding of their potential in\nvulnerability detection is still missing. To bridge this gap, we introduce a\ncomprehensive vulnerability benchmark VulBench. This benchmark aggregates\nhigh-quality data from a wide range of CTF (Capture-the-Flag) challenges and\nreal-world applications, with annotations for each vulnerable function\ndetailing the vulnerability type and its root cause. Through our experiments\nencompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models\nand static analyzers, we find that several LLMs outperform traditional deep\nlearning approaches in vulnerability detection, revealing an untapped potential\nin LLMs. This work contributes to the understanding and utilization of LLMs for\nenhanced software security.\n","authors":["Zeyu Gao","Hao Wang","Yuchen Zhou","Wenyu Zhu","Chao Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.12420v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13119v1","updated":"2023-12-20T15:38:59Z","published":"2023-12-20T15:38:59Z","title":"Prometheus: Infrastructure Security Posture Analysis with AI-generated\n Attack Graphs","summary":" The rampant occurrence of cybersecurity breaches imposes substantial\nlimitations on the progress of network infrastructures, leading to compromised\ndata, financial losses, potential harm to individuals, and disruptions in\nessential services. The current security landscape demands the urgent\ndevelopment of a holistic security assessment solution that encompasses\nvulnerability analysis and investigates the potential exploitation of these\nvulnerabilities as attack paths. In this paper, we propose Prometheus, an\nadvanced system designed to provide a detailed analysis of the security posture\nof computing infrastructures. Using user-provided information, such as device\ndetails and software versions, Prometheus performs a comprehensive security\nassessment. This assessment includes identifying associated vulnerabilities and\nconstructing potential attack graphs that adversaries can exploit. Furthermore,\nPrometheus evaluates the exploitability of these attack paths and quantifies\nthe overall security posture through a scoring mechanism. The system takes a\nholistic approach by analyzing security layers encompassing hardware, system,\nnetwork, and cryptography. Furthermore, Prometheus delves into the\ninterconnections between these layers, exploring how vulnerabilities in one\nlayer can be leveraged to exploit vulnerabilities in others. In this paper, we\npresent the end-to-end pipeline implemented in Prometheus, showcasing the\nsystematic approach adopted for conducting this thorough security analysis.\n","authors":["Xin Jin","Charalampos Katsis","Fan Sang","Jiahao Sun","Elisa Bertino","Ramana Rao Kompella","Ashish Kundu"],"pdf_url":"https://arxiv.org/pdf/2312.13119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13103v1","updated":"2023-12-20T15:20:33Z","published":"2023-12-20T15:20:33Z","title":"Exploring Multimodal Large Language Models for Radiology Report\n Error-checking","summary":" This paper proposes one of the first clinical applications of multimodal\nlarge language models (LLMs) as an assistant for radiologists to check errors\nin their reports. We created an evaluation dataset from two real-world\nradiology datasets (MIMIC-CXR and IU-Xray), with 1,000 subsampled reports each.\nA subset of original reports was modified to contain synthetic errors by\nintroducing various type of mistakes. The evaluation contained two difficulty\nlevels: SIMPLE for binary error-checking and COMPLEX for identifying error\ntypes. LLaVA (Large Language and Visual Assistant) variant models, including\nour instruction-tuned model, were used for the evaluation. Additionally, a\ndomain expert evaluation was conducted on a small test set. At the SIMPLE\nlevel, the LLaVA v1.5 model outperformed other publicly available models.\nInstruction tuning significantly enhanced performance by 47.4% and 25.4% on\nMIMIC-CXR and IU-Xray data, respectively. The model also surpassed the domain\nexperts accuracy in the MIMIC-CXR dataset by 1.67%. Notably, among the subsets\n(N=21) of the test set where a clinician did not achieve the correct\nconclusion, the LLaVA ensemble mode correctly identified 71.4% of these cases.\nThis study marks a promising step toward utilizing multi-modal LLMs to enhance\ndiagnostic accuracy in radiology. The ensemble model demonstrated comparable\nperformance to clinicians, even capturing errors overlooked by humans.\nNevertheless, future work is needed to improve the model ability to identify\nthe types of inconsistency.\n","authors":["Jinge Wu","Yunsoo Kim","Eva C. Keller","Jamie Chow","Adam P. Levine","Nikolas Pontikos","Zina Ibrahim","Paul Taylor","Michelle C. Williams","Honghan Wu"],"pdf_url":"https://arxiv.org/pdf/2312.13103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13096v1","updated":"2023-12-20T15:17:03Z","published":"2023-12-20T15:17:03Z","title":"In Generative AI we Trust: Can Chatbots Effectively Verify Political\n Information?","summary":" This article presents a comparative analysis of the ability of two large\nlanguage model (LLM)-based chatbots, ChatGPT and Bing Chat, recently rebranded\nto Microsoft Copilot, to detect veracity of political information. We use AI\nauditing methodology to investigate how chatbots evaluate true, false, and\nborderline statements on five topics: COVID-19, Russian aggression against\nUkraine, the Holocaust, climate change, and LGBTQ+ related debates. We compare\nhow the chatbots perform in high- and low-resource languages by using prompts\nin English, Russian, and Ukrainian. Furthermore, we explore the ability of\nchatbots to evaluate statements according to political communication concepts\nof disinformation, misinformation, and conspiracy theory, using\ndefinition-oriented prompts. We also systematically test how such evaluations\nare influenced by source bias which we model by attributing specific claims to\nvarious political and social actors. The results show high performance of\nChatGPT for the baseline veracity evaluation task, with 72 percent of the cases\nevaluated correctly on average across languages without pre-training. Bing Chat\nperformed worse with a 67 percent accuracy. We observe significant disparities\nin how chatbots evaluate prompts in high- and low-resource languages and how\nthey adapt their evaluations to political communication concepts with ChatGPT\nproviding more nuanced outputs than Bing Chat. Finally, we find that for some\nveracity detection-related tasks, the performance of chatbots varied depending\non the topic of the statement or the source to which it is attributed. These\nfindings highlight the potential of LLM-based chatbots in tackling different\nforms of false information in online environments, but also points to the\nsubstantial variation in terms of how such potential is realized due to\nspecific factors, such as language of the prompt or the topic.\n","authors":["Elizaveta Kuznetsova","Mykola Makhortykh","Victoria Vziatysheva","Martha Stolze","Ani Baghumyan","Aleksandra Urman"],"pdf_url":"https://arxiv.org/pdf/2312.13096v1.pdf","comment":"22 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.06022v2","updated":"2023-12-20T15:07:59Z","published":"2023-12-10T22:30:03Z","title":"Exploiting Representation Bias for Data Distillation in Abstractive Text\n Summarization","summary":" Abstractive text summarization is surging with the number of training samples\nto cater to the needs of the deep learning models. These models tend to exploit\nthe training data representations to attain superior performance by improving\nthe quantitative element of the resultant summary. However, increasing the size\nof the training set may not always be the ideal solution to maximize the\nperformance, and therefore, a need to revisit the quality of training samples\nand the learning protocol of deep learning models is a must. In this paper, we\naim to discretize the vector space of the abstractive text summarization models\nto understand the characteristics learned between the input embedding space and\nthe models' encoder space. We show that deep models fail to capture the\ndiversity of the input space. Further, the distribution of data points on the\nencoder space indicates that an unchecked increase in the training samples does\nnot add value; rather, a tear-down of data samples is highly needed to make the\nmodels focus on variability and faithfulness. We employ clustering techniques\nto learn the diversity of a model's sample space and how data points are mapped\nfrom the embedding space to the encoder space and vice versa. Further, we\ndevise a metric to filter out redundant data points to make the model more\nrobust and less data hungry. We benchmark our proposed method using\nquantitative metrics, such as Rouge, and qualitative metrics, such as\nBERTScore, FEQA and Pyramid score. We also quantify the reasons that inhibit\nthe models from learning the diversity from the varied input samples.\n","authors":["Yash Kumar Atri","Vikram Goyal","Tanmoy Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2312.06022v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01039v2","updated":"2023-12-20T15:00:43Z","published":"2022-12-02T09:11:32Z","title":"SoftCorrect: Error Correction with Soft Detection for Automatic Speech\n Recognition","summary":" Error correction in automatic speech recognition (ASR) aims to correct those\nincorrect words in sentences generated by ASR models. Since recent ASR models\nusually have low word error rate (WER), to avoid affecting originally correct\ntokens, error correction models should only modify incorrect words, and\ntherefore detecting incorrect words is important for error correction. Previous\nworks on error correction either implicitly detect error words through\ntarget-source attention or CTC (connectionist temporal classification) loss, or\nexplicitly locate specific deletion/substitution/insertion errors. However,\nimplicit error detection does not provide clear signal about which tokens are\nincorrect and explicit error detection suffers from low detection accuracy. In\nthis paper, we propose SoftCorrect with a soft error detection mechanism to\navoid the limitations of both explicit and implicit error detection.\nSpecifically, we first detect whether a token is correct or not through a\nprobability produced by a dedicatedly designed language model, and then design\na constrained CTC loss that only duplicates the detected incorrect tokens to\nlet the decoder focus on the correction of error tokens. Compared with implicit\nerror detection with CTC loss, SoftCorrect provides explicit signal about which\nwords are incorrect and thus does not need to duplicate every token but only\nincorrect tokens; compared with explicit error detection, SoftCorrect does not\ndetect specific deletion/substitution/insertion errors but just leaves it to\nCTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that\nSoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming\nprevious works by a large margin, while still enjoying fast speed of parallel\ngeneration.\n","authors":["Yichong Leng","Xu Tan","Wenjie Liu","Kaitao Song","Rui Wang","Xiang-Yang Li","Tao Qin","Edward Lin","Tie-Yan Liu"],"pdf_url":"https://arxiv.org/pdf/2212.01039v2.pdf","comment":"AAAI 2023"},{"id":"http://arxiv.org/abs/2312.11193v3","updated":"2023-12-20T14:57:11Z","published":"2023-12-18T13:40:16Z","title":"\"Paraphrasing The Original Text\" Makes High Accuracy Long-Context QA","summary":" Although LLMs continue to iterate and improve, most open-source models still\nhave a context window of no more than 4k, limiting their ability to handle\nlong-context problems. Most existing open-source models for long-context chat\nstill lack satisfactory accuracy. To address this issue, I approach it from the\nperspective of training data and theoretically prove that training the\ncapability to handle long contexts requires \"effective\" rather than \"long\"\ndata. Based on this, I propose using the \"original text paraphrase\" task, and\nsuccessfully extend the context window of the existing model to 32k by a\nlow-cost and effective method, achieving extremely high accuracy in\nmulti-document-QA and surpassing all existing open-source models of the same\nscale. The model and training data have been open-sourced on\nHuggingFace(https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k) and\nWiseModel(https://wisemodel.cn/models/yuyijiong/Qwen-14b-chat-yarn-32k).\n","authors":["Yijiong Yu"],"pdf_url":"https://arxiv.org/pdf/2312.11193v3.pdf","comment":"Chinese version of this paper can be downloaded from\n (https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)"},{"id":"http://arxiv.org/abs/2312.13040v1","updated":"2023-12-20T14:08:58Z","published":"2023-12-20T14:08:58Z","title":"Retrieval-augmented Multilingual Knowledge Editing","summary":" Knowledge represented in Large Language Models (LLMs) is quite often\nincorrect and can also become obsolete over time. Updating knowledge via\nfine-tuning is computationally resource-hungry and not reliable, and so\nknowledge editing (KE) has developed as an effective and economical alternative\nto inject new knowledge or to fix factual errors in LLMs. Although there has\nbeen considerable interest in this area, current KE research exclusively\nfocuses on the monolingual setting, typically in English. However, what happens\nif the new knowledge is supplied in one language, but we would like to query\nthe LLM in a different language? To address the problem of multilingual\nknowledge editing, we propose Retrieval-augmented Multilingual Knowledge Editor\n(ReMaKE) to update new knowledge in LLMs. ReMaKE can perform model-agnostic\nknowledge editing in multilingual settings. ReMaKE concatenates the new\nknowledge retrieved from a multilingual knowledge base with prompts. Our\nexperimental results show that ReMaKE outperforms baseline knowledge editing\nmethods by a significant margin and is the first KE method to work in a\nmultilingual setting. We provide our multilingual knowledge editing dataset\n(MzsRE) in 12 languages, which along with code, and additional project\ninformation is available at https://github.com/Vicky-Wil/ReMaKE.\n","authors":["Weixuan Wang","Barry Haddow","Alexandra Birch"],"pdf_url":"https://arxiv.org/pdf/2312.13040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13026v1","updated":"2023-12-20T13:50:05Z","published":"2023-12-20T13:50:05Z","title":"FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous\n Self-Supervised Learning","summary":" Continued pre-training (CP) offers multiple advantages, like target domain\nadaptation and the potential to exploit the continuous stream of unlabeled data\navailable online. However, continued pre-training on out-of-domain\ndistributions often leads to catastrophic forgetting of previously acquired\nknowledge, leading to sub-optimal ASR performance. This paper presents FusDom,\na simple and novel methodology for SSL-based continued pre-training. FusDom\nlearns speech representations that are robust and adaptive yet not forgetful of\nconcepts seen in the past. Instead of solving the SSL pre-text task on the\noutput representations of a single model, FusDom leverages two identical\npre-trained SSL models, a teacher and a student, with a modified pre-training\nhead to solve the CP SSL pre-text task. This head employs a cross-attention\nmechanism between the representations of both models while only the student\nreceives gradient updates and the teacher does not. Finally, the student is\nfine-tuned for ASR. In practice, FusDom outperforms all our baselines across\nsettings significantly, with WER improvements in the range of 0.2 WER - 7.3 WER\nin the target domain while retaining the performance in the earlier domain.\n","authors":["Ashish Seth","Sreyan Ghosh","S. Umesh","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2312.13026v1.pdf","comment":"Accepted at ICASSP 2024. Code: https://github.com/cs20s030/fusdom"},{"id":"http://arxiv.org/abs/2309.17255v4","updated":"2023-12-20T13:34:31Z","published":"2023-09-29T14:03:34Z","title":"Knowledge Graphs for the Life Sciences: Recent Developments, Challenges\n and Opportunities","summary":" The term life sciences refers to the disciplines that study living organisms\nand life processes, and include chemistry, biology, medicine, and a range of\nother related disciplines. Research efforts in life sciences are heavily\ndata-driven, as they produce and consume vast amounts of scientific data, much\nof which is intrinsically relational and graph-structured.\n The volume of data and the complexity of scientific concepts and relations\nreferred to therein promote the application of advanced knowledge-driven\ntechnologies for managing and interpreting data, with the ultimate aim to\nadvance scientific discovery.\n In this survey and position paper, we discuss recent developments and\nadvances in the use of graph-based technologies in life sciences and set out a\nvision for how these technologies will impact these fields into the future. We\nfocus on three broad topics: the construction and management of Knowledge\nGraphs (KGs), the use of KGs and associated technologies in the discovery of\nnew knowledge, and the use of KGs in artificial intelligence applications to\nsupport explanations (explainable AI). We select a few exemplary use cases for\neach topic, discuss the challenges and open research questions within these\ntopics, and conclude with a perspective and outlook that summarizes the\noverarching challenges and their potential solutions as a guide for future\nresearch.\n","authors":["Jiaoyan Chen","Hang Dong","Janna Hastings","Ernesto Jiménez-Ruiz","Vanessa López","Pierre Monnin","Catia Pesquita","Petr Škoda","Valentina Tamma"],"pdf_url":"https://arxiv.org/pdf/2309.17255v4.pdf","comment":"33 pages, 1 figure, camera-ready version, accepted for Transactions\n on Graph Data and Knowledge (TGDK)"},{"id":"http://arxiv.org/abs/2312.13010v1","updated":"2023-12-20T13:22:41Z","published":"2023-12-20T13:22:41Z","title":"AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and\n Optimisation","summary":" The advancement of natural language processing (NLP) has been significantly\nboosted by the development of transformer-based large language models (LLMs).\nThese models have revolutionized NLP tasks, particularly in code generation,\naiding developers in creating software with enhanced efficiency. Despite their\nadvancements, challenges in balancing code snippet generation with effective\ntest case generation and execution persist. To address these issues, this paper\nintroduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution\ncomprising a multi-agent framework with specialized agents: the programmer\nagent, the test designer agent, and the test executor agent. During the coding\nprocedure, the programmer agent will focus on the code generation and\nrefinement based on the test executor agent's feedback. The test designer agent\nwill generate test cases for the generated code, and the test executor agent\nwill run the code with the test cases and write the feedback to the programmer.\nThis collaborative system ensures robust code generation, surpassing the\nlimitations of single-agent models and traditional methodologies. Our extensive\nexperiments on 9 code generation models and 12 enhancement approaches showcase\nAgentCoder's superior performance over existing code generation models and\nprompt engineering techniques across various benchmarks. For example,\nAgentCoder achieves 77.4% and 89.1% pass@1 in HumanEval-ET and MBPP-ET with\nGPT-3.5, while SOTA baselines obtain only 69.5% and 63.0%.\n","authors":["Dong Huang","Qingwen Bu","Jie M. Zhang","Michael Luck","Heming Cui"],"pdf_url":"https://arxiv.org/pdf/2312.13010v1.pdf","comment":"21 pages, 12 figures"},{"id":"http://arxiv.org/abs/2312.12999v1","updated":"2023-12-20T12:59:31Z","published":"2023-12-20T12:59:31Z","title":"Machine Mindset: An MBTI Exploration of Large Language Models","summary":" We present a novel approach for integrating Myers-Briggs Type Indicator\n(MBTI) personality traits into large language models (LLMs), addressing the\nchallenges of personality consistency in personalized AI. Our method, \"Machine\nMindset,\" involves a two-phase fine-tuning and Direct Preference Optimization\n(DPO) to embed MBTI traits into LLMs. This approach ensures that models\ninternalize these traits, offering a stable and consistent personality profile.\nWe demonstrate the effectiveness of our models across various domains, showing\nalignment between model performance and their respective MBTI traits. The paper\nhighlights significant contributions in the development of personality datasets\nand a new training methodology for personality integration in LLMs, enhancing\nthe potential for personalized AI applications. We also open-sourced our model\nand part of the data at \\url{https://github.com/PKU-YuanGroup/Machine-Mindset}.\n","authors":["Jiaxi Cui","Liuzhenghao Lv","Jing Wen","Jing Tang","YongHong Tian","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.12999v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11662v3","updated":"2023-12-20T12:57:34Z","published":"2023-05-19T13:23:51Z","title":"Separating form and meaning: Using self-consistency to quantify task\n understanding across multiple senses","summary":" At the staggering pace with which the capabilities of large language models\n(LLMs) are increasing, creating future-proof evaluation sets to assess their\nunderstanding becomes more and more challenging. In this paper, we propose a\nnovel paradigm for evaluating LLMs which leverages the idea that correct world\nunderstanding should be consistent across different (Fregean) senses of the\nsame meaning. Accordingly, we measure understanding not in terms of correctness\nbut by evaluating consistency across multiple senses that are generated by the\nmodel itself. We showcase our approach by instantiating a test where the\ndifferent senses are different languages, hence using multilingual\nself-consistency as a litmus test for the model's understanding and\nsimultaneously addressing the important topic of multilinguality. Taking one of\nthe latest versions of ChatGPT as our object of study, we evaluate multilingual\nconsistency for two different tasks across three different languages. We show\nthat its multilingual consistency is still lacking, and that its task and world\nunderstanding are thus not language-independent. As our approach does not\nrequire any static evaluation corpora in languages other than English, it can\neasily and cheaply be extended to different languages and tasks and could\nbecome an integral part of future benchmarking efforts.\n","authors":["Xenia Ohmer","Elia Bruni","Dieuwke Hupkes"],"pdf_url":"https://arxiv.org/pdf/2305.11662v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12989v1","updated":"2023-12-20T12:46:44Z","published":"2023-12-20T12:46:44Z","title":"Benchmarking and Analyzing In-context Learning, Fine-tuning and\n Supervised Learning for Biomedical Knowledge Curation: a focused study on\n chemical entities of biological interest","summary":" Automated knowledge curation for biomedical ontologies is key to ensure that\nthey remain comprehensive, high-quality and up-to-date. In the era of\nfoundational language models, this study compares and analyzes three NLP\nparadigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and\nsupervised learning (ML). Using the Chemical Entities of Biological Interest\n(ChEBI) database as a model ontology, three curation tasks were devised. For\nICL, three prompting strategies were employed with GPT-4, GPT-3.5, BioGPT.\nPubmedBERT was chosen for the FT paradigm. For ML, six embedding models were\nutilized for training Random Forest and Long-Short Term Memory models. Five\nsetups were designed to assess ML and FT model performance across different\ndata availability scenarios.Datasets for curation tasks included: task 1\n(620,386), task 2 (611,430), and task 3 (617,381), maintaining a 50:50 positive\nversus negative ratio. For ICL models, GPT-4 achieved best accuracy scores of\n0.916, 0.766 and 0.874 for tasks 1-3 respectively. In a direct comparison, ML\n(trained on ~260,000 triples) outperformed ICL in accuracy across all tasks.\n(accuracy differences: +.11, +.22 and +.17). Fine-tuned PubmedBERT performed\nsimilarly to leading ML models in tasks 1 & 2 (F1 differences: -.014 and\n+.002), but worse in task 3 (-.048). Simulations revealed performance declines\nin both ML and FT models with smaller and higher imbalanced training data.\nwhere ICL (particularly GPT-4) excelled in tasks 1 & 3. GPT-4 excelled in tasks\n1 and 3 with less than 6,000 triples, surpassing ML/FT. ICL underperformed\nML/FT in task 2.ICL-augmented foundation models can be good assistants for\nknowledge curation with correct prompting, however, not making ML and FT\nparadigms obsolete. The latter two require task-specific data to beat ICL. In\nsuch cases, ML relies on small pretrained embeddings, minimizing computational\ndemands.\n","authors":["Emily Groves","Minhong Wang","Yusuf Abdulle","Holger Kunz","Jason Hoelscher-Obermaier","Ronin Wu","Honghan Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12989v1.pdf","comment":"26 pages, 5 figures, 14 tables"},{"id":"http://arxiv.org/abs/2312.12436v2","updated":"2023-12-20T12:40:47Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Zihan Wang","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Hongsheng Li","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12436v2.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2309.01431v2","updated":"2023-12-20T11:54:11Z","published":"2023-09-04T08:28:44Z","title":"Benchmarking Large Language Models in Retrieval-Augmented Generation","summary":" Retrieval-Augmented Generation (RAG) is a promising approach for mitigating\nthe hallucination of large language models (LLMs). However, existing research\nlacks rigorous evaluation of the impact of retrieval-augmented generation on\ndifferent large language models, which make it challenging to identify the\npotential bottlenecks in the capabilities of RAG for different LLMs. In this\npaper, we systematically investigate the impact of Retrieval-Augmented\nGeneration on large language models. We analyze the performance of different\nlarge language models in 4 fundamental abilities required for RAG, including\nnoise robustness, negative rejection, information integration, and\ncounterfactual robustness. To this end, we establish Retrieval-Augmented\nGeneration Benchmark (RGB), a new corpus for RAG evaluation in both English and\nChinese. RGB divides the instances within the benchmark into 4 separate\ntestbeds based on the aforementioned fundamental abilities required to resolve\nthe case. Then we evaluate 6 representative LLMs on RGB to diagnose the\nchallenges of current LLMs when applying RAG. Evaluation reveals that while\nLLMs exhibit a certain degree of noise robustness, they still struggle\nsignificantly in terms of negative rejection, information integration, and\ndealing with false information. The aforementioned assessment outcomes indicate\nthat there is still a considerable journey ahead to effectively apply RAG to\nLLMs.\n","authors":["Jiawei Chen","Hongyu Lin","Xianpei Han","Le Sun"],"pdf_url":"https://arxiv.org/pdf/2309.01431v2.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2307.12976v2","updated":"2023-12-20T11:52:41Z","published":"2023-07-24T17:52:46Z","title":"Evaluating the Ripple Effects of Knowledge Editing in Language Models","summary":" Modern language models capture a large body of factual knowledge. However,\nsome facts can be incorrectly induced or become obsolete over time, resulting\nin factually incorrect generations. This has led to the development of various\nediting methods that allow updating facts encoded by the model. Evaluation of\nthese methods has primarily focused on testing whether an individual fact has\nbeen successfully injected, and if similar predictions for other subjects have\nnot changed. Here we argue that such evaluation is limited, since injecting one\nfact (e.g. ``Jack Depp is the son of Johnny Depp'') introduces a ``ripple\neffect'' in the form of additional facts that the model needs to update\n(e.g.``Jack Depp is the sibling of Lily-Rose Depp''). To address this issue, we\npropose a novel set of evaluation criteria that consider the implications of an\nedit on related facts. Using these criteria, we then construct RippleEdits, a\ndiagnostic benchmark of 5K factual edits, capturing a variety of types of\nripple effects. We evaluate prominent editing methods on RippleEdits, showing\nthat current methods fail to introduce consistent changes in the model's\nknowledge. In addition, we find that a simple in-context editing baseline\nobtains the best scores on our benchmark, suggesting a promising research\ndirection for model editing.\n","authors":["Roi Cohen","Eden Biran","Ori Yoran","Amir Globerson","Mor Geva"],"pdf_url":"https://arxiv.org/pdf/2307.12976v2.pdf","comment":"Accepted for publication in Transactions of the Association for\n Computational Linguistics (TACL), 2024. Author's final version"},{"id":"http://arxiv.org/abs/2308.13198v2","updated":"2023-12-20T11:05:17Z","published":"2023-08-25T06:26:05Z","title":"Journey to the Center of the Knowledge Neurons: Discoveries of\n Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons","summary":" Pre-trained language models (PLMs) contain vast amounts of factual knowledge,\nbut how the knowledge is stored in the parameters remains unclear. This paper\ndelves into the complex task of understanding how factual knowledge is stored\nin multilingual PLMs, and introduces the Architecture-adapted Multilingual\nIntegrated Gradients method, which successfully localizes knowledge neurons\nmore precisely compared to current methods, and is more universal across\nvarious architectures and languages. Moreover, we conduct an in-depth\nexploration of knowledge neurons, leading to the following two important\ndiscoveries: (1) The discovery of Language-Independent Knowledge Neurons, which\nstore factual knowledge in a form that transcends language. We design\ncross-lingual knowledge editing experiments, demonstrating that the PLMs can\naccomplish this task based on language-independent neurons; (2) The discovery\nof Degenerate Knowledge Neurons, a novel type of neuron showing that different\nknowledge neurons can store the same fact. Its property of functional overlap\nendows the PLMs with a robust mastery of factual knowledge. We design\nfact-checking experiments, proving that the degenerate knowledge neurons can\nhelp the PLMs to detect wrong facts. Experiments corroborate these findings,\nshedding light on the mechanisms of factual knowledge storage in multilingual\nPLMs, and contribute valuable insights to the field. The code is available at\nhttps://github.com/heng840/AMIG.\n","authors":["Yuheng Chen","Pengfei Cao","Yubo Chen","Kang Liu","Jun Zhao"],"pdf_url":"https://arxiv.org/pdf/2308.13198v2.pdf","comment":"Accepted in the 38th AAAI Conference on Artificial Intelligence (AAAI\n 2024)"},{"id":"http://arxiv.org/abs/2312.12918v1","updated":"2023-12-20T10:53:53Z","published":"2023-12-20T10:53:53Z","title":"Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors","summary":" To combat the potential misuse of Natural Language Generation (NLG)\ntechnology, a variety of algorithms have been developed for the detection of\nAI-generated texts. Traditionally, this task is treated as a binary\nclassification problem. Although supervised learning has demonstrated promising\nresults, acquiring labeled data for detection purposes poses real-world\nchallenges and the risk of overfitting. In an effort to address these issues,\nwe delve into the realm of zero-shot machine-generated text detection. Existing\nzero-shot detectors, typically designed for specific tasks or topics, often\nassume uniform testing scenarios, limiting their practicality. In our research,\nwe explore various advanced Large Language Models (LLMs) and their specialized\nvariants, contributing to this field in several ways. In empirical studies, we\nuncover a significant correlation between topics and detection performance.\nSecondly, we delve into the influence of topic shifts on zero-shot detectors.\nThese investigations shed light on the adaptability and robustness of these\ndetection methods across diverse topics.\n","authors":["Yi-Fan Zhang","Zhang Zhang","Liang Wang","Rong Jin"],"pdf_url":"https://arxiv.org/pdf/2312.12918v1.pdf","comment":"8 pages, 3 figures, AAAI 2024 Workshop on Responsible Language Models"},{"id":"http://arxiv.org/abs/2312.12881v1","updated":"2023-12-20T09:45:44Z","published":"2023-12-20T09:45:44Z","title":"Big Tech influence over AI research revisited: memetic analysis of\n attribution of ideas to affiliation","summary":" There exists a growing discourse around the domination of Big Tech on the\nlandscape of artificial intelligence (AI) research, yet our comprehension of\nthis phenomenon remains cursory. This paper aims to broaden and deepen our\nunderstanding of Big Tech's reach and power within AI research. It highlights\nthe dominance not merely in terms of sheer publication volume but rather in the\npropagation of new ideas or \\textit{memes}. Current studies often oversimplify\nthe concept of influence to the share of affiliations in academic papers,\ntypically sourced from limited databases such as arXiv or specific academic\nconferences.\n The main goal of this paper is to unravel the specific nuances of such\ninfluence, determining which AI ideas are predominantly driven by Big Tech\nentities. By employing network and memetic analysis on AI-oriented paper\nabstracts and their citation network, we are able to grasp a deeper insight\ninto this phenomenon. By utilizing two databases: OpenAlex and S2ORC, we are\nable to perform such analysis on a much bigger scale than previous attempts.\n Our findings suggest, that while Big Tech-affiliated papers are\ndisproportionately more cited in some areas, the most cited papers are those\naffiliated with both Big Tech and Academia. Focusing on the most contagious\nmemes, their attribution to specific affiliation groups (Big Tech, Academia,\nmixed affiliation) seems to be equally distributed between those three groups.\nThis suggests that the notion of Big Tech domination over AI research is\noversimplified in the discourse.\n Ultimately, this more nuanced understanding of Big Tech's and Academia's\ninfluence could inform a more symbiotic alliance between these stakeholders\nwhich would better serve the dual goals of societal welfare and the scientific\nintegrity of AI research.\n","authors":["Stanisław Giziński","Paulina Kaczyńska","Hubert Ruczyński","Emilia Wiśnios","Bartosz Pieliński","Przemysław Biecek","Julian Sienkiewicz"],"pdf_url":"https://arxiv.org/pdf/2312.12881v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11276v3","updated":"2023-12-20T09:43:01Z","published":"2023-12-18T15:18:57Z","title":"Compositional Generalization for Multi-label Text Classification: A\n Data-Augmentation Approach","summary":" Despite significant advancements in multi-label text classification, the\nability of existing models to generalize to novel and seldom-encountered\ncomplex concepts, which are compositions of elementary ones, remains\nunderexplored. This research addresses this gap. By creating unique data splits\nacross three benchmarks, we assess the compositional generalization ability of\nexisting multi-label text classification models. Our results show that these\nmodels often fail to generalize to compositional concepts encountered\ninfrequently during training, leading to inferior performance on tests with\nthese new combinations. To address this, we introduce a data augmentation\nmethod that leverages two innovative text generation models designed to enhance\nthe classification models' capacity for compositional generalization. Our\nexperiments show that this data augmentation approach significantly improves\nthe compositional generalization capabilities of classification models on our\nbenchmarks, with both generation models surpassing other text generation\nbaselines.\n","authors":["Yuyang Chai","Zhuang Li","Jiahui Liu","Lei Chen","Fei Li","Donghong Ji","Chong Teng"],"pdf_url":"https://arxiv.org/pdf/2312.11276v3.pdf","comment":"Accepted by AAAI'24"},{"id":"http://arxiv.org/abs/2304.01246v3","updated":"2023-12-20T09:19:15Z","published":"2023-04-03T16:46:49Z","title":"Safety Analysis in the Era of Large Language Models: A Case Study of\n STPA using ChatGPT","summary":" Can safety analysis make use of Large Language Models (LLMs)? A case study\nexplores Systems Theoretic Process Analysis (STPA) applied to Automatic\nEmergency Brake (AEB) and Electricity Demand Side Management (DSM) systems\nusing ChatGPT. We investigate how collaboration schemes, input semantic\ncomplexity, and prompt guidelines influence STPA results. Comparative results\nshow that using ChatGPT without human intervention may be inadequate due to\nreliability related issues, but with careful design, it may outperform human\nexperts. No statistically significant differences are found when varying the\ninput semantic complexity or using common prompt guidelines, which suggests the\nnecessity for developing domain-specific prompt engineering. We also highlight\nfuture challenges, including concerns about LLM trustworthiness and the\nnecessity for standardisation and regulation in this domain.\n","authors":["Yi Qi","Xingyu Zhao","Siddartha Khastgir","Xiaowei Huang"],"pdf_url":"https://arxiv.org/pdf/2304.01246v3.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2312.12853v1","updated":"2023-12-20T09:06:18Z","published":"2023-12-20T09:06:18Z","title":"CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks\n for Chinese Large Language Models","summary":" As an indispensable ingredient of intelligence, commonsense reasoning is\ncrucial for large language models (LLMs) in real-world scenarios. In this\npaper, we propose CORECODE, a dataset that contains abundant commonsense\nknowledge manually annotated on dyadic dialogues, to evaluate the commonsense\nreasoning and commonsense conflict detection capabilities of Chinese LLMs. We\ncategorize commonsense knowledge in everyday conversations into three\ndimensions: entity, event, and social interaction. For easy and consistent\nannotation, we standardize the form of commonsense knowledge annotation in\nopen-domain dialogues as \"domain: slot = value\". A total of 9 domains and 37\nslots are defined to capture diverse commonsense knowledge. With these\npre-defined domains and slots, we collect 76,787 commonsense knowledge\nannotations from 19,700 dialogues through crowdsourcing. To evaluate and\nenhance the commonsense reasoning capability for LLMs on the curated dataset,\nwe establish a series of dialogue-level reasoning and detection tasks,\nincluding commonsense knowledge filling, commonsense knowledge generation,\ncommonsense conflict phrase detection, domain identification, slot\nidentification, and event causal inference. A wide variety of existing\nopen-source Chinese LLMs are evaluated with these tasks on our dataset.\nExperimental results demonstrate that these models are not competent to predict\nCORECODE's plentiful reasoning content, and even ChatGPT could only achieve\n0.275 and 0.084 accuracy on the domain identification and slot identification\ntasks under the zero-shot setting. We release the data and codes of CORECODE at\nhttps://github.com/danshi777/CORECODE to promote commonsense reasoning\nevaluation and study of LLMs in the context of daily conversations.\n","authors":["Dan Shi","Chaobin You","Jiantao Huang","Taihao Li","Deyi Xiong"],"pdf_url":"https://arxiv.org/pdf/2312.12853v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12852v1","updated":"2023-12-20T09:06:06Z","published":"2023-12-20T09:06:06Z","title":"Language Resources for Dutch Large Language Modelling","summary":" Despite the rapid expansion of types of large language models, there remains\na notable gap in models specifically designed for the Dutch language. This gap\nis not only a shortage in terms of pretrained Dutch models but also in terms of\ndata, and benchmarks and leaderboards. This work provides a small step to\nimprove the situation. First, we introduce two fine-tuned variants of the Llama\n2 13B model. We first fine-tuned Llama 2 using Dutch-specific web-crawled data\nand subsequently refined this model further on multiple synthetic instruction\nand chat datasets. These datasets as well as the model weights are made\navailable. In addition, we provide a leaderboard to keep track of the\nperformance of (Dutch) models on a number of generation tasks, and we include\nresults of a number of state-of-the-art models, including our own. Finally we\nprovide a critical conclusion on what we believe is needed to push forward\nDutch language models and the whole eco-system around the models.\n","authors":["Bram Vanroy"],"pdf_url":"https://arxiv.org/pdf/2312.12852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12850v1","updated":"2023-12-20T09:01:01Z","published":"2023-12-20T09:01:01Z","title":"A Stochastic Analysis of the Linguistic Provenance of English Place\n Names","summary":" In English place name analysis, meanings are often derived from the\nresemblance of roots in place names to topographical features, proper names\nand/or habitation terms in one of the languages that have had an influence on\nEnglish place names. The problem here is that it is sometimes difficult to\ndetermine the base language to use to interpret the roots. The purpose of this\npaper is to stochastically determine the resemblance between 18799 English\nplace names and 84685 place names from Ireland, Scotland, Wales, Denmark,\nNorway, Sweden, France, Germany, the Netherlands and Ancient Rome. Each English\nplace name is ranked according to the extent to which it resembles place names\nfrom the other countries, and this provides a basis for determining the likely\nlanguage to use to interpret the place name. A number of observations can be\nmade using the ranking provided. In particular, it is found that `Didlington'\nis the most archetypically English place name in the English sample, and `Anna'\nis the least. Furthermore, it is found that the place names in the non-English\ndatasets are most similar to Norwegian place names and least similar to Welsh\nplace names.\n","authors":["Michael Dalvean"],"pdf_url":"https://arxiv.org/pdf/2312.12850v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.05221v4","updated":"2023-12-20T08:47:39Z","published":"2023-03-09T12:50:34Z","title":"SEAM: An Integrated Activation-Coupled Model of Sentence Processing and\n Eye Movements in Reading","summary":" Models of eye-movement control during reading, developed largely within\npsychology, usually focus on visual, attentional, lexical, and motor processes\nbut neglect post-lexical language processing; by contrast, models of sentence\ncomprehension processes, developed largely within psycholinguistics, generally\nfocus only on post-lexical language processes. We present a model that combines\nthese two research threads, by integrating eye-movement control and sentence\nprocessing. Developing such an integrated model is extremely challenging and\ncomputationally demanding, but such an integration is an important step toward\ncomplete mathematical models of natural language comprehension in reading. We\ncombine the SWIFT model of eye-movement control (Seelig et al., 2020,\ndoi:10.1016/j.jmp.2019.102313) with key components of the Lewis and Vasishth\nsentence processing model (Lewis & Vasishth, 2005,\ndoi:10.1207/s15516709cog0000_25). This integration becomes possible, for the\nfirst time, due in part to recent advances in successful parameter\nidentification in dynamical models, which allows us to investigate profile\nlog-likelihoods for individual model parameters. We present a fully implemented\nproof-of-concept model demonstrating how such an integrated model can be\nachieved; our approach includes Bayesian model inference with Markov Chain\nMonte Carlo (MCMC) sampling as a key computational tool. The integrated\nSentence-Processing and Eye-Movement Activation-Coupled Model (SEAM) can\nsuccessfully reproduce eye movement patterns that arise due to similarity-based\ninterference in reading. To our knowledge, this is the first-ever integration\nof a complete process model of eye-movement control with linguistic dependency\ncompletion processes in sentence comprehension. In future work, this proof of\nconcept model will need to be evaluated using a comprehensive set of benchmark\ndata.\n","authors":["Maximilian M. Rabe","Dario Paape","Daniela Mertzen","Shravan Vasishth","Ralf Engbert"],"pdf_url":"https://arxiv.org/pdf/2303.05221v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15494v3","updated":"2023-12-20T08:46:01Z","published":"2023-10-24T03:42:49Z","title":"TRAMS: Training-free Memory Selection for Long-range Language Modeling","summary":" The Transformer architecture is crucial for numerous AI models, but it still\nfaces challenges in long-range language modeling. Though several specific\ntransformer architectures have been designed to tackle issues of long-range\ndependencies, existing methods like Transformer-XL are plagued by a high\npercentage of ineffective memories. In this study, we present a plug-and-play\nstrategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens\nparticipating in attention calculation based on one simple metric. This\nstrategy allows us to keep tokens that are likely to have a high attention\nscore with the current queries and ignore the other ones. We have tested our\napproach on the word-level benchmark (WikiText-103) and the character-level\nbenchmark (enwik8), and the results indicate an improvement without having\nadditional training or adding additional parameters.\n","authors":["Haofei Yu","Cunxiang Wang","Yue Zhang","Wei Bi"],"pdf_url":"https://arxiv.org/pdf/2310.15494v3.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.12832v1","updated":"2023-12-20T08:28:36Z","published":"2023-12-20T08:28:36Z","title":"Turning Dust into Gold: Distilling Complex Reasoning Capabilities from\n LLMs by Leveraging Negative Data","summary":" Large Language Models (LLMs) have performed well on various reasoning tasks,\nbut their inaccessibility and numerous parameters hinder wide application in\npractice. One promising way is distilling the reasoning ability from LLMs to\nsmall models by the generated chain-of-thought reasoning paths. In some cases,\nhowever, LLMs may produce incorrect reasoning chains, especially when facing\ncomplex mathematical problems. Previous studies only transfer knowledge from\npositive samples and drop the synthesized data with wrong answers. In this\nwork, we illustrate the merit of negative data and propose a model\nspecialization framework to distill LLMs with negative samples besides positive\nones. The framework consists of three progressive steps, covering from training\nto inference stages, to absorb knowledge from negative data. We conduct\nextensive experiments across arithmetic reasoning tasks to demonstrate the role\nof negative data in distillation from LLM.\n","authors":["Yiwei Li","Peiwen Yuan","Shaoxiong Feng","Boyuan Pan","Bin Sun","Xinglin Wang","Heda Wang","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2312.12832v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.09085v2","updated":"2023-12-20T08:03:12Z","published":"2023-12-14T16:16:50Z","title":"The Earth is Flat because...: Investigating LLMs' Belief towards\n Misinformation via Persuasive Conversation","summary":" Large Language Models (LLMs) encapsulate vast amounts of knowledge but still\nremain vulnerable to external misinformation. Existing research mainly studied\nthis susceptibility behavior in a single-turn setting. However, belief can\nchange during a multi-turn conversation, especially a persuasive one.\nTherefore, in this study, we delve into LLMs' susceptibility to persuasive\nconversations, particularly on factual questions that they can answer\ncorrectly. We first curate the Farm (i.e., Fact to Misinform) dataset, which\ncontains factual questions paired with systematically generated persuasive\nmisinformation. Then, we develop a testing framework to track LLMs' belief\nchanges in a persuasive dialogue. Through extensive experiments, we find that\nLLMs' correct beliefs on factual knowledge can be easily manipulated by various\npersuasive strategies.\n","authors":["Rongwu Xu","Brian S. Lin","Shujian Yang","Tianqi Zhang","Weiyan Shi","Tianwei Zhang","Zhixuan Fang","Wei Xu","Han Qiu"],"pdf_url":"https://arxiv.org/pdf/2312.09085v2.pdf","comment":"45 pages"},{"id":"http://arxiv.org/abs/2312.12815v1","updated":"2023-12-20T07:34:20Z","published":"2023-12-20T07:34:20Z","title":"OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using\n Semantic Understanding in Mixed Reality","summary":" One key challenge in augmented reality is the placement of virtual content in\nnatural locations. Existing automated techniques are only able to work with a\nclosed-vocabulary, fixed set of objects. In this paper, we introduce a new\nopen-vocabulary method for object placement. Our eight-stage pipeline leverages\nrecent advances in segmentation models, vision-language models, and LLMs to\nplace any virtual object in any AR camera frame or scene. In a preliminary user\nstudy, we show that our method performs at least as well as human experts 57%\nof the time.\n","authors":["Luke Yoffe","Aditya Sharma","Tobias Höllerer"],"pdf_url":"https://arxiv.org/pdf/2312.12815v1.pdf","comment":"IEEE International Symposium on Mixed and Augmented Reality (ISMAR)\n 2023"},{"id":"http://arxiv.org/abs/2312.11562v2","updated":"2023-12-20T07:25:58Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v2.pdf","comment":"20 Figures, 159 Pages, 740 References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2312.12808v1","updated":"2023-12-20T07:15:04Z","published":"2023-12-20T07:15:04Z","title":"Enhancing Consistency in Multimodal Dialogue System Using LLM with\n Dialogue Scenario","summary":" This paper describes our dialogue system submitted to Dialogue Robot\nCompetition 2023. The system's task is to help a user at a travel agency decide\non a plan for visiting two sightseeing spots in Kyoto City that satisfy the\nuser. Our dialogue system is flexible and stable and responds to user\nrequirements by controlling dialogue flow according to dialogue scenarios. We\nalso improved user satisfaction by introducing motion and speech control based\non system utterances and user situations. In the preliminary round, our system\nwas ranked fifth in the impression evaluation and sixth in the plan evaluation\namong all 12 teams.\n","authors":["Hiroki Onozeki","Zhiyang Qi","Kazuma Akiyama","Ryutaro Asahara","Takumasa Kaneko","Michimasa Inaba"],"pdf_url":"https://arxiv.org/pdf/2312.12808v1.pdf","comment":"This paper is part of the proceedings of the Dialogue Robot\n Competition 2023"},{"id":"http://arxiv.org/abs/2312.12806v1","updated":"2023-12-20T07:01:49Z","published":"2023-12-20T07:01:49Z","title":"MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large\n Language Models","summary":" The emergence of various medical large language models (LLMs) in the medical\ndomain has highlighted the need for unified evaluation standards, as manual\nevaluation of LLMs proves to be time-consuming and labor-intensive. To address\nthis issue, we introduce MedBench, a comprehensive benchmark for the Chinese\nmedical domain, comprising 40,041 questions sourced from authentic examination\nexercises and medical reports of diverse branches of medicine. In particular,\nthis benchmark is composed of four key components: the Chinese Medical\nLicensing Examination, the Resident Standardization Training Examination, the\nDoctor In-Charge Qualification Examination, and real-world clinic cases\nencompassing examinations, diagnoses, and treatments. MedBench replicates the\neducational progression and clinical practice experiences of doctors in\nMainland China, thereby establishing itself as a credible benchmark for\nassessing the mastery of knowledge and reasoning abilities in medical language\nlearning models. We perform extensive experiments and conduct an in-depth\nanalysis from diverse perspectives, which culminate in the following findings:\n(1) Chinese medical LLMs underperform on this benchmark, highlighting the need\nfor significant advances in clinical knowledge and diagnostic precision. (2)\nSeveral general-domain LLMs surprisingly possess considerable medical\nknowledge. These findings elucidate both the capabilities and limitations of\nLLMs within the context of MedBench, with the ultimate goal of aiding the\nmedical research community.\n","authors":["Yan Cai","Linlin Wang","Ye Wang","Gerard de Melo","Ya Zhang","Yanfeng Wang","Liang He"],"pdf_url":"https://arxiv.org/pdf/2312.12806v1.pdf","comment":"accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2310.14747v3","updated":"2023-12-20T06:50:20Z","published":"2023-10-23T09:32:53Z","title":"MCC-KD: Multi-CoT Consistent Knowledge Distillation","summary":" Large language models (LLMs) have showcased remarkable capabilities in\ncomplex reasoning through chain of thought (CoT) prompting. Recently, there has\nbeen a growing interest in transferring these reasoning abilities from LLMs to\nsmaller models. However, achieving both the diversity and consistency in\nrationales presents a challenge. In this paper, we focus on enhancing these two\naspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to\nefficiently distill the reasoning capabilities. In MCC-KD, we generate multiple\nrationales for each question and enforce consistency among the corresponding\npredictions by minimizing the bidirectional KL-divergence between the answer\ndistributions. We investigate the effectiveness of MCC-KD with different model\narchitectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both\nmathematical reasoning and commonsense reasoning benchmarks. The empirical\nresults not only confirm MCC-KD's superior performance on in-distribution\ndatasets but also highlight its robust generalization ability on\nout-of-distribution datasets.\n","authors":["Hongzhan Chen","Siyue Wu","Xiaojun Quan","Rui Wang","Ming Yan","Ji Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14747v3.pdf","comment":"Accepted to ENMLP 2023"},{"id":"http://arxiv.org/abs/2312.03719v2","updated":"2023-12-20T06:40:30Z","published":"2023-11-26T05:27:35Z","title":"Assessing AI Chatbots Performance in Comprehensive Standardized Test\n Preparation; A Case Study with GRE","summary":" This research paper presents a comprehensive evaluation of the performance of\nthree artificial 10 intelligence chatbots: Bing, ChatGPT, and GPT-4, in\naddressing standardized test questions. Graduate record examination, known as\nGRE, serves as a case study in this paper, encompassing both quantitative\nreasoning and verbal skills. A total of 137 quantitative reasoning questions,\nfeaturing diverse styles and 157 verbal questions categorized into varying\nlevels of difficulty (easy, medium, and hard) were administered to assess the\nchatbots' capabilities. This paper provides a detailed examination of the\nresults and their implications for the utilization of artificial intelligence\nin standardized test preparation by presenting the performance of each chatbot\nacross various skills and styles tested in the exam. Additionally, this paper\nexplores the proficiency of artificial intelligence in addressing image-based\nquestions and illustrates the uncertainty level of each chatbot. The results\nreveal varying degrees of success across the chatbots, demonstrating the\ninfluence of model sophistication and training data. GPT-4 emerged as the most\nproficient, especially in complex language understanding tasks, highlighting\nthe evolution of artificial intelligence in language comprehension and its\nability to pass the exam with a high score.\n","authors":["Mohammad Abu-Haifa","Bara'a Etawi","Huthaifa Alkhatatbeh","Ayman Ababneh"],"pdf_url":"https://arxiv.org/pdf/2312.03719v2.pdf","comment":"19 Pages, 6 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2312.12783v1","updated":"2023-12-20T06:02:12Z","published":"2023-12-20T06:02:12Z","title":"Stable Distillation: Regularizing Continued Pre-training for\n Low-Resource Automatic Speech Recognition","summary":" Continued self-supervised (SSL) pre-training for adapting existing SSL models\nto the target domain has shown to be extremely effective for low-resource\nAutomatic Speech Recognition (ASR). This paper proposes Stable Distillation, a\nsimple and novel approach for SSL-based continued pre-training that boosts ASR\nperformance in the target domain where both labeled and unlabeled data are\nlimited. Stable Distillation employs self-distillation as regularization for\ncontinued pre-training, alleviating the over-fitting issue, a common problem\ncontinued pre-training faces when the source and target domains differ.\nSpecifically, first, we perform vanilla continued pre-training on an initial\nSSL pre-trained model on the target domain ASR dataset and call it the teacher.\nNext, we take the same initial pre-trained model as a student to perform\ncontinued pre-training while enforcing its hidden representations to be close\nto that of the teacher (via MSE loss). This student is then used for downstream\nASR fine-tuning on the target dataset. In practice, Stable Distillation\noutperforms all our baselines by 0.8 - 7 WER when evaluated in various\nexperimental settings.\n","authors":["Ashish Seth","Sreyan Ghosh","S. Umesh","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2312.12783v1.pdf","comment":"Accepted to ICASSP 2024. Code:\n https://github.com/cs20s030/stable_distillation"},{"id":"http://arxiv.org/abs/2312.11985v2","updated":"2023-12-20T05:27:30Z","published":"2023-12-19T09:26:46Z","title":"Climate Change from Large Language Models","summary":" Climate change presents significant challenges to the global community, and\nit is imperative to raise widespread awareness of the climate crisis and\neducate users about low-carbon living. Artificial intelligence, particularly\nlarge language models (LLMs), have emerged as powerful tools in mitigating the\nclimate crisis, leveraging their extensive knowledge, broad user base, and\nnatural language interaction capabilities. However, despite the growing body of\nresearch on climate change, there is a lack of comprehensive assessments of\nclimate crisis knowledge within LLMs. This paper aims to resolve this gap by\nproposing an automatic evaluation framework. We employ a hybrid approach to\ndata acquisition that combines data synthesis and manual collection to compile\na diverse set of questions related to the climate crisis. These questions cover\nvarious aspects of climate change, including its causes, impacts, mitigation\nstrategies, and adaptation measures. We then evaluate the model knowledge\nthrough prompt engineering based on the collected questions and generated\nanswers. We propose a set of comprehensive metrics to evaluate the climate\ncrisis knowledge, incorporating indicators from 10 different perspectives.\nExperimental results show that our method is effective in evaluating the\nknowledge of LLMs regarding the climate crisis. We evaluate several\nstate-of-the-art LLMs and find that their knowledge falls short in terms of\ntimeliness.\n","authors":["Hongyin Zhu","Prayag Tiwari"],"pdf_url":"https://arxiv.org/pdf/2312.11985v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12773v1","updated":"2023-12-20T05:17:06Z","published":"2023-12-20T05:17:06Z","title":"Segmenting Messy Text: Detecting Boundaries in Text Derived from\n Historical Newspaper Images","summary":" Text segmentation, the task of dividing a document into sections, is often a\nprerequisite for performing additional natural language processing tasks.\nExisting text segmentation methods have typically been developed and tested\nusing clean, narrative-style text with segments containing distinct topics.\nHere we consider a challenging text segmentation task: dividing newspaper\nmarriage announcement lists into units of one announcement each. In many cases\nthe information is not structured into sentences, and adjacent segments are not\ntopically distinct from each other. In addition, the text of the announcements,\nwhich is derived from images of historical newspapers via optical character\nrecognition, contains many typographical errors. As a result, these\nannouncements are not amenable to segmentation with existing techniques. We\npresent a novel deep learning-based model for segmenting such text and show\nthat it significantly outperforms an existing state-of-the-art method on our\ntask.\n","authors":["Carol Anderson","Phil Crone"],"pdf_url":"https://arxiv.org/pdf/2312.12773v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.12764v1","updated":"2023-12-20T04:52:24Z","published":"2023-12-20T04:52:24Z","title":"Lattice Rescoring Based on Large Ensemble of Complementary Neural\n Language Models","summary":" We investigate the effectiveness of using a large ensemble of advanced neural\nlanguage models (NLMs) for lattice rescoring on automatic speech recognition\n(ASR) hypotheses. Previous studies have reported the effectiveness of combining\na small number of NLMs. In contrast, in this study, we combine up to eight\nNLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are\ntrained with two different random initialization seeds. We combine these NLMs\nthrough iterative lattice generation. Since these NLMs work complementarily\nwith each other, by combining them one by one at each rescoring iteration,\nlanguage scores attached to given lattice arcs can be gradually refined.\nConsequently, errors of the ASR hypotheses can be gradually reduced. We also\ninvestigate the effectiveness of carrying over contextual information (previous\nrescoring results) across a lattice sequence of a long speech such as a lecture\nspeech. In experiments using a lecture speech corpus, by combining the eight\nNLMs and using context carry-over, we obtained a 24.4% relative word error rate\nreduction from the ASR 1-best baseline. For further comparison, we performed\nsimultaneous (i.e., non-iterative) NLM combination and 100-best rescoring using\nthe large ensemble of NLMs, which confirmed the advantage of lattice rescoring\nwith iterative NLM combination.\n","authors":["Atsunori Ogawa","Naohiro Tawara","Marc Delcroix","Shoko Araki"],"pdf_url":"https://arxiv.org/pdf/2312.12764v1.pdf","comment":"Accepted to ICASSP 2022"},{"id":"http://arxiv.org/abs/2312.12754v1","updated":"2023-12-20T04:27:13Z","published":"2023-12-20T04:27:13Z","title":"Spectral Prompt Tuning:Unveiling Unseen Classes for Zero-Shot Semantic\n Segmentation","summary":" Recently, CLIP has found practical utility in the domain of pixel-level\nzero-shot segmentation tasks. The present landscape features two-stage\nmethodologies beset by issues such as intricate pipelines and elevated\ncomputational costs. While current one-stage approaches alleviate these\nconcerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's\ngeneralization capacity, they still fall short in fully harnessing CLIP's\npotential for pixel-level unseen class demarcation and precise pixel\npredictions. To further stimulate CLIP's zero-shot dense prediction capability,\nwe propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from\nimage to pixel. Specifically, we initially introduce Spectral Prompt Tuning\n(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow\nlayers to capture structural intricacies of images, thereby enhancing\ncomprehension of unseen classes. Subsequently, we introduce the Spectral Guided\nDecoder (SGD), utilizing both high and low-frequency information to steer the\nnetwork's spatial focus towards more prominent classification features,\nenabling precise pixel-level prediction outcomes. Through extensive experiments\non two public datasets, we demonstrate the superiority of our method over\nstate-of-the-art approaches, performing well across all classes and\nparticularly excelling in handling unseen classes. Code is available\nat:https://github.com/clearxu/SPT.\n","authors":["Wenhao Xu","Rongtao Xu","Changwei Wang","Shibiao Xu","Li Guo","Man Zhang","Xiaopeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12754v1.pdf","comment":"AAAI2024 Accepted"},{"id":"http://arxiv.org/abs/2312.12747v1","updated":"2023-12-20T03:44:18Z","published":"2023-12-20T03:44:18Z","title":"ALMANACS: A Simulatability Benchmark for Language Model Explainability","summary":" How do we measure the efficacy of language model explainability methods?\nWhile many explainability methods have been developed, they are typically\nevaluated on bespoke tasks, preventing an apples-to-apples comparison. To help\nfill this gap, we present ALMANACS, a language model explainability benchmark.\nALMANACS scores explainability methods on simulatability, i.e., how well the\nexplanations improve behavior prediction on new inputs. The ALMANACS scenarios\nspan twelve safety-relevant topics such as ethical reasoning and advanced AI\nbehaviors; they have idiosyncratic premises to invoke model-specific behavior;\nand they have a train-test distributional shift to encourage faithful\nexplanations. By using another language model to predict behavior based on the\nexplanations, ALMANACS is a fully automated benchmark. We use ALMANACS to\nevaluate counterfactuals, rationalizations, attention, and Integrated Gradients\nexplanations. Our results are sobering: when averaged across all topics, no\nexplanation method outperforms the explanation-free control. We conclude that\ndespite modest successes in prior work, developing an explanation method that\naids simulatability in ALMANACS remains an open challenge.\n","authors":["Edmund Mills","Shiye Su","Stuart Russell","Scott Emmons"],"pdf_url":"https://arxiv.org/pdf/2312.12747v1.pdf","comment":"Code is available at\n https://github.com/edmundmills/ALMANACS}{https://github.com/edmundmills/ALMANACS"},{"id":"http://arxiv.org/abs/2312.12746v1","updated":"2023-12-20T03:40:45Z","published":"2023-12-20T03:40:45Z","title":"ChatFDA: Medical Records Risk Assessment","summary":" In healthcare, the emphasis on patient safety and the minimization of medical\nerrors cannot be overstated. Despite concerted efforts, many healthcare\nsystems, especially in low-resource regions, still grapple with preventing\nthese errors effectively. This study explores a pioneering application aimed at\naddressing this challenge by assisting caregivers in gauging potential risks\nderived from medical notes. The application leverages data from openFDA,\ndelivering real-time, actionable insights regarding prescriptions. Preliminary\nanalyses conducted on the MIMIC-III \\cite{mimic} dataset affirm a proof of\nconcept highlighting a reduction in medical errors and an amplification in\npatient safety. This tool holds promise for drastically enhancing healthcare\noutcomes in settings with limited resources. To bolster reproducibility and\nfoster further research, the codebase underpinning our methodology is\naccessible on\nhttps://github.com/autonlab/2023.hackAuton/tree/main/prescription_checker. This\nis a submission for the 30th HackAuton CMU.\n","authors":["M Tran","C Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12746v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12430v2","updated":"2023-12-20T03:33:54Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Jun Yang","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.07207v3","updated":"2023-12-20T03:22:02Z","published":"2022-06-14T23:24:15Z","title":"Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across\n Modalities","summary":" Events describe happenings in our world that are of importance. Naturally,\nunderstanding events mentioned in multimedia content and how they are related\nforms an important way of comprehending our world. Existing literature can\ninfer if events across textual and visual (video) domains are identical (via\ngrounding) and thus, on the same semantic level. However, grounding fails to\ncapture the intricate cross-event relations that exist due to the same events\nbeing referred to on many semantic levels. For example, in Figure 1, the\nabstract event of \"war\" manifests at a lower semantic level through subevents\n\"tanks firing\" (in video) and airplane \"shot\" (in text), leading to a\nhierarchical, multimodal relationship between the events.\n In this paper, we propose the task of extracting event hierarchies from\nmultimodal (video and text) data to capture how the same event manifests itself\nin different modalities at different semantic levels. This reveals the\nstructure of events and is critical to understanding them. To support research\non this task, we introduce the Multimodal Hierarchical Events (MultiHiEve)\ndataset. Unlike prior video-language datasets, MultiHiEve is composed of news\nvideo-article pairs, which makes it rich in event hierarchies. We densely\nannotate a part of the dataset to construct the test benchmark. We show the\nlimitations of state-of-the-art unimodal and multimodal baselines on this task.\nFurther, we address these limitations via a new weakly supervised model,\nleveraging only unannotated video-article pairs from MultiHiEve. We perform a\nthorough evaluation of our proposed method which demonstrates improved\nperformance on this task and highlight opportunities for future research.\n","authors":["Hammad A. Ayyubi","Christopher Thomas","Lovish Chum","Rahul Lokesh","Long Chen","Yulei Niu","Xudong Lin","Xuande Feng","Jaywon Koo","Sounak Ray","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2206.07207v3.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12740v1","updated":"2023-12-20T03:21:48Z","published":"2023-12-20T03:21:48Z","title":"Fine-tuning Large Language Models for Adaptive Machine Translation","summary":" This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose\nlarge language model (LLM), for adaptive machine translation (MT). The\nfine-tuning process involves utilising a combination of zero-shot and one-shot\ntranslation prompts within the medical domain. The primary objective is to\nenhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt\ntranslations to the required domain at inference time. The results,\nparticularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned\nmodel, demonstrating quality improvements in both zero-shot and one-shot\ntranslation scenarios, surpassing Mistral 7B's baseline performance. Notably,\nthe fine-tuned Mistral outperforms ChatGPT \"gpt-3.5-turbo\" in zero-shot\ntranslation while achieving comparable one-shot translation quality. Moreover,\nthe zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's\nperformance, and its one-shot translation quality surpasses that of NLLB 3.3B.\nThese findings emphasise the significance of fine-tuning efficient LLMs like\nMistral 7B to yield high-quality zero-shot translations comparable to\ntask-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved\nin one-shot translation are comparable to those of commercial LLMs such as\nChatGPT. Our experiments demonstrate that, with a relatively small dataset of\n20,000 segments that incorporate a mix of zero-shot and one-shot prompts,\nfine-tuning significantly enhances Mistral's in-context learning ability,\nespecially for real-time adaptive MT.\n","authors":["Yasmin Moslem","Rejwanul Haque","Andy Way"],"pdf_url":"https://arxiv.org/pdf/2312.12740v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12736v1","updated":"2023-12-20T03:18:50Z","published":"2023-12-20T03:18:50Z","title":"Learning and Forgetting Unsafe Examples in Large Language Models","summary":" As the number of large language models (LLMs) released to the public grows,\nthere is a pressing need to understand the safety implications associated with\nthese models learning from third-party custom finetuning data. We explore the\nbehavior of LLMs finetuned on noisy custom data containing unsafe content,\nrepresented by datasets that contain biases, toxicity, and harmfulness, finding\nthat while aligned LLMs can readily learn this unsafe content, they also tend\nto forget it more significantly than other examples when subsequently finetuned\non safer content. Drawing inspiration from the discrepancies in forgetting, we\nintroduce the \"ForgetFilter\" algorithm, which filters unsafe data based on how\nstrong the model's forgetting signal is for that data. We demonstrate that the\nForgetFilter algorithm ensures safety in customized finetuning without\ncompromising downstream task performance, unlike sequential safety finetuning.\nForgetFilter outperforms alternative strategies like replay and moral\nself-correction in curbing LLMs' ability to assimilate unsafe content during\ncustom finetuning, e.g. 75% lower than not applying any safety measures and 62%\nlower than using self-correction in toxicity score.\n","authors":["Jiachen Zhao","Zhun Deng","David Madras","James Zou","Mengye Ren"],"pdf_url":"https://arxiv.org/pdf/2312.12736v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08742v4","updated":"2023-12-20T03:16:09Z","published":"2023-08-17T02:33:43Z","title":"PMET: Precise Model Editing in a Transformer","summary":" Model editing techniques modify a minor proportion of knowledge in Large\nLanguage Models (LLMs) at a relatively low cost, which have demonstrated\nnotable success. Existing methods assume Transformer Layer (TL) hidden states\nare values of key-value memories of the Feed-Forward Network (FFN). They\nusually optimize the TL hidden states to memorize target knowledge and use it\nto update the weights of the FFN in LLMs. However, the information flow of TL\nhidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN,\nand residual connections. Existing methods neglect the fact that the TL hidden\nstates contains information not specifically required for FFN. Consequently,\nthe performance of model editing decreases. To achieve more precise model\nediting, we analyze hidden states of MHSA and FFN, finding that MHSA encodes\ncertain general knowledge extraction patterns. This implies that MHSA weights\ndo not require updating when new knowledge is introduced. Based on above\nfindings, we introduce PMET, which simultaneously optimizes Transformer\nComponent (TC, namely MHSA and FFN) hidden states, while only using the\noptimized TC hidden states of FFN to precisely update FFN weights. Our\nexperiments demonstrate that PMET exhibits state-of-the-art performance on both\nthe COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the\neffectiveness of our enhancements, further reinforcing the finding that the\nMHSA encodes certain general knowledge extraction patterns and indicating its\nstorage of a small amount of factual knowledge. Our code is available at\nhttps://github.com/xpq-tech/PMET.\n","authors":["Xiaopeng Li","Shasha Li","Shezheng Song","Jing Yang","Jun Ma","Jie Yu"],"pdf_url":"https://arxiv.org/pdf/2308.08742v4.pdf","comment":"Accepted in AAAI24"},{"id":"http://arxiv.org/abs/2312.11681v2","updated":"2023-12-20T03:01:36Z","published":"2023-12-18T20:01:58Z","title":"Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows","summary":" LLM chains enable complex tasks by decomposing work into a sequence of\nsub-tasks. Crowdsourcing workflows similarly decompose complex tasks into\nsmaller tasks for human crowdworkers. Chains address LLM errors analogously to\nthe way crowdsourcing workflows address human error. To characterize\nopportunities for LLM chaining, we survey 107 papers across the crowdsourcing\nand chaining literature to construct a design space for chain development. The\ndesign space connects an LLM designer's objectives to strategies they can use\nto achieve those objectives, and tactics to implement each strategy. To explore\nhow techniques from crowdsourcing may apply to chaining, we adapt crowdsourcing\nworkflows to implement LLM chains across three case studies: creating a\ntaxonomy, shortening text, and writing a short story. From the design space and\nour case studies, we identify which techniques transfer from crowdsourcing to\nLLM chaining and raise implications for future research and development.\n","authors":["Madeleine Grunde-McLaughlin","Michelle S. Lam","Ranjay Krishna","Daniel S. Weld","Jeffrey Heer"],"pdf_url":"https://arxiv.org/pdf/2312.11681v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.03898v3","updated":"2023-12-20T02:43:39Z","published":"2023-04-08T03:24:05Z","title":"The Short Text Matching Model Enhanced with Knowledge via Contrastive\n Learning","summary":" In recent years, short Text Matching tasks have been widely applied in the\nfields ofadvertising search and recommendation. The difficulty lies in the lack\nof semantic information and word ambiguity caused by the short length of the\ntext. Previous works have introduced complement sentences or knowledge bases to\nprovide additional feature information. However, these methods have not fully\ninteracted between the original sentence and the complement sentence, and have\nnot considered the noise issue that may arise from the introduction of external\nknowledge bases. Therefore, this paper proposes a short Text Matching model\nthat combines contrastive learning and external knowledge. The model uses a\ngenerative model to generate corresponding complement sentences and uses the\ncontrastive learning method to guide the model to obtain more semantically\nmeaningful encoding of the original sentence. In addition, to avoid noise, we\nuse keywords as the main semantics of the original sentence to retrieve\ncorresponding knowledge words in the knowledge base, and construct a knowledge\ngraph. The graph encoding model is used to integrate the knowledge base\ninformation into the model. Our designed model achieves state-of-the-art\nperformance on two publicly available Chinese Text Matching datasets,\ndemonstrating the effectiveness of our model.\n","authors":["Ruiqiang Liu","Qiqiang Zhong","Mengmeng Cui","Hanjie Mai","Qiang Zhang","Shaohua Xu","Xiangzheng Liu","Yanlong Du"],"pdf_url":"https://arxiv.org/pdf/2304.03898v3.pdf","comment":"11 pages,2 figures"},{"id":"http://arxiv.org/abs/2312.12716v1","updated":"2023-12-20T02:22:49Z","published":"2023-12-20T02:22:49Z","title":"BloomVQA: Assessing Hierarchical Multi-modal Comprehension","summary":" We propose a novel VQA dataset, based on picture stories designed for\neducating young children, that aims to facilitate comprehensive evaluation and\ncharacterization of vision-language models on comprehension tasks. Unlike\ncurrent VQA datasets that often focus on fact-based memorization and simple\nreasoning tasks without principled scientific grounding, we collect data\ncontaining tasks reflecting different levels of comprehension and underlying\ncognitive processes, as laid out in Bloom's Taxonomy, a classic framework\nwidely adopted in education research. The proposed BloomVQA dataset can be\nmapped to a hierarchical graph-based representation of visual stories, enabling\nautomatic data augmentation and novel measures characterizing model consistency\nacross the underlying taxonomy. We demonstrate graded evaluation and\nreliability analysis based on our proposed consistency metrics on\nstate-of-the-art vision-language models. Our results suggest that, while\ncurrent models achieve the most gain on low-level comprehension tasks, they\ngenerally fall short on high-level tasks requiring more advanced comprehension\nand cognitive skills, as 38.0% drop in VQA accuracy is observed comparing\nlowest and highest level tasks. Furthermore, current models show consistency\npatterns misaligned with human comprehension in various scenarios, suggesting\nemergent structures of model behaviors.\n","authors":["Yunye Gong","Robik Shrestha","Jared Claypoole","Michael Cogswell","Arijit Ray","Christopher Kanan","Ajay Divakaran"],"pdf_url":"https://arxiv.org/pdf/2312.12716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12713v1","updated":"2023-12-20T02:19:54Z","published":"2023-12-20T02:19:54Z","title":"Response Enhanced Semi-Supervised Dialogue Query Generation","summary":" Leveraging vast and continually updated knowledge from the Internet has been\nconsidered an important ability for a dialogue system. Therefore, the dialogue\nquery generation task is proposed for generating search queries from dialogue\nhistories, which will be submitted to a search engine for retrieving relevant\nwebsites on the Internet. In this regard, previous efforts were devoted to\ncollecting conversations with annotated queries and training a query producer\n(QP) via standard supervised learning. However, these studies still face the\nchallenges of data scarcity and domain adaptation. To address these issues, in\nthis paper, we propose a semi-supervised learning framework -- SemiDQG, to\nimprove model performance with unlabeled conversations. Based on the\nobservation that the search query is typically related to the topic of dialogue\nresponse, we train a response-augmented query producer (RA) to provide rich and\neffective training signals for QP. We first apply a similarity-based query\nselection strategy to select high-quality RA-generated pseudo queries, which\nare used to construct pseudo instances for training QP and RA. Then, we adopt\nthe REINFORCE algorithm to further enhance QP, with RA-provided rewards as\nfine-grained training signals. Experimental results and in-depth analysis of\nthree benchmarks show the effectiveness of our framework in cross-domain and\nlow-resource scenarios. Particularly, SemiDQG significantly surpasses ChatGPT\nand competitive baselines. Our code is available at\n\\url{https://github.com/DeepLearnXMU/SemiDQG}.\n","authors":["Jianheng Huang","Ante Wang","Linfeng Gao","Linfeng Song","Jinsong Su"],"pdf_url":"https://arxiv.org/pdf/2312.12713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03560v2","updated":"2023-12-20T01:14:42Z","published":"2023-10-05T14:18:40Z","title":"Redefining Digital Health Interfaces with Large Language Models","summary":" Digital health tools have the potential to significantly improve the delivery\nof healthcare services. However, their adoption remains comparatively limited\ndue, in part, to challenges surrounding usability and trust. Recently, Large\nLanguage Models (LLMs) have emerged as general-purpose models with the ability\nto process complex information and produce human-quality text, presenting a\nwealth of potential applications in healthcare. Directly applying LLMs in\nclinical settings is not straightforward, with LLMs susceptible to providing\ninconsistent or nonsensical answers. We describe how LLM-based systems can\nutilize external tools to provide a novel interface between clinicians and\ndigital technologies. This enhances the utility and practical impact of digital\nhealthcare tools and AI models while addressing current issues with using LLM\nin clinical settings such as hallucinations. We illustrate LLM-based interfaces\nwith examples from cardiovascular disease and diabetes risk prediction,\nhighlighting the benefit compared to traditional interfaces for digital tools.\n","authors":["Fergus Imrie","Paulius Rauba","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2310.03560v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12683v1","updated":"2023-12-20T00:49:52Z","published":"2023-12-20T00:49:52Z","title":"Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is\n Needed?","summary":" The vast majority of today's large language models are English-centric,\nhaving been pretrained predominantly on English text. Yet, in order to meet\nuser expectations, models need to be able to respond appropriately in multiple\nlanguages once deployed in downstream applications. Given limited exposure to\nother languages during pretraining, cross-lingual transfer is important for\nachieving decent performance in non-English settings. In this work, we\ninvestigate just how much multilinguality is required during finetuning to\nelicit strong cross-lingual generalisation across a range of tasks and target\nlanguages. We find that, compared to English-only finetuning, multilingual\ninstruction tuning with as few as three languages significantly improves a\nmodel's cross-lingual transfer abilities on generative tasks that assume\ninput/output language agreement, while being of less importance for highly\nstructured tasks. Our code and data is available at\nhttps://github.com/ZurichNLP/multilingual-instruction-tuning.\n","authors":["Tannon Kew","Florian Schottmann","Rico Sennrich"],"pdf_url":"https://arxiv.org/pdf/2312.12683v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12682v1","updated":"2023-12-20T00:48:13Z","published":"2023-12-20T00:48:13Z","title":"Mini-GPTs: Efficient Large Language Models through Contextual Pruning","summary":" In AI research, the optimization of Large Language Models (LLMs) remains a\nsignificant challenge, crucial for advancing the field's practical applications\nand sustainability. Building upon the foundational work of Professor Song Han's\nlab at MIT, this paper introduces a novel approach in developing Mini-GPTs via\ncontextual pruning. Our methodology strategically prunes the computational\narchitecture of traditional LLMs, like Phi-1.5, focusing on retaining core\nfunctionalities while drastically reducing model sizes. We employ the technique\nacross diverse and complex datasets, including US law, Medical Q&A, Skyrim\ndialogue, English-Taiwanese translation, and Economics articles. The results\nunderscore the efficiency and effectiveness of contextual pruning, not merely\nas a theoretical concept but as a practical tool in developing domain-specific,\nresource-efficient LLMs. Contextual pruning is a promising method for building\ndomain-specific LLMs, and this research is a building block towards future\ndevelopment with more hardware compute, refined fine-tuning, and quantization.\n","authors":["Tim Valicenti","Justice Vidal","Ritik Patnaik"],"pdf_url":"https://arxiv.org/pdf/2312.12682v1.pdf","comment":"7 pages, 4 figures, Neurips 2023 styling"},{"id":"http://arxiv.org/abs/2312.12681v1","updated":"2023-12-20T00:45:27Z","published":"2023-12-20T00:45:27Z","title":"Imitation of Life: A Search Engine for Biologically Inspired Design","summary":" Biologically Inspired Design (BID), or Biomimicry, is a problem-solving\nmethodology that applies analogies from nature to solve engineering challenges.\nFor example, Speedo engineers designed swimsuits based on shark skin. Finding\nrelevant biological solutions for real-world problems poses significant\nchallenges, both due to the limited biological knowledge engineers and\ndesigners typically possess and to the limited BID resources. Existing BID\ndatasets are hand-curated and small, and scaling them up requires costly human\nannotations.\n In this paper, we introduce BARcode (Biological Analogy Retriever), a search\nengine for automatically mining bio-inspirations from the web at scale. Using\nadvances in natural language understanding and data programming, BARcode\nidentifies potential inspirations for engineering challenges. Our experiments\ndemonstrate that BARcode can retrieve inspirations that are valuable to\nengineers and designers tackling real-world problems, as well as recover famous\nhistorical BID examples. We release data and code; we view BARcode as a step\ntowards addressing the challenges that have historically hindered the practical\napplication of BID to engineering innovation.\n","authors":["Hen Emuna","Nadav Borenstein","Xin Qian","Hyeonsu Kang","Joel Chan","Aniket Kittur","Dafna Shahaf"],"pdf_url":"https://arxiv.org/pdf/2312.12681v1.pdf","comment":"To be published in the AAAI 2024 Proceedings Main Track"},{"id":"http://arxiv.org/abs/2311.18260v3","updated":"2023-12-20T23:08:32Z","published":"2023-11-30T05:38:34Z","title":"Consensus, dissensus and synergy between clinicians and specialist\n foundation models in radiology report generation","summary":" Radiology reports are an instrumental part of modern medicine, informing key\nclinical decisions such as diagnosis and treatment. The worldwide shortage of\nradiologists, however, restricts access to expert care and imposes heavy\nworkloads, contributing to avoidable errors and delays in report delivery.\nWhile recent progress in automated report generation with vision-language\nmodels offer clear potential in ameliorating the situation, the path to\nreal-world adoption has been stymied by the challenge of evaluating the\nclinical quality of AI-generated reports. In this study, we build a\nstate-of-the-art report generation system for chest radiographs,\n$\\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation\nmodel on radiology data. To evaluate the quality of the AI-generated reports, a\ngroup of 16 certified radiologists provide detailed evaluations of AI-generated\nand human written reports for chest X-rays from an intensive care setting in\nthe United States and an inpatient setting in India. At least one radiologist\n(out of two per case) preferred the AI report to the ground truth report in\nover 60$\\%$ of cases for both datasets. Amongst the subset of AI-generated\nreports that contain errors, the most frequently cited reasons were related to\nthe location and finding, whereas for human written reports, most mistakes were\nrelated to severity and finding. This disparity suggested potential\ncomplementarity between our AI system and human experts, prompting us to\ndevelop an assistive scenario in which Flamingo-CXR generates a first-draft\nreport, which is subsequently revised by a clinician. This is the first\ndemonstration of clinician-AI collaboration for report writing, and the\nresultant reports are assessed to be equivalent or preferred by at least one\nradiologist to reports written by experts alone in 80$\\%$ of in-patient cases\nand 60$\\%$ of intensive care cases.\n","authors":["Ryutaro Tanno","David G. T. Barrett","Andrew Sellergren","Sumedh Ghaisas","Sumanth Dathathri","Abigail See","Johannes Welbl","Karan Singhal","Shekoofeh Azizi","Tao Tu","Mike Schaekermann","Rhys May","Roy Lee","SiWai Man","Zahra Ahmed","Sara Mahdavi","Yossi Matias","Joelle Barral","Ali Eslami","Danielle Belgrave","Vivek Natarajan","Shravya Shetty","Pushmeet Kohli","Po-Sen Huang","Alan Karthikesalingam","Ira Ktena"],"pdf_url":"https://arxiv.org/pdf/2311.18260v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11326v3","updated":"2023-12-20T22:40:49Z","published":"2022-09-22T21:40:51Z","title":"Towards Faithful Model Explanation in NLP: A Survey","summary":" End-to-end neural Natural Language Processing (NLP) models are notoriously\ndifficult to understand. This has given rise to numerous efforts towards model\nexplainability in recent years. One desideratum of model explanation is\nfaithfulness, i.e. an explanation should accurately represent the reasoning\nprocess behind the model's prediction. In this survey, we review over 110 model\nexplanation methods in NLP through the lens of faithfulness. We first discuss\nthe definition and evaluation of faithfulness, as well as its significance for\nexplainability. We then introduce recent advances in faithful explanation,\ngrouping existing approaches into five categories: similarity-based methods,\nanalysis of model-internal structures, backpropagation-based methods,\ncounterfactual intervention, and self-explanatory models. For each category, we\nsynthesize its representative studies, strengths, and weaknesses. Finally, we\nsummarize their common virtues and remaining challenges, and reflect on future\nwork directions towards faithful explainability in NLP.\n","authors":["Qing Lyu","Marianna Apidianaki","Chris Callison-Burch"],"pdf_url":"https://arxiv.org/pdf/2209.11326v3.pdf","comment":"Revision round #2 for the Computational Linguistics journal"},{"id":"http://arxiv.org/abs/2312.05964v2","updated":"2023-12-20T22:10:27Z","published":"2023-12-10T18:43:37Z","title":"ConSequence: Synthesizing Logically Constrained Sequences for Electronic\n Health Record Generation","summary":" Generative models can produce synthetic patient records for analytical tasks\nwhen real data is unavailable or limited. However, current methods struggle\nwith adhering to domain-specific knowledge and removing invalid data. We\npresent ConSequence, an effective approach to integrating domain knowledge into\nsequential generative neural network outputs. Our rule-based formulation\nincludes temporal aggregation and antecedent evaluation modules, ensured by an\nefficient matrix multiplication formulation, to satisfy hard and soft logical\nconstraints across time steps. Existing constraint methods often fail to\nguarantee constraint satisfaction, lack the ability to handle temporal\nconstraints, and hinder the learning and computational efficiency of the model.\nIn contrast, our approach efficiently handles all types of constraints with\nguaranteed logical coherence. We demonstrate ConSequence's effectiveness in\ngenerating electronic health records, outperforming competitors in achieving\ncomplete temporal and spatial constraint satisfaction without compromising\nruntime performance or generative quality. Specifically, ConSequence\nsuccessfully prevents all rule violations while improving the model quality in\nreducing its test perplexity by 5% and incurring less than a 13% slowdown in\ngeneration speed compared to an unconstrained model.\n","authors":["Brandon Theodorou","Shrusti Jain","Cao Xiao","Jimeng Sun"],"pdf_url":"https://arxiv.org/pdf/2312.05964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13437v1","updated":"2023-12-20T21:28:35Z","published":"2023-12-20T21:28:35Z","title":"A General Model for Aggregating Annotations Across Simple, Complex, and\n Multi-Object Annotation Tasks","summary":" Human annotations are vital to supervised learning, yet annotators often\ndisagree on the correct label, especially as annotation tasks increase in\ncomplexity. A strategy to improve label quality is to ask multiple annotators\nto label the same item and aggregate their labels. Many aggregation models have\nbeen proposed for categorical or numerical annotation tasks, but far less work\nhas considered more complex annotation tasks involving open-ended,\nmultivariate, or structured responses. While a variety of bespoke models have\nbeen proposed for specific tasks, our work is the first to introduce\naggregation methods that generalize across many diverse complex tasks,\nincluding sequence labeling, translation, syntactic parsing, ranking, bounding\nboxes, and keypoints. This generality is achieved by devising a task-agnostic\nmethod to model distances between labels rather than the labels themselves.\n This article extends our prior work with investigation of three new research\nquestions. First, how do complex annotation properties impact aggregation\naccuracy? Second, how should a task owner navigate the many modeling choices to\nmaximize aggregation accuracy? Finally, what diagnoses can verify that\naggregation models are specified correctly for the given data? To understand\nhow various factors impact accuracy and to inform model selection, we conduct\nsimulation studies and experiments on real, complex datasets. Regarding\ntesting, we introduce unit tests for aggregation models and present a suite of\nsuch tests to ensure that a given model is not mis-specified and exhibits\nexpected behavior.\n Beyond investigating these research questions above, we discuss the\nfoundational concept of annotation complexity, present a new aggregation model\nas a bridge between traditional models and our own, and contribute a new\nsemi-supervised learning method for complex label aggregation that outperforms\nprior work.\n","authors":["Alexander Braylan","Madalyn Marabella","Omar Alonso","Matthew Lease"],"pdf_url":"https://arxiv.org/pdf/2312.13437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13423v1","updated":"2023-12-20T21:02:09Z","published":"2023-12-20T21:02:09Z","title":"VADIS -- a VAriable Detection, Interlinking and Summarization system","summary":" The VADIS system addresses the demand of providing enhanced information\naccess in the domain of the social sciences. This is achieved by allowing users\nto search and use survey variables in context of their underlying research data\nand scholarly publications which have been interlinked with each other.\n","authors":["Yavuz Selim Kartal","Muhammad Ahsan Shahid","Sotaro Takeshita","Tornike Tsereteli","Andrea Zielinski","Benjamin Zapilko","Philipp Mayr"],"pdf_url":"https://arxiv.org/pdf/2312.13423v1.pdf","comment":"It is 4 pages and 2 figures. This paper has recently been accepted by\n ECIR 2024 Demo Track and this version is the camera-ready version of the\n paper"},{"id":"http://arxiv.org/abs/2303.10512v2","updated":"2023-12-20T20:56:14Z","published":"2023-03-18T22:36:25Z","title":"AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning","summary":" Fine-tuning large pre-trained language models on downstream tasks has become\nan important paradigm in NLP. However, common practice fine-tunes all of the\nparameters in a pre-trained model, which becomes prohibitive when a large\nnumber of downstream tasks are present. Therefore, many fine-tuning methods are\nproposed to learn incremental updates of pre-trained weights in a parameter\nefficient way, e.g., low-rank increments. These methods often evenly distribute\nthe budget of incremental updates across all pre-trained weight matrices, and\noverlook the varying importance of different weight parameters. As a\nconsequence, the fine-tuning performance is suboptimal. To bridge this gap, we\npropose AdaLoRA, which adaptively allocates the parameter budget among weight\nmatrices according to their importance score. In particular, AdaLoRA\nparameterizes the incremental updates in the form of singular value\ndecomposition. Such a novel approach allows us to effectively prune the\nsingular values of unimportant updates, which is essentially to reduce their\nparameter budget but circumvent intensive exact SVD computations. We conduct\nextensive experiments with several pre-trained models on natural language\nprocessing, question answering, and natural language generation to validate the\neffectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable\nimprovement over baselines, especially in the low budget settings. Our code is\npublicly available at https://github.com/QingruZhang/AdaLoRA .\n","authors":["Qingru Zhang","Minshuo Chen","Alexander Bukharin","Nikos Karampatziakis","Pengcheng He","Yu Cheng","Weizhu Chen","Tuo Zhao"],"pdf_url":"https://arxiv.org/pdf/2303.10512v2.pdf","comment":"The 11th International Conference on Learning Representations (ICLR\n 2023)"},{"id":"http://arxiv.org/abs/2307.15043v2","updated":"2023-12-20T20:48:57Z","published":"2023-07-27T17:49:12Z","title":"Universal and Transferable Adversarial Attacks on Aligned Language\n Models","summary":" Because \"out-of-the-box\" large language models are capable of generating a\ngreat deal of objectionable content, recent work has focused on aligning these\nmodels in an attempt to prevent undesirable generation. While there has been\nsome success at circumventing these measures -- so-called \"jailbreaks\" against\nLLMs -- these attacks have required significant human ingenuity and are brittle\nin practice. In this paper, we propose a simple and effective attack method\nthat causes aligned language models to generate objectionable behaviors.\nSpecifically, our approach finds a suffix that, when attached to a wide range\nof queries for an LLM to produce objectionable content, aims to maximize the\nprobability that the model produces an affirmative response (rather than\nrefusing to answer). However, instead of relying on manual engineering, our\napproach automatically produces these adversarial suffixes by a combination of\ngreedy and gradient-based search techniques, and also improves over past\nautomatic prompt generation methods.\n Surprisingly, we find that the adversarial prompts generated by our approach\nare quite transferable, including to black-box, publicly released LLMs.\nSpecifically, we train an adversarial attack suffix on multiple prompts (i.e.,\nqueries asking for many different types of objectionable content), as well as\nmultiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting\nattack suffix is able to induce objectionable content in the public interfaces\nto ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,\nPythia, Falcon, and others. In total, this work significantly advances the\nstate-of-the-art in adversarial attacks against aligned language models,\nraising important questions about how such systems can be prevented from\nproducing objectionable information. Code is available at\ngithub.com/llm-attacks/llm-attacks.\n","authors":["Andy Zou","Zifan Wang","Nicholas Carlini","Milad Nasr","J. Zico Kolter","Matt Fredrikson"],"pdf_url":"https://arxiv.org/pdf/2307.15043v2.pdf","comment":"Website: http://llm-attacks.org/"},{"id":"http://arxiv.org/abs/2312.13401v1","updated":"2023-12-20T20:04:45Z","published":"2023-12-20T20:04:45Z","title":"Time is Encoded in the Weights of Finetuned Language Models","summary":" We present time vectors, a simple tool to customize language models to new\ntime periods. Time vectors are created by finetuning a language model on data\nfrom a single time (e.g., a year or month), and then subtracting the weights of\nthe original pretrained model. This vector specifies a direction in weight\nspace that, as our experiments show, improves performance on text from that\ntime period. Time vectors specialized to adjacent time periods appear to be\npositioned closer together in a manifold. Using this structure, we interpolate\nbetween time vectors to induce new models that perform better on intervening\nand future time periods, without any additional training. We demonstrate the\nconsistency of our findings across different tasks, domains, model sizes, and\ntime scales. Our results suggest that time is encoded in the weight space of\nfinetuned models.\n","authors":["Kai Nylund","Suchin Gururangan","Noah A. Smith"],"pdf_url":"https://arxiv.org/pdf/2312.13401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13382v1","updated":"2023-12-20T19:13:26Z","published":"2023-12-20T19:13:26Z","title":"DSPy Assertions: Computational Constraints for Self-Refining Language\n Model Pipelines","summary":" Chaining language model (LM) calls as composable modules is fueling a new\npowerful way of programming. However, ensuring that LMs adhere to important\nconstraints remains a key challenge, one often addressed with heuristic \"prompt\nengineering\". We introduce LM Assertions, a new programming construct for\nexpressing computational constraints that LMs should satisfy. We integrate our\nconstructs into the recent DSPy programming model for LMs, and present new\nstrategies that allow DSPy to compile programs with arbitrary LM Assertions\ninto systems that are more reliable and more accurate. In DSPy, LM Assertions\ncan be integrated at compile time, via automatic prompt optimization, and/or at\ninference time, via automatic selfrefinement and backtracking. We report on two\nearly case studies for complex question answering (QA), in which the LM program\nmust iteratively retrieve information in multiple hops and synthesize a\nlong-form answer with citations. We find that LM Assertions improve not only\ncompliance with imposed rules and guidelines but also enhance downstream task\nperformance, delivering intrinsic and extrinsic gains up to 35.7% and 13.3%,\nrespectively. Our reference implementation of LM Assertions is integrated into\nDSPy at https://github.com/stanfordnlp/dspy\n","authors":["Arnav Singhvi","Manish Shetty","Shangyin Tan","Christopher Potts","Koushik Sen","Matei Zaharia","Omar Khattab"],"pdf_url":"https://arxiv.org/pdf/2312.13382v1.pdf","comment":"Arnav*, Manish*, Shangyin* contributed equally to this work"},{"id":"http://arxiv.org/abs/2312.14187v1","updated":"2023-12-20T09:02:29Z","published":"2023-12-20T09:02:29Z","title":"WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with\n Refined Data Generation","summary":" Recent work demonstrates that, after being fine-tuned on a high-quality\ninstruction dataset, the resulting model can obtain impressive capabilities to\naddress a wide range of tasks. However, existing methods for instruction data\ngeneration often produce duplicate data and are not controllable enough on data\nquality. In this paper, we extend the generalization of instruction tuning by\nclassifying the instruction data to 4 code-related tasks and propose a\nLLM-based Generator-Discriminator data process framework to generate diverse,\nhigh-quality instruction data from open source code. Hence, we introduce\nCodeOcean, a dataset comprising 20,000 instruction instances across 4 universal\ncode-related tasks,which is aimed at augmenting the effectiveness of\ninstruction tuning and improving the generalization ability of fine-tuned\nmodel. Subsequently, we present WaveCoder, a fine-tuned Code LLM with\nWidespread And Versatile Enhanced instruction tuning. This model is\nspecifically designed for enhancing instruction tuning of Code Language Models\n(LLMs). Our experiments demonstrate that Wavecoder models outperform other\nopen-source models in terms of generalization ability across different\ncode-related tasks at the same level of fine-tuning scale. Moreover, Wavecoder\nexhibits high efficiency in previous code generation tasks. This paper thus\noffers a significant contribution to the field of instruction data generation\nand fine-tuning models, providing new insights and tools for enhancing\nperformance in code-related tasks.\n","authors":["Zhaojian Yu","Xin Zhang","Ning Shang","Yangyu Huang","Can Xu","Yishujie Zhao","Wenxiang Hu","Qiufeng Yin"],"pdf_url":"https://arxiv.org/pdf/2312.14187v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.13286v1","updated":"2023-12-20T18:59:58Z","published":"2023-12-20T18:59:58Z","title":"Generative Multimodal Models are In-Context Learners","summary":" The human ability to easily solve multimodal tasks in context (i.e., with\nonly a few demonstrations or simple instructions), is what current multimodal\nsystems have largely struggled to imitate. In this work, we demonstrate that\nthe task-agnostic in-context learning capabilities of large multimodal models\ncan be significantly enhanced by effective scaling-up. We introduce Emu2, a\ngenerative multimodal model with 37 billion parameters, trained on large-scale\nmultimodal sequences with a unified autoregressive objective. Emu2 exhibits\nstrong multimodal in-context learning abilities, even emerging to solve tasks\nthat require on-the-fly reasoning, such as visual prompting and object-grounded\ngeneration. The model sets a new record on multiple multimodal understanding\ntasks in few-shot settings. When instruction-tuned to follow specific\ninstructions, Emu2 further achieves new state-of-the-art on challenging tasks\nsuch as question answering benchmarks for large multimodal models and\nopen-ended subject-driven generation. These achievements demonstrate that Emu2\ncan serve as a base model and general-purpose interface for a wide range of\nmultimodal tasks. Code and models are publicly available to facilitate future\nresearch.\n","authors":["Quan Sun","Yufeng Cui","Xiaosong Zhang","Fan Zhang","Qiying Yu","Zhengxiong Luo","Yueze Wang","Yongming Rao","Jingjing Liu","Tiejun Huang","Xinlong Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13286v1.pdf","comment":"Project page: https://baaivision.github.io/emu2"},{"id":"http://arxiv.org/abs/2312.13285v1","updated":"2023-12-20T18:59:42Z","published":"2023-12-20T18:59:42Z","title":"UniSDF: Unifying Neural Representations for High-Fidelity 3D\n Reconstruction of Complex Scenes with Reflections","summary":" Neural 3D scene representations have shown great potential for 3D\nreconstruction from 2D images. However, reconstructing real-world captures of\ncomplex scenes still remains a challenge. Existing generic 3D reconstruction\nmethods often struggle to represent fine geometric details and do not\nadequately model reflective surfaces of large-scale scenes. Techniques that\nexplicitly focus on reflective surfaces can model complex and detailed\nreflections by exploiting better reflection parameterizations. However, we\nobserve that these methods are often not robust in real unbounded scenarios\nwhere non-reflective as well as reflective components are present. In this\nwork, we propose UniSDF, a general purpose 3D reconstruction method that can\nreconstruct large complex scenes with reflections. We investigate both\nview-based as well as reflection-based color prediction parameterization\ntechniques and find that explicitly blending these representations in 3D space\nenables reconstruction of surfaces that are more geometrically accurate,\nespecially for reflective surfaces. We further combine this representation with\na multi-resolution grid backbone that is trained in a coarse-to-fine manner,\nenabling faster reconstructions than prior methods. Extensive experiments on\nobject-level datasets DTU, Shiny Blender as well as unbounded datasets Mip-NeRF\n360 and Ref-NeRF real demonstrate that our method is able to robustly\nreconstruct complex large-scale scenes with fine details and reflective\nsurfaces. Please see our project page at\nhttps://fangjinhuawang.github.io/UniSDF.\n","authors":["Fangjinhua Wang","Marie-Julie Rakotosaona","Michael Niemeyer","Richard Szeliski","Marc Pollefeys","Federico Tombari"],"pdf_url":"https://arxiv.org/pdf/2312.13285v1.pdf","comment":"Project page: https://fangjinhuawang.github.io/UniSDF"},{"id":"http://arxiv.org/abs/2312.12143v2","updated":"2023-12-20T18:58:17Z","published":"2023-12-19T13:23:49Z","title":"Integrating Human Vision Perception in Vision Transformers for\n Classifying Waste Items","summary":" In this paper, we propose an novel methodology aimed at simulating the\nlearning phenomenon of nystagmus through the application of differential\nblurring on datasets. Nystagmus is a biological phenomenon that influences\nhuman vision throughout life, notably by diminishing head shake from infancy to\nadulthood. Leveraging this concept, we address the issue of waste\nclassification, a pressing global concern. The proposed framework comprises two\nmodules, with the second module closely resembling the original Vision\nTransformer, a state-of-the-art model model in classification tasks. The\nprimary motivation behind our approach is to enhance the model's precision and\nadaptability, mirroring the real-world conditions that the human visual system\nundergoes. This novel methodology surpasses the standard Vision Transformer\nmodel in waste classification tasks, exhibiting an improvement with a margin of\n2%. This improvement underscores the potential of our methodology in improving\nmodel precision by drawing inspiration from human vision perception. Further\nresearch in the proposed methodology could yield greater performance results,\nand can be extrapolated to other global issues.\n","authors":["Akshat Kishore Shrivastava","Tapan Kumar Gandhi"],"pdf_url":"https://arxiv.org/pdf/2312.12143v2.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.13277v1","updated":"2023-12-20T18:56:45Z","published":"2023-12-20T18:56:45Z","title":"Deep Learning on 3D Neural Fields","summary":" In recent years, Neural Fields (NFs) have emerged as an effective tool for\nencoding diverse continuous signals such as images, videos, audio, and 3D\nshapes. When applied to 3D data, NFs offer a solution to the fragmentation and\nlimitations associated with prevalent discrete representations. However, given\nthat NFs are essentially neural networks, it remains unclear whether and how\nthey can be seamlessly integrated into deep learning pipelines for solving\ndownstream tasks. This paper addresses this research problem and introduces\nnf2vec, a framework capable of generating a compact latent representation for\nan input NF in a single inference pass. We demonstrate that nf2vec effectively\nembeds 3D objects represented by the input NFs and showcase how the resulting\nembeddings can be employed in deep learning pipelines to successfully address\nvarious tasks, all while processing exclusively NFs. We test this framework on\nseveral NFs used to represent 3D surfaces, such as unsigned/signed distance and\noccupancy fields. Moreover, we demonstrate the effectiveness of our approach\nwith more complex NFs that encompass both geometry and appearance of 3D objects\nsuch as neural radiance fields.\n","authors":["Pierluigi Zama Ramirez","Luca De Luigi","Daniele Sirocchi","Adriano Cardace","Riccardo Spezialetti","Francesco Ballerini","Samuele Salti","Luigi Di Stefano"],"pdf_url":"https://arxiv.org/pdf/2312.13277v1.pdf","comment":"Extended version of the paper \"Deep Learning on Implicit Neural\n Representations of Shapes\" that was presented at ICLR 2023. arXiv admin note:\n text overlap with arXiv:2302.05438"},{"id":"http://arxiv.org/abs/2312.08488v2","updated":"2023-12-20T18:53:23Z","published":"2023-12-13T20:08:26Z","title":"PnP for Two-Dimensional Pose Estimation","summary":" We propose a PnP algorithm for a camera constrained to two-dimensional\nmovement (applicable, for instance, to many wheeled robotics platforms).\nLeveraging this assumption allows performance improvements over 3D PnP\nalgorithms due to the reduction in search space dimensionality. It also reduces\nthe incidence of ambiguous pose estimates (as, in most cases, the spurious\nsolutions fall outside the plane of movement). Our algorithm finds an\napproximate solution using geometric criteria and refines its prediction\niteratively. We compare this algorithm to existing 3D PnP algorithms in terms\nof accuracy, performance, and robustness to noise.\n","authors":["Joshua Wang"],"pdf_url":"https://arxiv.org/pdf/2312.08488v2.pdf","comment":"4 pages, 3 figures. Improved testing figures from version 1"},{"id":"http://arxiv.org/abs/2305.15296v3","updated":"2023-12-20T18:52:00Z","published":"2023-05-24T16:22:18Z","title":"MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal\n Image Generation","summary":" The recent popularity of text-to-image diffusion models (DM) can largely be\nattributed to the intuitive interface they provide to users. The intended\ngeneration can be expressed in natural language, with the model producing\nfaithful interpretations of text prompts. However, expressing complex or\nnuanced ideas in text alone can be difficult. To ease image generation, we\npropose MultiFusion that allows one to express complex and nuanced concepts\nwith arbitrarily interleaved inputs of multiple modalities and languages.\nMutliFusion leverages pre-trained models and aligns them for integration into a\ncohesive system, thereby avoiding the need for extensive training from scratch.\nOur experimental results demonstrate the efficient transfer of capabilities\nfrom individual modules to the downstream model. Specifically, the fusion of\nall independent components allows the image generation module to utilize\nmultilingual, interleaved multimodal inputs despite being trained solely on\nmonomodal data in a single language.\n","authors":["Marco Bellagente","Manuel Brack","Hannah Teufel","Felix Friedrich","Björn Deiseroth","Constantin Eichenberg","Andrew Dai","Robert Baldock","Souradeep Nanda","Koen Oostermeijer","Andres Felipe Cruz-Salinas","Patrick Schramowski","Kristian Kersting","Samuel Weinbach"],"pdf_url":"https://arxiv.org/pdf/2305.15296v3.pdf","comment":"Proceedings of Advances in Neural Information Processing Systems:\n Annual Conference on Neural Information Processing Systems (NeurIPS)"},{"id":"http://arxiv.org/abs/2312.13271v1","updated":"2023-12-20T18:51:02Z","published":"2023-12-20T18:51:02Z","title":"Repaint123: Fast and High-quality One Image to 3D Generation with\n Progressive Controllable 2D Repainting","summary":" Recent one image to 3D generation methods commonly adopt Score Distillation\nSampling (SDS). Despite the impressive results, there are multiple deficiencies\nincluding multi-view inconsistency, over-saturated and over-smoothed textures,\nas well as the slow generation speed. To address these deficiencies, we present\nRepaint123 to alleviate multi-view bias as well as texture degradation and\nspeed up the generation process. The core idea is to combine the powerful image\ngeneration capability of the 2D diffusion model and the texture alignment\nability of the repainting strategy for generating high-quality multi-view\nimages with consistency. We further propose visibility-aware adaptive\nrepainting strength for overlap regions to enhance the generated image quality\nin the repainting process. The generated high-quality and multi-view consistent\nimages enable the use of simple Mean Square Error (MSE) loss for fast 3D\ncontent generation. We conduct extensive experiments and show that our method\nhas a superior ability to generate high-quality 3D content with multi-view\nconsistency and fine textures in 2 minutes from scratch. Code is at\nhttps://github.com/junwuzhang19/repaint123.\n","authors":["Junwu Zhang","Zhenyu Tang","Yatian Pang","Xinhua Cheng","Peng Jin","Yida Wei","Wangbo Yu","Munan Ning","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.13271v1.pdf","comment":"Code: https://github.com/junwuzhang19/repaint123"},{"id":"http://arxiv.org/abs/2312.13265v1","updated":"2023-12-20T18:43:20Z","published":"2023-12-20T18:43:20Z","title":"ClassLIE: Structure- and Illumination-Adaptive Classification for\n Low-Light Image Enhancement","summary":" Low-light images often suffer from limited visibility and multiple types of\ndegradation, rendering low-light image enhancement (LIE) a non-trivial task.\nSome endeavors have been recently made to enhance low-light images using\nconvolutional neural networks (CNNs). However, they have low efficiency in\nlearning the structural information and diverse illumination levels at the\nlocal regions of an image. Consequently, the enhanced results are affected by\nunexpected artifacts, such as unbalanced exposure, blur, and color bias. To\nthis end, this paper proposes a novel framework, called ClassLIE, that combines\nthe potential of CNNs and transformers. It classifies and adaptively learns the\nstructural and illumination information from the low-light images in a holistic\nand regional manner, thus showing better enhancement performance. Our framework\nfirst employs a structure and illumination classification (SIC) module to learn\nthe degradation information adaptively. In SIC, we decompose an input image\ninto an illumination map and a reflectance map. A class prediction block is\nthen designed to classify the degradation information by calculating the\nstructure similarity scores on the reflectance map and mean square error on the\nillumination map. As such, each input image can be divided into patches with\nthree enhancement difficulty levels. Then, a feature learning and fusion (FLF)\nmodule is proposed to adaptively learn the feature information with CNNs for\ndifferent enhancement difficulty levels while learning the long-range\ndependencies for the patches in a holistic manner. Experiments on five\nbenchmark datasets consistently show our ClassLIE achieves new state-of-the-art\nperformance, with 25.74 PSNR and 0.92 SSIM on the LOL dataset.\n","authors":["Zixiang Wei","Yiting Wang","Lichao Sun","Athanasios V. Vasilakos","Lin Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13253v1","updated":"2023-12-20T18:27:53Z","published":"2023-12-20T18:27:53Z","title":"Conditional Image Generation with Pretrained Generative Model","summary":" In recent years, diffusion models have gained popularity for their ability to\ngenerate higher-quality images in comparison to GAN models. However, like any\nother large generative models, these models require a huge amount of data,\ncomputational resources, and meticulous tuning for successful training. This\nposes a significant challenge, rendering it infeasible for most individuals. As\na result, the research community has devised methods to leverage pre-trained\nunconditional diffusion models with additional guidance for the purpose of\nconditional image generative. These methods enable conditional image\ngenerations on diverse inputs and, most importantly, circumvent the need for\ntraining the diffusion model. In this paper, our objective is to reduce the\ntime-required and computational overhead introduced by the addition of guidance\nin diffusion models -- while maintaining comparable image quality. We propose a\nset of methods based on our empirical analysis, demonstrating a reduction in\ncomputation time by approximately threefold.\n","authors":["Rajesh Shrestha","Bowen Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13252v1","updated":"2023-12-20T18:27:47Z","published":"2023-12-20T18:27:47Z","title":"Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model","summary":" While methods for monocular depth estimation have made significant strides on\nstandard benchmarks, zero-shot metric depth estimation remains unsolved.\nChallenges include the joint modeling of indoor and outdoor scenes, which often\nexhibit significantly different distributions of RGB and depth, and the\ndepth-scale ambiguity due to unknown camera intrinsics. Recent work has\nproposed specialized multi-head architectures for jointly modeling indoor and\noutdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion\nmodel, with several advancements such as log-scale depth parameterization to\nenable joint modeling of indoor and outdoor scenes, conditioning on the\nfield-of-view (FOV) to handle scale ambiguity and synthetically augmenting FOV\nduring training to generalize beyond the limited camera intrinsics in training\ndatasets. Furthermore, by employing a more diverse training mixture than is\ncommon, and an efficient diffusion parameterization, our method, DMD (Diffusion\nfor Metric Depth) achieves a 25\\% reduction in relative error (REL) on\nzero-shot indoor and 33\\% reduction on zero-shot outdoor datasets over the\ncurrent SOTA using only a small number of denoising steps. For an overview see\nhttps://diffusion-vision.github.io/dmd\n","authors":["Saurabh Saxena","Junhwa Hur","Charles Herrmann","Deqing Sun","David J. Fleet"],"pdf_url":"https://arxiv.org/pdf/2312.13252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13250v1","updated":"2023-12-20T18:25:15Z","published":"2023-12-20T18:25:15Z","title":"The role of data embedding in equivariant quantum convolutional neural\n networks","summary":" Geometric deep learning refers to the scenario in which the symmetries of a\ndataset are used to constrain the parameter space of a neural network and thus,\nimprove their trainability and generalization. Recently this idea has been\nincorporated into the field of quantum machine learning, which has given rise\nto equivariant quantum neural networks (EQNNs). In this work, we investigate\nthe role of classical-to-quantum embedding on the performance of equivariant\nquantum convolutional neural networks (EQCNNs) for the classification of\nimages. We discuss the connection between the data embedding method and the\nresulting representation of a symmetry group and analyze how changing\nrepresentation affects the expressibility of an EQCNN. We numerically compare\nthe classification accuracy of EQCNNs with three different basis-permuted\namplitude embeddings to the one obtained from a non-equivariant quantum\nconvolutional neural network (QCNN). Our results show that all the EQCNNs\nachieve higher classification accuracy than the non-equivariant QCNN for small\nnumbers of training iterations, while for large iterations this improvement\ncrucially depends on the used embedding. It is expected that the results of\nthis work can be useful to the community for a better understanding of the\nimportance of data embedding choice in the context of geometric quantum machine\nlearning.\n","authors":["Sreetama Das","Stefano Martina","Filippo Caruso"],"pdf_url":"https://arxiv.org/pdf/2312.13250v1.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.13240v1","updated":"2023-12-20T18:08:02Z","published":"2023-12-20T18:08:02Z","title":"Efficient Verification-Based Face Identification","summary":" We study the problem of performing face verification with an efficient neural\nmodel $f$. The efficiency of $f$ stems from simplifying the face verification\nproblem from an embedding nearest neighbor search into a binary problem; each\nuser has its own neural network $f$. To allow information sharing between\ndifferent individuals in the training set, we do not train $f$ directly but\ninstead generate the model weights using a hypernetwork $h$. This leads to the\ngeneration of a compact personalized model for face identification that can be\ndeployed on edge devices. Key to the method's success is a novel way of\ngenerating hard negatives and carefully scheduling the training objectives. Our\nmodel leads to a substantially small $f$ requiring only 23k parameters and 5M\nfloating point operations (FLOPS). We use six face verification datasets to\ndemonstrate that our method is on par or better than state-of-the-art models,\nwith a significantly reduced number of parameters and computational burden.\nFurthermore, we perform an extensive ablation study to demonstrate the\nimportance of each element in our method.\n","authors":["Amit Rozner","Barak Battash","Ofir Lindenbaum","Lior Wolf"],"pdf_url":"https://arxiv.org/pdf/2312.13240v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.13236v1","updated":"2023-12-20T18:00:16Z","published":"2023-12-20T18:00:16Z","title":"Diffusion Models With Learned Adaptive Noise","summary":" Diffusion models have gained traction as powerful algorithms for synthesizing\nhigh-quality images. Central to these algorithms is the diffusion process,\nwhich maps data to noise according to equations inspired by thermodynamics and\ncan significantly impact performance. A widely held assumption is that the ELBO\nobjective of a diffusion model is invariant to the noise process (Kingma et\nal.,2021). In this work, we dispel this assumption -- we propose multivariate\nlearned adaptive noise (MuLAN), a learned diffusion process that applies\nGaussian noise at different rates across an image. Our method consists of three\ncomponents -- a multivariate noise schedule, instance-conditional diffusion,\nand auxiliary variables -- which ensure that the learning objective is no\nlonger invariant to the choice of the noise schedule as in previous works. Our\nwork is grounded in Bayesian inference and casts the learned diffusion process\nas an approximate variational posterior that yields a tighter lower bound on\nmarginal likelihood. Empirically, MuLAN sets a new state-of-the-art in density\nestimation on CIFAR-10 and ImageNet compared to classical diffusion. Code is\navailable at https://github.com/s-sahoo/MuLAN\n","authors":["Subham Sekhar Sahoo","Aaron Gokaslan","Chris De Sa","Volodymyr Kuleshov"],"pdf_url":"https://arxiv.org/pdf/2312.13236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13223v1","updated":"2023-12-20T17:46:48Z","published":"2023-12-20T17:46:48Z","title":"StableKD: Breaking Inter-block Optimization Entanglement for Stable\n Knowledge Distillation","summary":" Knowledge distillation (KD) has been recognized as an effective tool to\ncompress and accelerate models. However, current KD approaches generally suffer\nfrom an accuracy drop and/or an excruciatingly long distillation process. In\nthis paper, we tackle the issue by first providing a new insight into a\nphenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which\nmakes the conventional end-to-end KD approaches unstable with noisy gradients.\nWe then propose StableKD, a novel KD framework that breaks the IBOE and\nachieves more stable optimization. StableKD distinguishes itself through two\noperations: Decomposition and Recomposition, where the former divides a pair of\nteacher and student networks into several blocks for separate distillation, and\nthe latter progressively merges them back, evolving towards end-to-end\ndistillation. We conduct extensive experiments on CIFAR100, Imagewoof, and\nImageNet datasets with various teacher-student pairs. Compared to other KD\napproaches, our simple yet effective StableKD greatly boosts the model accuracy\nby 1% ~ 18%, speeds up the convergence up to 10 times, and outperforms them\nwith only 40% of the training data.\n","authors":["Shiu-hong Kao","Jierun Chen","S. H. Gary Chan"],"pdf_url":"https://arxiv.org/pdf/2312.13223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13220v1","updated":"2023-12-20T17:38:56Z","published":"2023-12-20T17:38:56Z","title":"SISMIK for brain MRI: Deep-learning-based motion estimation and\n model-based motion correction in k-space","summary":" MRI, a widespread non-invasive medical imaging modality, is highly sensitive\nto patient motion. Despite many attempts over the years, motion correction\nremains a difficult problem and there is no general method applicable to all\nsituations. We propose a retrospective method for motion quantification and\ncorrection to tackle the problem of in-plane rigid-body motion, apt for\nclassical 2D Spin-Echo scans of the brain, which are regularly used in clinical\npractice. Due to the sequential acquisition of k-space, motion artifacts are\nwell localized. The method leverages the power of deep neural networks to\nestimate motion parameters in k-space and uses a model-based approach to\nrestore degraded images to avoid ''hallucinations''. Notable advantages are its\nability to estimate motion occurring in high spatial frequencies without the\nneed of a motion-free reference. The proposed method operates on the whole\nk-space dynamic range and is moderately affected by the lower SNR of higher\nharmonics. As a proof of concept, we provide models trained using supervised\nlearning on 600k motion simulations based on motion-free scans of 43 different\nsubjects. Generalization performance was tested with simulations as well as\nin-vivo. Qualitative and quantitative evaluations are presented for motion\nparameter estimations and image reconstruction. Experimental results show that\nour approach is able to obtain good generalization performance on simulated\ndata and in-vivo acquisitions.\n","authors":["Oscar Dabrowski","Jean-Luc Falcone","Antoine Klauser","Julien Songeon","Michel Kocher","Bastien Chopard","François Lazeyras","Sébastien Courvoisier"],"pdf_url":"https://arxiv.org/pdf/2312.13220v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13219v1","updated":"2023-12-20T17:38:04Z","published":"2023-12-20T17:38:04Z","title":"Interactive Visual Task Learning for Robots","summary":" We present a framework for robots to learn novel visual concepts and tasks\nvia in-situ linguistic interactions with human users. Previous approaches have\neither used large pre-trained visual models to infer novel objects zero-shot,\nor added novel concepts along with their attributes and representations to a\nconcept hierarchy. We extend the approaches that focus on learning visual\nconcept hierarchies by enabling them to learn novel concepts and solve unseen\nrobotics tasks with them. To enable a visual concept learner to solve robotics\ntasks one-shot, we developed two distinct techniques. Firstly, we propose a\nnovel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which\naugments information of a novel concept to its parent nodes within a concept\nhierarchy. This information propagation allows all concepts in a hierarchy to\nupdate as novel concepts are taught in a continual learning setting. Secondly,\nwe represent a visual task as a scene graph with language annotations, allowing\nus to create novel permutations of a demonstrated task zero-shot in-situ. We\npresent two sets of results. Firstly, we compare Hi-Viscont with the baseline\nmodel (FALCON) on visual question answering(VQA) in three domains. While being\ncomparable to the baseline model on leaf level concepts, Hi-Viscont achieves an\nimprovement of over 9% on non-leaf concepts on average. We compare our model's\nperformance against the baseline FALCON model. Our framework achieves 33%\nimprovements in success rate metric, and 19% improvements in the object level\naccuracy compared to the baseline model. With both of these results we\ndemonstrate the ability of our model to learn tasks and concepts in a continual\nlearning setting on the robot.\n","authors":["Weiwei Gu","Anant Sah","Nakul Gopalan"],"pdf_url":"https://arxiv.org/pdf/2312.13219v1.pdf","comment":"In Proceedings of The 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2312.13216v1","updated":"2023-12-20T17:35:24Z","published":"2023-12-20T17:35:24Z","title":"Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps","summary":" Recent progress in self-supervised representation learning has resulted in\nmodels that are capable of extracting image features that are not only\neffective at encoding image level, but also pixel-level, semantics. These\nfeatures have been shown to be effective for dense visual semantic\ncorrespondence estimation, even outperforming fully-supervised methods.\nNevertheless, current self-supervised approaches still fail in the presence of\nchallenging image characteristics such as symmetries and repeated parts. To\naddress these limitations, we propose a new approach for semantic\ncorrespondence estimation that supplements discriminative self-supervised\nfeatures with 3D understanding via a weak geometric spherical prior. Compared\nto more involved 3D pipelines, our model only requires weak viewpoint\ninformation, and the simplicity of our spherical representation enables us to\ninject informative geometric priors into the model during training. We propose\na new evaluation metric that better accounts for repeated part and\nsymmetry-induced mistakes. We present results on the challenging SPair-71k\ndataset, where we show that our approach demonstrates is capable of\ndistinguishing between symmetric views and repeated parts across many object\ncategories, and also demonstrate that we can generalize to unseen classes on\nthe AwA dataset.\n","authors":["Octave Mariotti","Oisin Mac Aodha","Hakan Bilen"],"pdf_url":"https://arxiv.org/pdf/2312.13216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03087v2","updated":"2023-12-20T17:24:33Z","published":"2022-10-06T17:46:00Z","title":"Iterative Vision-and-Language Navigation","summary":" We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for\nevaluating language-guided agents navigating in a persistent environment over\ntime. Existing Vision-and-Language Navigation (VLN) benchmarks erase the\nagent's memory at the beginning of every episode, testing the ability to\nperform cold-start navigation with no prior information. However, deployed\nrobots occupy the same environment for long periods of time. The IVLN paradigm\naddresses this disparity by training and evaluating VLN agents that maintain\nmemory across tours of scenes that consist of up to 100 ordered\ninstruction-following Room-to-Room (R2R) episodes, each defined by an\nindividual language instruction and a target path. We present discrete and\ncontinuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours\neach in 80 indoor scenes. We find that extending the implicit memory of\nhigh-performing transformer VLN agents is not sufficient for IVLN, but agents\nthat build maps can benefit from environment persistence, motivating a renewed\nfocus on map-building agents in VLN.\n","authors":["Jacob Krantz","Shurjo Banerjee","Wang Zhu","Jason Corso","Peter Anderson","Stefan Lee","Jesse Thomason"],"pdf_url":"https://arxiv.org/pdf/2210.03087v2.pdf","comment":"Accepted by CVPR 2023"},{"id":"http://arxiv.org/abs/2312.13162v1","updated":"2023-12-20T16:23:48Z","published":"2023-12-20T16:23:48Z","title":"Brain-Inspired Visual Odometry: Balancing Speed and Interpretability\n through a System of Systems Approach","summary":" In this study, we address the critical challenge of balancing speed and\naccuracy while maintaining interpretablity in visual odometry (VO) systems, a\npivotal aspect in the field of autonomous navigation and robotics. Traditional\nVO systems often face a trade-off between computational speed and the precision\nof pose estimation. To tackle this issue, we introduce an innovative system\nthat synergistically combines traditional VO methods with a specifically\ntailored fully connected network (FCN). Our system is unique in its approach to\nhandle each degree of freedom independently within the FCN, placing a strong\nemphasis on causal inference to enhance interpretability. This allows for a\ndetailed and accurate assessment of relative pose error (RPE) across various\ndegrees of freedom, providing a more comprehensive understanding of parameter\nvariations and movement dynamics in different environments. Notably, our system\ndemonstrates a remarkable improvement in processing speed without compromising\naccuracy. In certain scenarios, it achieves up to a 5% reduction in Root Mean\nSquare Error (RMSE), showcasing its ability to effectively bridge the gap\nbetween speed and accuracy that has long been a limitation in VO research. This\nadvancement represents a significant step forward in developing more efficient\nand reliable VO systems, with wide-ranging applications in real-time navigation\nand robotic systems.\n","authors":["Habib Boloorchi Tabrizi","Christopher Crick"],"pdf_url":"https://arxiv.org/pdf/2312.13162v1.pdf","comment":"https://www.american-cse.org/csci2023 is website of conference and\n conference name is CSCI2023"},{"id":"http://arxiv.org/abs/2304.02150v2","updated":"2023-12-20T16:15:43Z","published":"2023-04-04T22:45:50Z","title":"Re-Evaluating LiDAR Scene Flow for Autonomous Driving","summary":" Popular benchmarks for self-supervised LiDAR scene flow (stereoKITTI, and\nFlyingThings3D) have unrealistic rates of dynamic motion, unrealistic\ncorrespondences, and unrealistic sampling patterns. As a result, progress on\nthese benchmarks is misleading and may cause researchers to focus on the wrong\nproblems. We evaluate a suite of top methods on a suite of real-world datasets\n(Argoverse 2.0, Waymo, and NuScenes) and report several conclusions. First, we\nfind that performance on stereoKITTI is negatively correlated with performance\non real-world data. Second, we find that one of this task's key components --\nremoving the dominant ego-motion -- is better solved by classic ICP than any\ntested method. Finally, we show that despite the emphasis placed on learning,\nmost performance gains are caused by pre- and post-processing steps:\npiecewise-rigid refinement and ground removal. We demonstrate this through a\nbaseline method that combines these processing steps with a learning-free\ntest-time flow optimization. This baseline outperforms every evaluated method.\n","authors":["Nathaniel Chodosh","Deva Ramanan","Simon Lucey"],"pdf_url":"https://arxiv.org/pdf/2304.02150v2.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2312.13150v1","updated":"2023-12-20T16:14:58Z","published":"2023-12-20T16:14:58Z","title":"Splatter Image: Ultra-Fast Single-View 3D Reconstruction","summary":" We introduce the Splatter Image, an ultra-fast approach for monocular 3D\nobject reconstruction which operates at 38 FPS. Splatter Image is based on\nGaussian Splatting, which has recently brought real-time rendering, fast\ntraining, and excellent scaling to multi-view reconstruction. For the first\ntime, we apply Gaussian Splatting in a monocular reconstruction setting. Our\napproach is learning-based, and, at test time, reconstruction only requires the\nfeed-forward evaluation of a neural network. The main innovation of Splatter\nImage is the surprisingly straightforward design: it uses a 2D image-to-image\nnetwork to map the input image to one 3D Gaussian per pixel. The resulting\nGaussians thus have the form of an image, the Splatter Image. We further extend\nthe method to incorporate more than one image as input, which we do by adding\ncross-view attention. Owning to the speed of the renderer (588 FPS), we can use\na single GPU for training while generating entire images at each iteration in\norder to optimize perceptual metrics like LPIPS. On standard benchmarks, we\ndemonstrate not only fast reconstruction but also better results than recent\nand much more expensive baselines in terms of PSNR, LPIPS, and other metrics.\n","authors":["Stanislaw Szymanowicz","Christian Rupprecht","Andrea Vedaldi"],"pdf_url":"https://arxiv.org/pdf/2312.13150v1.pdf","comment":"Project page: https://szymanowiczs.github.io/splatter-image.html .\n Code: https://github.com/szymanowiczs/splatter-image"},{"id":"http://arxiv.org/abs/2209.14719v3","updated":"2023-12-20T16:08:32Z","published":"2022-09-29T12:26:18Z","title":"In Search of Projectively Equivariant Networks","summary":" Equivariance of linear neural network layers is well studied. In this work,\nwe relax the equivariance condition to only be true in a projective sense. We\npropose a way to construct a projectively equivariant neural network through\nbuilding a standard equivariant network where the linear group representations\nacting on each intermediate feature space are \"multiplicatively modified lifts\"\nof projective group representations. By theoretically studying the relation of\nprojectively and linearly equivariant linear layers, we show that our approach\nis the most general possible when building a network out of linear layers. The\ntheory is showcased in two simple experiments.\n","authors":["Georg Bökman","Axel Flinth","Fredrik Kahl"],"pdf_url":"https://arxiv.org/pdf/2209.14719v3.pdf","comment":"v3: Another significant rewrite. Accepted for publication in TMLR.\n v2: Significant rewrite. The title has been changed: \"neural network\" ->\n \"network\". More general description of projectively equivariant linear\n layers, with new proposed architectures, and a completely new accompanying\n experiment section, as a result"},{"id":"http://arxiv.org/abs/2312.13139v1","updated":"2023-12-20T16:00:43Z","published":"2023-12-20T16:00:43Z","title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot\n Manipulation","summary":" Generative pre-trained models have demonstrated remarkable effectiveness in\nlanguage and vision domains by learning useful representations. In this paper,\nwe extend the scope of this effectiveness by showing that visual robot\nmanipulation can significantly benefit from large-scale video generative\npre-training. We introduce GR-1, a straightforward GPT-style model designed for\nmulti-task language-conditioned visual robot manipulation. GR-1 takes as inputs\na language instruction, a sequence of observation images, and a sequence of\nrobot states. It predicts robot actions as well as future images in an\nend-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly\nfinetuned on robot data after pre-trained on a large-scale video dataset. We\nperform extensive experiments on the challenging CALVIN benchmark and a real\nrobot. On CALVIN benchmark, our method outperforms state-of-the-art baseline\nmethods and improves the success rate from 88.9% to 94.9%. In the setting of\nzero-shot unseen scene generalization, GR-1 improves the success rate from\n53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline\nmethods and shows strong potentials in generalization to unseen scenes and\nobjects. We provide inaugural evidence that a unified GPT-style transformer,\naugmented with large-scale video generative pre-training, exhibits remarkable\ngeneralization to multi-task visual robot manipulation. Project page:\nhttps://GR1-Manipulation.github.io\n","authors":["Hongtao Wu","Ya Jing","Chilam Cheang","Guangzeng Chen","Jiafeng Xu","Xinghang Li","Minghuan Liu","Hang Li","Tao Kong"],"pdf_url":"https://arxiv.org/pdf/2312.13139v1.pdf","comment":"Project page: https://GR1-Manipulation.github.io"},{"id":"http://arxiv.org/abs/2311.13073v2","updated":"2023-12-20T15:58:26Z","published":"2023-11-22T00:26:15Z","title":"FusionFrames: Efficient Architectural Aspects for Text-to-Video\n Generation Pipeline","summary":" Multimedia generation approaches occupy a prominent place in artificial\nintelligence research. Text-to-image models achieved high-quality results over\nthe last few years. However, video synthesis methods recently started to\ndevelop. This paper presents a new two-stage latent diffusion text-to-video\ngeneration architecture based on the text-to-image diffusion model. The first\nstage concerns keyframes synthesis to figure the storyline of a video, while\nthe second one is devoted to interpolation frames generation to make movements\nof the scene and objects smooth. We compare several temporal conditioning\napproaches for keyframes generation. The results show the advantage of using\nseparate temporal blocks over temporal layers in terms of metrics reflecting\nvideo generation quality aspects and human preference. The design of our\ninterpolation model significantly reduces computational costs compared to other\nmasked frame interpolation approaches. Furthermore, we evaluate different\nconfigurations of MoVQ-based video decoding scheme to improve consistency and\nachieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our\npipeline with existing solutions and achieve top-2 scores overall and top-1\namong open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:\nhttps://ai-forever.github.io/kandinsky-video/\n","authors":["Vladimir Arkhipkin","Zein Shaheen","Viacheslav Vasilev","Elizaveta Dakhova","Andrey Kuznetsov","Denis Dimitrov"],"pdf_url":"https://arxiv.org/pdf/2311.13073v2.pdf","comment":"Project page: https://ai-forever.github.io/kandinsky-video/"},{"id":"http://arxiv.org/abs/2312.13127v1","updated":"2023-12-20T15:47:21Z","published":"2023-12-20T15:47:21Z","title":"Pixel-to-Abundance Translation: Conditional Generative Adversarial\n Networks Based on Patch Transformer for Hyperspectral Unmixing","summary":" Spectral unmixing is a significant challenge in hyperspectral image\nprocessing. Existing unmixing methods utilize prior knowledge about the\nabundance distribution to solve the regularization optimization problem, where\nthe difficulty lies in choosing appropriate prior knowledge and solving the\ncomplex regularization optimization problem. To solve these problems, we\npropose a hyperspectral conditional generative adversarial network (HyperGAN)\nmethod as a generic unmixing framework, based on the following assumption: the\nunmixing process from pixel to abundance can be regarded as a transformation of\ntwo modalities with an internal specific relationship. The proposed HyperGAN is\ncomposed of a generator and discriminator, the former completes the modal\nconversion from mixed hyperspectral pixel patch to the abundance of\ncorresponding endmember of the central pixel and the latter is used to\ndistinguish whether the distribution and structure of generated abundance are\nthe same as the true ones. We propose hyperspectral image (HSI) Patch\nTransformer as the main component of the generator, which utilize adaptive\nattention score to capture the internal pixels correlation of the HSI patch and\nleverage the spatial-spectral information in a fine-grained way to achieve\noptimization of the unmixing process. Experiments on synthetic data and real\nhyperspectral data achieve impressive results compared to state-of-the-art\ncompetitors.\n","authors":["Li Wang","Xiaohua Zhang","Longfei Li","Hongyun Meng","Xianghai Cao"],"pdf_url":"https://arxiv.org/pdf/2312.13127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13116v1","updated":"2023-12-20T15:36:30Z","published":"2023-12-20T15:36:30Z","title":"VSR-Net: Vessel-like Structure Rehabilitation Network with Graph\n Clustering","summary":" The morphologies of vessel-like structures, such as blood vessels and nerve\nfibres, play significant roles in disease diagnosis, e.g., Parkinson's disease.\nDeep network-based refinement segmentation methods have recently achieved\npromising vessel-like structure segmentation results. There are still two\nchallenges: (1) existing methods have limitations in rehabilitating subsection\nruptures in segmented vessel-like structures; (2) they are often overconfident\nin predicted segmentation results. To tackle these two challenges, this paper\nattempts to leverage the potential of spatial interconnection relationships\namong subsection ruptures from the structure rehabilitation perspective. Based\non this, we propose a novel Vessel-like Structure Rehabilitation Network\n(VSR-Net) to rehabilitate subsection ruptures and improve the model calibration\nbased on coarse vessel-like structure segmentation results. VSR-Net first\nconstructs subsection rupture clusters with Curvilinear Clustering Module\n(CCM). Then, the well-designed Curvilinear Merging Module (CMM) is applied to\nrehabilitate the subsection ruptures to obtain the refined vessel-like\nstructures. Extensive experiments on five 2D/3D medical image datasets show\nthat VSR-Net significantly outperforms state-of-the-art (SOTA) refinement\nsegmentation methods with lower calibration error. Additionally, we provide\nquantitative analysis to explain the morphological difference between the\nrehabilitation results of VSR-Net and ground truth (GT), which is smaller than\nSOTA methods and GT, demonstrating that our method better rehabilitates\nvessel-like structures by restoring subsection ruptures.\n","authors":["Haili Ye","Xiaoqing Zhang","Yan Hu","Huazhu Fu","Jiang Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13116v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13114v1","updated":"2023-12-20T15:34:15Z","published":"2023-12-20T15:34:15Z","title":"Investigating Color Illusions from the Perspective of Computational\n Color Constancy","summary":" Color constancy and color illusion perception are two phenomena occurring in\nthe human visual system, which can help us reveal unknown mechanisms of human\nperception. For decades computer vision scientists have developed numerous\ncolor constancy methods, which estimate the reflectance of the surface by\ndiscounting the illuminant. However, color illusions have not been analyzed in\ndetail in the field of computational color constancy, which we find surprising\nsince the relationship they share is significant and may let us design more\nrobust systems. We argue that any model that can reproduce our sensation on\ncolor illusions should also be able to provide pixel-wise estimates of the\nlight source. In other words, we suggest that the analysis of color illusions\nhelps us to improve the performance of the existing global color constancy\nmethods, and enable them to provide pixel-wise estimates for scenes illuminated\nby multiple light sources. In this study, we share the outcomes of our\ninvestigation in which we take several color constancy methods and modify them\nto reproduce the behavior of the human visual system on color illusions. Also,\nwe show that parameters purely extracted from illusions are able to improve the\nperformance of color constancy methods. A noteworthy outcome is that our\nstrategy based on the investigation of color illusions outperforms the\nstate-of-the-art methods that are specifically designed to transform global\ncolor constancy algorithms into multi-illuminant algorithms.\n","authors":["Oguzhan Ulucan","Diclehan Ulucan","Marc Ebner"],"pdf_url":"https://arxiv.org/pdf/2312.13114v1.pdf","comment":"This work is accepted at VISAPP 2024 as a long paper"},{"id":"http://arxiv.org/abs/2312.13108v1","updated":"2023-12-20T15:28:38Z","published":"2023-12-20T15:28:38Z","title":"ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation","summary":" Graphical User Interface (GUI) automation holds significant promise for\nassisting users with complex tasks, thereby boosting human productivity.\nExisting works leveraging Large Language Model (LLM) or LLM-based AI agents\nhave shown capabilities in automating tasks on Android and Web platforms.\nHowever, these tasks are primarily aimed at simple device usage and\nentertainment operations. This paper presents a novel benchmark, AssistGUI, to\nevaluate whether models are capable of manipulating the mouse and keyboard on\nthe Windows platform in response to user-requested tasks. We carefully\ncollected a set of 100 tasks from nine widely-used software applications, such\nas, After Effects and MS Word, each accompanied by the necessary project files\nfor better evaluation. Moreover, we propose an advanced Actor-Critic Embodied\nAgent framework, which incorporates a sophisticated GUI parser driven by an\nLLM-agent and an enhanced reasoning mechanism adept at handling lengthy\nprocedural tasks. Our experimental results reveal that our GUI Parser and\nReasoning mechanism outshine existing methods in performance. Nevertheless, the\npotential remains substantial, with the best model attaining only a 46% success\nrate on our benchmark. We conclude with a thorough analysis of the current\nmethods' limitations, setting the stage for future breakthroughs in this\ndomain.\n","authors":["Difei Gao","Lei Ji","Zechen Bai","Mingyu Ouyang","Peiran Li","Dongxing Mao","Qinchen Wu","Weichen Zhang","Peiyi Wang","Xiangwu Guo","Hengxu Wang","Luowei Zhou","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2312.13108v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.02464v2","updated":"2023-12-20T15:26:34Z","published":"2023-12-05T03:33:47Z","title":"SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object\n and Boundary Constraints","summary":" Semantic segmentation of remote sensing imagery plays a pivotal role in\nextracting precise information for diverse down-stream applications. Recent\ndevelopment of the Segment Anything Model (SAM), an advanced general-purpose\nsegmentation model, has revolutionized this field, presenting new avenues for\naccurate and efficient segmentation. However, SAM is limited to generating\nsegmentation results without class information. Consequently, the utilization\nof such a powerful general vision model for semantic segmentation in remote\nsensing images has become a focal point of research. In this paper, we present\na streamlined framework aimed at leveraging the raw output of SAM by exploiting\ntwo novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary\n(SGB). More specifically, we propose a novel object loss and further introduce\na boundary loss as augmentative components to aid in model optimization in a\ngeneral semantic segmentation framework. Taking into account the content\ncharacteristics of SGO, we introduce the concept of object consistency to\nleverage segmented regions lacking semantic information. By imposing\nconstraints on the consistency of predicted values within objects, the object\nloss aims to enhance semantic segmentation performance. Furthermore, the\nboundary loss capitalizes on the distinctive features of SGB by directing the\nmodel's attention to the boundary information of the object. Experimental\nresults on two well-known datasets, namely ISPRS Vaihingen and LoveDA Urban,\ndemonstrate the effectiveness of our proposed method. The source code for this\nwork will be accessible at https://github.com/sstary/SSRS.\n","authors":["Xianping Ma","Qianqian Wu","Xingyu Zhao","Xiaokang Zhang","Man-On Pun","Bo Huang"],"pdf_url":"https://arxiv.org/pdf/2312.02464v2.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.13104v1","updated":"2023-12-20T15:22:34Z","published":"2023-12-20T15:22:34Z","title":"Optimizing Ego Vehicle Trajectory Prediction: The Graph Enhancement\n Approach","summary":" Predicting the trajectory of an ego vehicle is a critical component of\nautonomous driving systems. Current state-of-the-art methods typically rely on\nDeep Neural Networks (DNNs) and sequential models to process front-view images\nfor future trajectory prediction. However, these approaches often struggle with\nperspective issues affecting object features in the scene. To address this, we\nadvocate for the use of Bird's Eye View (BEV) perspectives, which offer unique\nadvantages in capturing spatial relationships and object homogeneity. In our\nwork, we leverage Graph Neural Networks (GNNs) and positional encoding to\nrepresent objects in a BEV, achieving competitive performance compared to\ntraditional DNN-based methods. While the BEV-based approach loses some detailed\ninformation inherent to front-view images, we balance this by enriching the BEV\ndata by representing it as a graph where relationships between the objects in a\nscene are captured effectively.\n","authors":["Sushil Sharma","Aryan Singh","Ganesh Sistu","Mark Halton","Ciarán Eising"],"pdf_url":"https://arxiv.org/pdf/2312.13104v1.pdf","comment":"Accepted for publication in the Electronic Imagine Autonomous\n Vehicles and Machines (EI-AVM) Conference"},{"id":"http://arxiv.org/abs/2312.13103v1","updated":"2023-12-20T15:20:33Z","published":"2023-12-20T15:20:33Z","title":"Exploring Multimodal Large Language Models for Radiology Report\n Error-checking","summary":" This paper proposes one of the first clinical applications of multimodal\nlarge language models (LLMs) as an assistant for radiologists to check errors\nin their reports. We created an evaluation dataset from two real-world\nradiology datasets (MIMIC-CXR and IU-Xray), with 1,000 subsampled reports each.\nA subset of original reports was modified to contain synthetic errors by\nintroducing various type of mistakes. The evaluation contained two difficulty\nlevels: SIMPLE for binary error-checking and COMPLEX for identifying error\ntypes. LLaVA (Large Language and Visual Assistant) variant models, including\nour instruction-tuned model, were used for the evaluation. Additionally, a\ndomain expert evaluation was conducted on a small test set. At the SIMPLE\nlevel, the LLaVA v1.5 model outperformed other publicly available models.\nInstruction tuning significantly enhanced performance by 47.4% and 25.4% on\nMIMIC-CXR and IU-Xray data, respectively. The model also surpassed the domain\nexperts accuracy in the MIMIC-CXR dataset by 1.67%. Notably, among the subsets\n(N=21) of the test set where a clinician did not achieve the correct\nconclusion, the LLaVA ensemble mode correctly identified 71.4% of these cases.\nThis study marks a promising step toward utilizing multi-modal LLMs to enhance\ndiagnostic accuracy in radiology. The ensemble model demonstrated comparable\nperformance to clinicians, even capturing errors overlooked by humans.\nNevertheless, future work is needed to improve the model ability to identify\nthe types of inconsistency.\n","authors":["Jinge Wu","Yunsoo Kim","Eva C. Keller","Jamie Chow","Adam P. Levine","Nikolas Pontikos","Zina Ibrahim","Paul Taylor","Michelle C. Williams","Honghan Wu"],"pdf_url":"https://arxiv.org/pdf/2312.13103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13102v1","updated":"2023-12-20T15:20:25Z","published":"2023-12-20T15:20:25Z","title":"SpecNeRF: Gaussian Directional Encoding for Specular Reflections","summary":" Neural radiance fields have achieved remarkable performance in modeling the\nappearance of 3D scenes. However, existing approaches still struggle with the\nview-dependent appearance of glossy surfaces, especially under complex lighting\nof indoor environments. Unlike existing methods, which typically assume distant\nlighting like an environment map, we propose a learnable Gaussian directional\nencoding to better model the view-dependent effects under near-field lighting\nconditions. Importantly, our new directional encoding captures the\nspatially-varying nature of near-field lighting and emulates the behavior of\nprefiltered environment maps. As a result, it enables the efficient evaluation\nof preconvolved specular color at any 3D location with varying roughness\ncoefficients. We further introduce a data-driven geometry prior that helps\nalleviate the shape radiance ambiguity in reflection modeling. We show that our\nGaussian directional encoding and geometry prior significantly improve the\nmodeling of challenging specular reflections in neural radiance fields, which\nhelps decompose appearance into more physically meaningful components.\n","authors":["Li Ma","Vasu Agrawal","Haithem Turki","Changil Kim","Chen Gao","Pedro Sander","Michael Zollhöfer","Christian Richardt"],"pdf_url":"https://arxiv.org/pdf/2312.13102v1.pdf","comment":"Project page: https://limacv.github.io/SpecNeRF_web/"},{"id":"http://arxiv.org/abs/2312.13100v1","updated":"2023-12-20T15:18:51Z","published":"2023-12-20T15:18:51Z","title":"SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized\n Zero-Shot Learning","summary":" Generalized Zero-Shot Learning (GZSL) recognizes unseen classes by\ntransferring knowledge from the seen classes, depending on the inherent\ninteractions between visual and semantic data. However, the discrepancy between\nwell-prepared training data and unpredictable real-world test scenarios remains\na significant challenge. This paper introduces a dual strategy to address the\ngeneralization gap. Firstly, we incorporate semantic information through an\ninnovative encoder. This encoder effectively integrates class-specific semantic\ninformation by targeting the performance disparity, enhancing the produced\nfeatures to enrich the semantic space for class-specific attributes. Secondly,\nwe refine our generative capabilities using a novel compositional loss\nfunction. This approach generates discriminative classes, effectively\nclassifying both seen and unseen classes. In addition, we extend the\nexploitation of the learned latent space by utilizing controlled semantic\ninputs, ensuring the robustness of the model in varying environments. This\napproach yields a model that outperforms the state-of-the-art models in terms\nof both generalization and diverse settings, notably without requiring\nhyperparameter tuning or domain-specific adaptations. We also propose a set of\nnovel evaluation metrics to provide a more detailed assessment of the\nreliability and reproducibility of the results. The complete code is made\navailable on https://github.com/william-heyden/SEER-ZeroShotLearning/.\n","authors":["William Heyden","Habib Ullah","M. Salman Siddiqui","Fadi Al Machot"],"pdf_url":"https://arxiv.org/pdf/2312.13100v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13091v1","updated":"2023-12-20T15:12:53Z","published":"2023-12-20T15:12:53Z","title":"MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using\n Differentiable Shading","summary":" Reconstructing an avatar from a portrait image has many applications in\nmultimedia, but remains a challenging research problem. Extracting reflectance\nmaps and geometry from one image is ill-posed: recovering geometry is a\none-to-many mapping problem and reflectance and light are difficult to\ndisentangle. Accurate geometry and reflectance can be captured under the\ncontrolled conditions of a light stage, but it is costly to acquire large\ndatasets in this fashion. Moreover, training solely with this type of data\nleads to poor generalization with in-the-wild images. This motivates the\nintroduction of MoSAR, a method for 3D avatar generation from monocular images.\nWe propose a semi-supervised training scheme that improves generalization by\nlearning from both light stage and in-the-wild datasets. This is achieved using\na novel differentiable shading formulation. We show that our approach\neffectively disentangles the intrinsic face parameters, producing relightable\navatars. As a result, MoSAR estimates a richer set of skin reflectance maps,\nand generates more realistic avatars than existing state-of-the-art methods. We\nalso introduce a new dataset, named FFHQ-UV-Intrinsics, the first public\ndataset providing intrisic face attributes at scale (diffuse, specular, ambient\nocclusion and translucency maps) for a total of 10k subjects. The project\nwebsite and the dataset are available on the following link:\nhttps://ubisoftlaforge.github.io/character/mosar\n","authors":["Abdallah Dib","Luiz Gustavo Hafemann","Emeline Got","Trevor Anderson","Amin Fadaeinejad","Rafael M. O. Cruz","Marc-Andre Carbonneau"],"pdf_url":"https://arxiv.org/pdf/2312.13091v1.pdf","comment":"https://ubisoft-laforge.github.io/character/mosar/"},{"id":"http://arxiv.org/abs/2312.13090v1","updated":"2023-12-20T15:12:27Z","published":"2023-12-20T15:12:27Z","title":"Perception Test 2023: A Summary of the First Challenge And Outcome","summary":" The First Perception Test challenge was held as a half-day workshop alongside\nthe IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the\ngoal of benchmarking state-of-the-art video models on the recently proposed\nPerception Test benchmark. The challenge had six tracks covering low-level and\nhigh-level tasks, with both a language and non-language interface, across\nvideo, audio, and text modalities, and covering: object tracking, point\ntracking, temporal action localisation, temporal sound localisation,\nmultiple-choice video question-answering, and grounded video\nquestion-answering. We summarise in this report the task descriptions, metrics,\nbaselines, and results.\n","authors":["Joseph Heyward","João Carreira","Dima Damen","Andrew Zisserman","Viorica Pătrăucean"],"pdf_url":"https://arxiv.org/pdf/2312.13090v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15409v2","updated":"2023-12-20T15:08:11Z","published":"2023-07-28T09:03:06Z","title":"Uncertainty-aware Unsupervised Multi-Object Tracking","summary":" Without manually annotated identities, unsupervised multi-object trackers are\ninferior to learning reliable feature embeddings. It causes the\nsimilarity-based inter-frame association stage also be error-prone, where an\nuncertainty problem arises. The frame-by-frame accumulated uncertainty prevents\ntrackers from learning the consistent feature embedding against time variation.\nTo avoid this uncertainty problem, recent self-supervised techniques are\nadopted, whereas they failed to capture temporal relations. The interframe\nuncertainty still exists. In fact, this paper argues that though the\nuncertainty problem is inevitable, it is possible to leverage the uncertainty\nitself to improve the learned consistency in turn. Specifically, an\nuncertainty-based metric is developed to verify and rectify the risky\nassociations. The resulting accurate pseudo-tracklets boost learning the\nfeature consistency. And accurate tracklets can incorporate temporal\ninformation into spatial transformation. This paper proposes a tracklet-guided\naugmentation strategy to simulate tracklets' motion, which adopts a\nhierarchical uncertainty-based sampling mechanism for hard sample mining. The\nultimate unsupervised MOT framework, namely U2MOT, is proven effective on\nMOT-Challenges and VisDrone-MOT benchmark. U2MOT achieves a SOTA performance\namong the published supervised and unsupervised trackers.\n","authors":["Kai Liu","Sheng Jin","Zhihang Fu","Ze Chen","Rongxin Jiang","Jieping Ye"],"pdf_url":"https://arxiv.org/pdf/2307.15409v2.pdf","comment":"Accepted by International Conference on Computer Vision (ICCV) 2023.\n Code is available at https://github.com/alibaba/u2mot/"},{"id":"http://arxiv.org/abs/2312.13081v1","updated":"2023-12-20T15:02:37Z","published":"2023-12-20T15:02:37Z","title":"BEVSeg2TP: Surround View Camera Bird's-Eye-View Based Joint Vehicle\n Segmentation and Ego Vehicle Trajectory Prediction","summary":" Trajectory prediction is, naturally, a key task for vehicle autonomy. While\nthe number of traffic rules is limited, the combinations and uncertainties\nassociated with each agent's behaviour in real-world scenarios are nearly\nimpossible to encode. Consequently, there is a growing interest in\nlearning-based trajectory prediction. The proposed method in this paper\npredicts trajectories by considering perception and trajectory prediction as a\nunified system. In considering them as unified tasks, we show that there is the\npotential to improve the performance of perception. To achieve these goals, we\npresent BEVSeg2TP - a surround-view camera bird's-eye-view-based joint vehicle\nsegmentation and ego vehicle trajectory prediction system for autonomous\nvehicles. The proposed system uses a network trained on multiple camera views.\nThe images are transformed using several deep learning techniques to perform\nsemantic segmentation of objects, including other vehicles, in the scene. The\nsegmentation outputs are fused across the camera views to obtain a\ncomprehensive representation of the surrounding vehicles from the\nbird's-eye-view perspective. The system further predicts the future trajectory\nof the ego vehicle using a spatiotemporal probabilistic network (STPN) to\noptimize trajectory prediction. This network leverages information from\nencoder-decoder transformers and joint vehicle segmentation.\n","authors":["Sushil Sharma","Arindam Das","Ganesh Sistu","Mark Halton","Ciarán Eising"],"pdf_url":"https://arxiv.org/pdf/2312.13081v1.pdf","comment":"Accepted for publication in the International Conference on Computer\n Vision Theory and Applications (VISAPP) 2024"},{"id":"http://arxiv.org/abs/2312.13071v1","updated":"2023-12-20T14:52:07Z","published":"2023-12-20T14:52:07Z","title":"Point Deformable Network with Enhanced Normal Embedding for Point Cloud\n Analysis","summary":" Recently MLP-based methods have shown strong performance in point cloud\nanalysis. Simple MLP architectures are able to learn geometric features in\nlocal point groups yet fail to model long-range dependencies directly. In this\npaper, we propose Point Deformable Network (PDNet), a concise MLP-based network\nthat can capture long-range relations with strong representation ability.\nSpecifically, we put forward Point Deformable Aggregation Module (PDAM) to\nimprove representation capability in both long-range dependency and adaptive\naggregation among points. For each query point, PDAM aggregates information\nfrom deformable reference points rather than points in limited local areas. The\ndeformable reference points are generated data-dependent, and we initialize\nthem according to the input point positions. Additional offsets and modulation\nscalars are learned on the whole point features, which shift the deformable\nreference points to the regions of interest. We also suggest estimating the\nnormal vector for point clouds and applying Enhanced Normal Embedding (ENE) to\nthe geometric extractors to improve the representation ability of single-point.\nExtensive experiments and ablation studies on various benchmarks demonstrate\nthe effectiveness and superiority of our PDNet.\n","authors":["Xingyilang Yin","Xi Yang","Liangchen Liu","Nannan Wang","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2312.13071v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13066v1","updated":"2023-12-20T14:45:57Z","published":"2023-12-20T14:45:57Z","title":"PPEA-Depth: Progressive Parameter-Efficient Adaptation for\n Self-Supervised Monocular Depth Estimation","summary":" Self-supervised monocular depth estimation is of significant importance with\napplications spanning across autonomous driving and robotics. However, the\nreliance on self-supervision introduces a strong static-scene assumption,\nthereby posing challenges in achieving optimal performance in dynamic scenes,\nwhich are prevalent in most real-world situations. To address these issues, we\npropose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to\ntransfer a pre-trained image model for self-supervised depth estimation. The\ntraining comprises two sequential stages: an initial phase trained on a dataset\nprimarily composed of static scenes, succeeded by an expansion to more\nintricate datasets involving dynamic scenes. To facilitate this process, we\ndesign compact encoder and decoder adapters to enable parameter-efficient\ntuning, allowing the network to adapt effectively. They not only uphold\ngeneralized patterns from pre-trained image models but also retain knowledge\ngained from the preceding phase into the subsequent one. Extensive experiments\ndemonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI,\nCityScapes and DDAD datasets.\n","authors":["Yue-Jiang Dong","Yuan-Chen Guo","Ying-Tian Liu","Fang-Lue Zhang","Song-Hai Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13066v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2311.14521v4","updated":"2023-12-20T14:35:27Z","published":"2023-11-24T14:46:59Z","title":"GaussianEditor: Swift and Controllable 3D Editing with Gaussian\n Splatting","summary":" 3D editing plays a crucial role in many areas such as gaming and virtual\nreality. Traditional 3D editing methods, which rely on representations like\nmeshes and point clouds, often fall short in realistically depicting complex\nscenes. On the other hand, methods based on implicit 3D representations, like\nNeural Radiance Field (NeRF), render complex scenes effectively but suffer from\nslow processing speeds and limited control over specific scene areas. In\nresponse to these challenges, our paper presents GaussianEditor, an innovative\nand efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D\nrepresentation. GaussianEditor enhances precision and control in editing\nthrough our proposed Gaussian semantic tracing, which traces the editing target\nthroughout the training process. Additionally, we propose Hierarchical Gaussian\nsplatting (HGS) to achieve stabilized and fine results under stochastic\ngenerative guidance from 2D diffusion models. We also develop editing\nstrategies for efficient object removal and integration, a challenging task for\nexisting methods. Our comprehensive experiments demonstrate GaussianEditor's\nsuperior control, efficacy, and rapid performance, marking a significant\nadvancement in 3D editing. Project Page:\nhttps://buaacyw.github.io/gaussian-editor/\n","authors":["Yiwen Chen","Zilong Chen","Chi Zhang","Feng Wang","Xiaofeng Yang","Yikai Wang","Zhongang Cai","Lei Yang","Huaping Liu","Guosheng Lin"],"pdf_url":"https://arxiv.org/pdf/2311.14521v4.pdf","comment":"Project Page: https://buaacyw.github.io/gaussian-editor/ Code:\n https://github.com/buaacyw/GaussianEditor"},{"id":"http://arxiv.org/abs/2312.13053v1","updated":"2023-12-20T14:26:54Z","published":"2023-12-20T14:26:54Z","title":"Quantifying Bias in Text-to-Image Generative Models","summary":" Bias in text-to-image (T2I) models can propagate unfair social\nrepresentations and may be used to aggressively market ideas or push\ncontroversial agendas. Existing T2I model bias evaluation methods only focus on\nsocial biases. We look beyond that and instead propose an evaluation\nmethodology to quantify general biases in T2I generative models, without any\npreconceived notions. We assess four state-of-the-art T2I models and compare\ntheir baseline bias characteristics to their respective variants (two for\neach), where certain biases have been intentionally induced. We propose three\nevaluation metrics to assess model biases including: (i) Distribution bias,\n(ii) Jaccard hallucination and (iii) Generative miss-rate. We conduct two\nevaluation studies, modelling biases under general, and task-oriented\nconditions, using a marketing scenario as the domain for the latter. We also\nquantify social biases to compare our findings to related works. Finally, our\nmethodology is transferred to evaluate captioned-image datasets and measure\ntheir bias. Our approach is objective, domain-agnostic and consistently\nmeasures different forms of T2I model biases. We have developed a web\napplication and practical implementation of what has been proposed in this\nwork, which is at https://huggingface.co/spaces/JVice/try-before-you-bias. A\nvideo series with demonstrations is available at\nhttps://www.youtube.com/channel/UCk-0xyUyT0MSd_hkp4jQt1Q\n","authors":["Jordan Vice","Naveed Akhtar","Richard Hartley","Ajmal Mian"],"pdf_url":"https://arxiv.org/pdf/2312.13053v1.pdf","comment":"main manuscript = 9 pages, 6 tables, 4 figures. Supplementary\n material = 15 pages, 13 tables, 14 figures"},{"id":"http://arxiv.org/abs/2307.13986v2","updated":"2023-12-20T14:24:17Z","published":"2023-07-26T06:52:29Z","title":"Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in\n Musculoskeletal Segmentation of Lower Extremities","summary":" Purpose: Manual annotations for training deep learning (DL) models in\nauto-segmentation are time-intensive. This study introduces a hybrid\nrepresentation-enhanced sampling strategy that integrates both density and\ndiversity criteria within an uncertainty-based Bayesian active learning (BAL)\nframework to reduce annotation efforts by selecting the most informative\ntraining samples. Methods: The experiments are performed on two lower extremity\n(LE) datasets of MRI and CT images, focusing on the segmentation of the femur,\npelvis, sacrum, quadriceps femoris, hamstrings, adductors, sartorius, and\niliopsoas, utilizing a U-net-based BAL framework. Our method selects uncertain\nsamples with high density and diversity for manual revision, optimizing for\nmaximal similarity to unlabeled instances and minimal similarity to existing\ntraining data. We assess the accuracy and efficiency using Dice and a proposed\nmetric called reduced annotation cost (RAC), respectively. We further evaluate\nthe impact of various acquisition rules on BAL performance and design an\nablation study for effectiveness estimation. Results: In MRI and CT datasets,\nour method was superior or comparable to existing ones, achieving a 0.8\\% Dice\nand 1.0\\% RAC increase in CT (statistically significant), and a 0.8\\% Dice and\n1.1\\% RAC increase in MRI (not statistically significant) in volume-wise\nacquisition. Our ablation study indicates that combining density and diversity\ncriteria enhances the efficiency of BAL in musculoskeletal segmentation\ncompared to using either criterion alone. Conclusion: Our sampling method is\nproven efficient in reducing annotation costs in image segmentation tasks. The\ncombination of the proposed method and our BAL framework provides a\nsemi-automatic way for efficient annotation of medical image datasets.\n","authors":["Ganping Li","Yoshito Otake","Mazen Soufi","Masashi Taniguchi","Masahide Yagi","Noriaki Ichihashi","Keisuke Uemura","Masaki Takao","Nobuhiko Sugano","Yoshinobu Sato"],"pdf_url":"https://arxiv.org/pdf/2307.13986v2.pdf","comment":"15 pages, 5 figures"},{"id":"http://arxiv.org/abs/2202.02980v4","updated":"2023-12-20T14:16:19Z","published":"2022-02-07T07:12:24Z","title":"3D Object Detection from Images for Autonomous Driving: A Survey","summary":" 3D object detection from images, one of the fundamental and challenging\nproblems in autonomous driving, has received increasing attention from both\nindustry and academia in recent years. Benefiting from the rapid development of\ndeep learning technologies, image-based 3D detection has achieved remarkable\nprogress. Particularly, more than 200 works have studied this problem from 2015\nto 2021, encompassing a broad spectrum of theories, algorithms, and\napplications. However, to date no recent survey exists to collect and organize\nthis knowledge. In this paper, we fill this gap in the literature and provide\nthe first comprehensive survey of this novel and continuously growing research\nfield, summarizing the most commonly used pipelines for image-based 3D\ndetection and deeply analyzing each of their components. Additionally, we also\npropose two new taxonomies to organize the state-of-the-art methods into\ndifferent categories, with the intent of providing a more systematic review of\nexisting methods and facilitating fair comparisons with future works. In\nretrospect of what has been achieved so far, we also analyze the current\nchallenges in the field and discuss future directions for image-based 3D\ndetection research.\n","authors":["Xinzhu Ma","Wanli Ouyang","Andrea Simonelli","Elisa Ricci"],"pdf_url":"https://arxiv.org/pdf/2202.02980v4.pdf","comment":"Accepted by T-PAMI"},{"id":"http://arxiv.org/abs/2303.11048v3","updated":"2023-12-20T14:11:26Z","published":"2023-03-20T11:59:23Z","title":"SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene\n Graph Generation","summary":" In this paper, we propose a novel model called SGFormer, Semantic Graph\nTransFormer for point cloud-based 3D scene graph generation. The task aims to\nparse a point cloud-based scene into a semantic structural graph, with the core\nchallenge of modeling the complex global structure. Existing methods based on\ngraph convolutional networks (GCNs) suffer from the over-smoothing dilemma and\ncan only propagate information from limited neighboring nodes. In contrast,\nSGFormer uses Transformer layers as the base building block to allow global\ninformation passing, with two types of newly-designed layers tailored for the\n3D scene graph generation task. Specifically, we introduce the graph embedding\nlayer to best utilize the global information in graph edges while maintaining\ncomparable computation costs. Furthermore, we propose the semantic injection\nlayer to leverage linguistic knowledge from large-scale language model (i.e.,\nChatGPT), to enhance objects' visual features. We benchmark our SGFormer on the\nestablished 3DSSG dataset and achieve a 40.94% absolute improvement in\nrelationship prediction's R@50 and an 88.36% boost on the subset with complex\nscenes over the state-of-the-art. Our analyses further show SGFormer's\nsuperiority in the long-tail and zero-shot scenarios. Our source code is\navailable at https://github.com/Andy20178/SGFormer.\n","authors":["Changsheng Lv","Mengshi Qi","Xia Li","Zhengyuan Yang","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2303.11048v3.pdf","comment":"To be published in Thirty-Eighth AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2303.12332v2","updated":"2023-12-20T14:08:37Z","published":"2023-03-22T06:08:34Z","title":"Weakly-Supervised Temporal Action Localization by Inferring Salient\n Snippet-Feature","summary":" Weakly-supervised temporal action localization aims to locate action regions\nand identify action categories in untrimmed videos simultaneously by taking\nonly video-level labels as the supervision. Pseudo label generation is a\npromising strategy to solve the challenging problem, but the current methods\nignore the natural temporal structure of the video that can provide rich\ninformation to assist such a generation process. In this paper, we propose a\nnovel weakly-supervised temporal action localization method by inferring\nsalient snippet-feature. First, we design a saliency inference module that\nexploits the variation relationship between temporal neighbor snippets to\ndiscover salient snippet-features, which can reflect the significant dynamic\nchange in the video. Secondly, we introduce a boundary refinement module that\nenhances salient snippet-features through the information interaction unit.\nThen, a discrimination enhancement module is introduced to enhance the\ndiscriminative nature of snippet-features. Finally, we adopt the refined\nsnippet-features to produce high-fidelity pseudo labels, which could be used to\nsupervise the training of the action localization network. Extensive\nexperiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet\nv1.3, demonstrate our proposed method achieves significant improvements\ncompared to the state-of-the-art methods.\n","authors":["Wulian Yun","Mengshi Qi","Chuanming Wang","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2303.12332v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13027v1","updated":"2023-12-20T13:50:26Z","published":"2023-12-20T13:50:26Z","title":"Doubly Perturbed Task-Free Continual Learning","summary":" Task-free online continual learning (TF-CL) is a challenging problem where\nthe model incrementally learns tasks without explicit task information.\nAlthough training with entire data from the past, present as well as future is\nconsidered as the gold standard, naive approaches in TF-CL with the current\nsamples may be conflicted with learning with samples in the future, leading to\ncatastrophic forgetting and poor plasticity. Thus, a proactive consideration of\nan unseen future sample in TF-CL becomes imperative. Motivated by this\nintuition, we propose a novel TF-CL framework considering future samples and\nshow that injecting adversarial perturbations on both input data and\ndecision-making is effective. Then, we propose a novel method named Doubly\nPerturbed Continual Learning (DPCL) to efficiently implement these input and\ndecision-making perturbations. Specifically, for input perturbation, we propose\nan approximate perturbation method that injects noise into the input data as\nwell as the feature vector and then interpolates the two perturbed samples. For\ndecision-making process perturbation, we devise multiple stochastic\nclassifiers. We also investigate a memory management scheme and learning rate\nscheduling reflecting our proposed double perturbations. We demonstrate that\nour proposed method outperforms the state-of-the-art baseline methods by large\nmargins on various TF-CL benchmarks.\n","authors":["Byung Hyun Lee","Min-hwan Oh","Se Young Chun"],"pdf_url":"https://arxiv.org/pdf/2312.13027v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13016v1","updated":"2023-12-20T13:31:11Z","published":"2023-12-20T13:31:11Z","title":"DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View\n Synthesis","summary":" We present DiffPortrait3D, a conditional diffusion model that is capable of\nsynthesizing 3D-consistent photo-realistic novel views from as few as a single\nin-the-wild portrait. Specifically, given a single RGB input, we aim to\nsynthesize plausible but consistent facial details rendered from novel camera\nviews with retained both identity and facial expression. In lieu of\ntime-consuming optimization and fine-tuning, our zero-shot method generalizes\nwell to arbitrary face portraits with unposed camera views, extreme facial\nexpressions, and diverse artistic depictions. At its core, we leverage the\ngenerative prior of 2D diffusion models pre-trained on large-scale image\ndatasets as our rendering backbone, while the denoising is guided with\ndisentangled attentive control of appearance and camera pose. To achieve this,\nwe first inject the appearance context from the reference image into the\nself-attention layers of the frozen UNets. The rendering view is then\nmanipulated with a novel conditional control module that interprets the camera\npose by watching a condition image of a crossed subject from the same view.\nFurthermore, we insert a trainable cross-view attention module to enhance view\nconsistency, which is further strengthened with a novel 3D-aware noise\ngeneration process during inference. We demonstrate state-of-the-art results\nboth qualitatively and quantitatively on our challenging in-the-wild and\nmulti-view benchmarks.\n","authors":["Yuming Gu","Hongyi Xu","You Xie","Guoxian Song","Yichun Shi","Di Chang","Jing Yang","Lingjie Luo"],"pdf_url":"https://arxiv.org/pdf/2312.13016v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13008v1","updated":"2023-12-20T13:20:31Z","published":"2023-12-20T13:20:31Z","title":"No More Shortcuts: Realizing the Potential of Temporal Self-Supervision","summary":" Self-supervised approaches for video have shown impressive results in video\nunderstanding tasks. However, unlike early works that leverage temporal\nself-supervision, current state-of-the-art methods primarily rely on tasks from\nthe image domain (e.g., contrastive learning) that do not explicitly promote\nthe learning of temporal features. We identify two factors that limit existing\ntemporal self-supervision: 1) tasks are too simple, resulting in saturated\ntraining performance, and 2) we uncover shortcuts based on local appearance\nstatistics that hinder the learning of high-level features. To address these\nissues, we propose 1) a more challenging reformulation of temporal\nself-supervision as frame-level (rather than clip-level) recognition tasks and\n2) an effective augmentation strategy to mitigate shortcuts. Our model extends\na representation of single video frames, pre-trained through contrastive\nlearning, with a transformer that we train through temporal self-supervision.\nWe demonstrate experimentally that our more challenging frame-level task\nformulations and the removal of shortcuts drastically improve the quality of\nfeatures learned through temporal self-supervision. The generalization\ncapability of our self-supervised video method is evidenced by its\nstate-of-the-art performance in a wide range of high-level semantic tasks,\nincluding video retrieval, action classification, and video attribute\nrecognition (such as object and scene identification), as well as low-level\ntemporal correspondence tasks like video object segmentation and pose tracking.\nAdditionally, we show that the video representations learned through our method\nexhibit increased robustness to the input perturbations.\n","authors":["Ishan Rajendrakumar Dave","Simon Jenni","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2312.13008v1.pdf","comment":"AAAI 2024 (Main Technical Track)"},{"id":"http://arxiv.org/abs/2312.12995v1","updated":"2023-12-20T12:57:01Z","published":"2023-12-20T12:57:01Z","title":"Aggregating Multiple Bio-Inspired Image Region Classifiers For Effective\n And Lightweight Visual Place Recognition","summary":" Visual place recognition (VPR) enables autonomous systems to localize\nthemselves within an environment using image information. While VPR techniques\nbuilt upon a Convolutional Neural Network (CNN) backbone dominate\nstate-of-the-art VPR performance, their high computational requirements make\nthem unsuitable for platforms equipped with low-end hardware. Recently, a\nlightweight VPR system based on multiple bio-inspired classifiers, dubbed\nDrosoNets, has been proposed, achieving great computational efficiency at the\ncost of reduced absolute place retrieval performance. In this work, we propose\na novel multi-DrosoNet localization system, dubbed RegionDrosoNet, with\nsignificantly improved VPR performance, while preserving a low-computational\nprofile. Our approach relies on specializing distinct groups of DrosoNets on\ndifferently sliced partitions of the original image, increasing extrinsic model\ndifferentiation. Furthermore, we introduce a novel voting module to combine the\noutputs of all DrosoNets into the final place prediction which considers\nmultiple top refence candidates from each DrosoNet. RegionDrosoNet outperforms\nother lightweight VPR techniques when dealing with both appearance changes and\nviewpoint variations. Moreover, it competes with computationally expensive\nmethods on some benchmark datasets at a small fraction of their online\ninference time.\n","authors":["Bruno Arcanjo","Bruno Ferrarini","Maria Fasli","Michael Milford","Klaus D. McDonald-Maier","Shoaib Ehsan"],"pdf_url":"https://arxiv.org/pdf/2312.12995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12990v1","updated":"2023-12-20T12:48:18Z","published":"2023-12-20T12:48:18Z","title":"Multi-task Learning To Improve Semantic Segmentation Of CBCT Scans Using\n Image Reconstruction","summary":" Semantic segmentation is a crucial task in medical image processing,\nessential for segmenting organs or lesions such as tumors. In this study we aim\nto improve automated segmentation in CBCTs through multi-task learning. To\nevaluate effects on different volume qualities, a CBCT dataset is synthesised\nfrom the CT Liver Tumor Segmentation Benchmark (LiTS) dataset. To improve\nsegmentation, two approaches are investigated. First, we perform multi-task\nlearning to add morphology based regularization through a volume reconstruction\ntask. Second, we use this reconstruction task to reconstruct the best quality\nCBCT (most similar to the original CT), facilitating denoising effects. We\nexplore both holistic and patch-based approaches. Our findings reveal that,\nespecially using a patch-based approach, multi-task learning improves\nsegmentation in most cases and that these results can further be improved by\nour denoising approach.\n","authors":["Maximilian Ernst Tschuchnig","Julia Coste-Marin","Philipp Steininger","Michael Gadermayr"],"pdf_url":"https://arxiv.org/pdf/2312.12990v1.pdf","comment":"Accepted at German Conference on Medical Image Computing (BVM) 2024"},{"id":"http://arxiv.org/abs/2312.12436v2","updated":"2023-12-20T12:40:47Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Zihan Wang","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Hongsheng Li","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12436v2.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2312.12970v1","updated":"2023-12-20T12:19:17Z","published":"2023-12-20T12:19:17Z","title":"D3Former: Jointly Learning Repeatable Dense Detectors and\n Feature-enhanced Descriptors via Saliency-guided Transformer","summary":" Establishing accurate and representative matches is a crucial step in\naddressing the point cloud registration problem. A commonly employed approach\ninvolves detecting keypoints with salient geometric features and subsequently\nmapping these keypoints from one frame of the point cloud to another. However,\nmethods within this category are hampered by the repeatability of the sampled\nkeypoints. In this paper, we introduce a saliency-guided trans\\textbf{former},\nreferred to as \\textit{D3Former}, which entails the joint learning of\nrepeatable \\textbf{D}ense \\textbf{D}etectors and feature-enhanced\n\\textbf{D}escriptors. The model comprises a Feature Enhancement Descriptor\nLearning (FEDL) module and a Repetitive Keypoints Detector Learning (RKDL)\nmodule. The FEDL module utilizes a region attention mechanism to enhance\nfeature distinctiveness, while the RKDL module focuses on detecting repeatable\nkeypoints to enhance matching capabilities. Extensive experimental results on\nchallenging indoor and outdoor benchmarks demonstrate that our proposed method\nconsistently outperforms state-of-the-art point cloud matching methods.\nNotably, tests on 3DLoMatch, even with a low overlap ratio, show that our\nmethod consistently outperforms recently published approaches such as RoReg and\nRoITr. For instance, with the number of extracted keypoints reduced to 250, the\nregistration recall scores for RoReg, RoITr, and our method are 64.3\\%, 73.6\\%,\nand 76.5\\%, respectively.\n","authors":["Junjie Gao","Pengfei Wang","Qiujie Dong","Qiong Zeng","Shiqing Xin","Caiming Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12970v1.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.02273v3","updated":"2023-12-20T12:10:09Z","published":"2023-07-05T13:17:14Z","title":"Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient\n Neural Image Compression","summary":" Recently, the performance of neural image compression (NIC) has steadily\nimproved thanks to the last line of study, reaching or outperforming\nstate-of-the-art conventional codecs. Despite significant progress, current NIC\nmethods still rely on ConvNet-based entropy coding, limited in modeling\nlong-range dependencies due to their local connectivity and the increasing\nnumber of architectural biases and priors, resulting in complex underperforming\nmodels with high decoding latency. Motivated by the efficiency investigation of\nthe Tranformer-based transform coding framework, namely SwinT-ChARM, we propose\nto enhance the latter, as first, with a more straightforward yet effective\nTranformer-based channel-wise auto-regressive prior model, resulting in an\nabsolute image compression transformer (ICT). Through the proposed ICT, we can\ncapture both global and local contexts from the latent representations and\nbetter parameterize the distribution of the quantized latents. Further, we\nleverage a learnable scaling module with a sandwich ConvNeXt-based\npre-/post-processor to accurately extract more compact latent codes while\nreconstructing higher-quality images. Extensive experimental results on\nbenchmark datasets showed that the proposed framework significantly improves\nthe trade-off between coding efficiency and decoder complexity over the\nversatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec\nSwinT-ChARM. Moreover, we provide model scaling studies to verify the\ncomputational efficiency of our approach and conduct several objective and\nsubjective analyses to bring to the fore the performance gap between the\nadaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.\n","authors":["Ahmed Ghorbel","Wassim Hamidouche","Luce Morin"],"pdf_url":"https://arxiv.org/pdf/2307.02273v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12961v1","updated":"2023-12-20T12:05:59Z","published":"2023-12-20T12:05:59Z","title":"Radar Fields: An Extension of Radiance Fields to SAR","summary":" Radiance fields have been a major breakthrough in the field of inverse\nrendering, novel view synthesis and 3D modeling of complex scenes from\nmulti-view image collections. Since their introduction, it was shown that they\ncould be extended to other modalities such as LiDAR, radio frequencies, X-ray\nor ultrasound. In this paper, we show that, despite the important difference\nbetween optical and synthetic aperture radar (SAR) image formation models, it\nis possible to extend radiance fields to radar images thus presenting the first\n\"radar fields\". This allows us to learn surface models using only collections\nof radar images, similar to how regular radiance fields are learned and with\nthe same computational complexity on average. Thanks to similarities in how\nboth fields are defined, this work also shows a potential for hybrid methods\ncombining both optical and SAR images.\n","authors":["Thibaud Ehret","Roger Marí","Dawa Derksen","Nicolas Gasnier","Gabriele Facciolo"],"pdf_url":"https://arxiv.org/pdf/2312.12961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12954v1","updated":"2023-12-20T11:51:49Z","published":"2023-12-20T11:51:49Z","title":"TADAP: Trajectory-Aided Drivable area Auto-labeling with Pre-trained\n self-supervised features in winter driving conditions","summary":" Detection of the drivable area in all conditions is crucial for autonomous\ndriving and advanced driver assistance systems. However, the amount of labeled\ndata in adverse driving conditions is limited, especially in winter, and\nsupervised methods generalize poorly to conditions outside the training\ndistribution. For easy adaption to all conditions, the need for human\nannotation should be removed from the learning process. In this paper,\nTrajectory-Aided Drivable area Auto-labeling with Pre-trained self-supervised\nfeatures (TADAP) is presented for automated annotation of the drivable area in\nwinter driving conditions. A sample of the drivable area is extracted based on\nthe trajectory estimate from the global navigation satellite system. Similarity\nwith the sample area is determined based on pre-trained self-supervised visual\nfeatures. Image areas similar to the sample area are considered to be drivable.\nThese TADAP labels were evaluated with a novel winter-driving dataset,\ncollected in varying driving scenes. A prediction model trained with the TADAP\nlabels achieved a +9.6 improvement in intersection over union compared to the\nprevious state-of-the-art of self-supervised drivable area detection.\n","authors":["Eerik Alamikkotervo","Risto Ojala","Alvari Seppänen","Kari Tammi"],"pdf_url":"https://arxiv.org/pdf/2312.12954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.02916v2","updated":"2023-12-20T11:42:46Z","published":"2023-12-05T17:46:52Z","title":"MIND: Multi-Task Incremental Network Distillation","summary":" The recent surge of pervasive devices that generate dynamic data streams has\nunderscored the necessity for learning systems to adapt continually to data\ndistributional shifts. To tackle this challenge, the research community has put\nforth a spectrum of methodologies, including the demanding pursuit of\nclass-incremental learning without replay data. In this study, we present MIND,\na parameter isolation method that aims to significantly enhance the performance\nof replay-free solutions and achieve state-of-the-art results on several widely\nstudied datasets. Our approach introduces two main contributions: two\nalternative distillation procedures that significantly improve the efficiency\nof MIND increasing the accumulated knowledge of each sub-network, and the\noptimization of the BachNorm layers across tasks inside the sub-networks.\nOverall, MIND outperforms all the state-of-the-art methods for rehearsal-free\nClass-Incremental learning (with an increment in classification accuracy of\napprox. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx.\n+40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each\ncontribution to demonstrate its impact on performance improvement. Our results\nshowcase the superior performance of MIND indicating its potential for\naddressing the challenges posed by Class-incremental and Domain-Incremental\nlearning in resource-constrained environments.\n","authors":["Jacopo Bonato","Francesco Pelosin","Luigi Sabetta","Alessandro Nicolosi"],"pdf_url":"https://arxiv.org/pdf/2312.02916v2.pdf","comment":"Accepted at the 38th AAAI Conference on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2308.10542v2","updated":"2023-12-20T11:17:24Z","published":"2023-08-21T07:52:39Z","title":"Learning Weakly Convex Regularizers for Convergent Image-Reconstruction\n Algorithms","summary":" We propose to learn non-convex regularizers with a prescribed upper bound on\ntheir weak-convexity modulus. Such regularizers give rise to variational\ndenoisers that minimize a convex energy. They rely on few parameters (less than\n15,000) and offer a signal-processing interpretation as they mimic handcrafted\nsparsity-promoting regularizers. Through numerical experiments, we show that\nsuch denoisers outperform convex-regularization methods as well as the popular\nBM3D denoiser. Additionally, the learned regularizer can be deployed to solve\ninverse problems with iterative schemes that provably converge. For both CT and\nMRI reconstruction, the regularizer generalizes well and offers an excellent\ntradeoff between performance, number of parameters, guarantees, and\ninterpretability when compared to other data-driven approaches.\n","authors":["Alexis Goujon","Sebastian Neumayer","Michael Unser"],"pdf_url":"https://arxiv.org/pdf/2308.10542v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.04810v2","updated":"2023-12-20T11:17:20Z","published":"2023-12-08T02:59:29Z","title":"RS-Corrector: Correcting the Racial Stereotypes in Latent Diffusion\n Models","summary":" Recent text-conditioned image generation models have demonstrated an\nexceptional capacity to produce diverse and creative imagery with high visual\nquality. However, when pre-trained on billion-sized datasets randomly collected\nfrom the Internet, where potential biased human preferences exist, these models\ntend to produce images with common and recurring stereotypes, particularly for\ncertain racial groups. In this paper, we conduct an initial analysis of the\npublicly available Stable Diffusion model and its derivatives, highlighting the\npresence of racial stereotypes. These models often generate distorted or biased\nimages for certain racial groups, emphasizing stereotypical characteristics. To\naddress these issues, we propose a framework called \"RS-Corrector\", designed to\nestablish an anti-stereotypical preference in the latent space and update the\nlatent code for refined generated results. The correction process occurs during\nthe inference stage without requiring fine-tuning of the original model.\nExtensive empirical evaluations demonstrate that the introduced \\themodel\neffectively corrects the racial stereotypes of the well-trained Stable\nDiffusion model while leaving the original model unchanged.\n","authors":["Yue Jiang","Yueming Lyu","Tianxiang Ma","Bo Peng","Jing Dong"],"pdf_url":"https://arxiv.org/pdf/2312.04810v2.pdf","comment":"16 pages, 15 figures, conference"},{"id":"http://arxiv.org/abs/2303.09429v2","updated":"2023-12-20T11:07:57Z","published":"2023-03-16T16:02:24Z","title":"Data Roaming and Quality Assessment for Composed Image Retrieval","summary":" The task of Composed Image Retrieval (CoIR) involves queries that combine\nimage and text modalities, allowing users to express their intent more\neffectively. However, current CoIR datasets are orders of magnitude smaller\ncompared to other vision and language (V&L) datasets. Additionally, some of\nthese datasets have noticeable issues, such as queries containing redundant\nmodalities. To address these shortcomings, we introduce the Large Scale\nComposed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times\nlarger than existing ones. Pre-training on our LaSCo, shows a noteworthy\nimprovement in performance, even in zero-shot. Furthermore, we propose a new\napproach for analyzing CoIR datasets and methods, which detects modality\nredundancy or necessity, in queries. We also introduce a new CoIR baseline, the\nCross-Attention driven Shift Encoder (CASE). This baseline allows for early\nfusion of modalities using a cross-attention module and employs an additional\nauxiliary task during training. Our experiments demonstrate that this new\nbaseline outperforms the current state-of-the-art methods on established\nbenchmarks like FashionIQ and CIRR.\n","authors":["Matan Levy","Rami Ben-Ari","Nir Darshan","Dani Lischinski"],"pdf_url":"https://arxiv.org/pdf/2303.09429v2.pdf","comment":"Camera Ready version for AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12917v1","updated":"2023-12-20T10:53:06Z","published":"2023-12-20T10:53:06Z","title":"Sign Language Production with Latent Motion Transformer","summary":" Sign Language Production (SLP) is the tough task of turning sign language\ninto sign videos. The main goal of SLP is to create these videos using a sign\ngloss. In this research, we've developed a new method to make high-quality sign\nvideos without using human poses as a middle step. Our model works in two main\nparts: first, it learns from a generator and the video's hidden features, and\nnext, it uses another model to understand the order of these hidden features.\nTo make this method even better for sign videos, we make several significant\nimprovements. (i) In the first stage, we take an improved 3D VQ-GAN to learn\ndownsampled latent representations. (ii) In the second stage, we introduce\nsequence-to-sequence attention to better leverage conditional information.\n(iii) The separated two-stage training discards the realistic visual semantic\nof the latent codes in the second stage. To endow the latent sequences semantic\ninformation, we extend the token-level autoregressive latent codes learning\nwith perceptual loss and reconstruction loss for the prior model with visual\nperception. Compared with previous state-of-the-art approaches, our model\nperforms consistently better on two word-level sign language datasets, i.e.,\nWLASL and NMFs-CSL.\n","authors":["Pan Xie","Taiyi Peng","Yao Du","Qipeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12917v1.pdf","comment":"Accepted by WACV2024"},{"id":"http://arxiv.org/abs/2312.12913v1","updated":"2023-12-20T10:49:49Z","published":"2023-12-20T10:49:49Z","title":"Produce Once, Utilize Twice for Anomaly Detection","summary":" Visual anomaly detection aims at classifying and locating the regions that\ndeviate from the normal appearance. Embedding-based methods and\nreconstruction-based methods are two main approaches for this task. However,\nthey are either not efficient or not precise enough for the industrial\ndetection. To deal with this problem, we derive POUTA (Produce Once Utilize\nTwice for Anomaly detection), which improves both the accuracy and efficiency\nby reusing the discriminant information potential in the reconstructive\nnetwork. We observe that the encoder and decoder representations of the\nreconstructive network are able to stand for the features of the original and\nreconstructed image respectively. And the discrepancies between the symmetric\nreconstructive representations provides roughly accurate anomaly information.\nTo refine this information, a coarse-to-fine process is proposed in POUTA,\nwhich calibrates the semantics of each discriminative layer by the high-level\nrepresentations and supervision loss. Equipped with the above modules, POUTA is\nendowed with the ability to provide a more precise anomaly location than the\nprior arts. Besides, the representation reusage also enables to exclude the\nfeature extraction process in the discriminative network, which reduces the\nparameters and improves the efficiency. Extensive experiments show that, POUTA\nis superior or comparable to the prior methods with even less cost.\nFurthermore, POUTA also achieves better performance than the state-of-the-art\nfew-shot anomaly detection methods without any special design, showing that\nPOUTA has strong ability to learn representations inherent in the training\ndata.\n","authors":["Shuyuan Wang","Qi Li","Huiyuan Luo","Chengkan Lv","Zhengtao Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12913v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08288v2","updated":"2023-12-20T10:46:33Z","published":"2023-12-13T17:04:16Z","title":"Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data\n Setting","summary":" Deep learning models are known to suffer from the problem of bias, and\nresearchers have been exploring methods to address this issue. However, most of\nthese methods require prior knowledge of the bias and are not always practical.\nIn this paper, we focus on a more practical setting with no prior information\nabout the bias. Generally, in this setting, there are a large number of\nbias-aligned samples that cause the model to produce biased predictions and a\nfew bias-conflicting samples that do not conform to the bias. If the training\ndata is limited, the influence of the bias-aligned samples may become even\nstronger on the model predictions, and we experimentally demonstrate that\nexisting debiasing techniques suffer severely in such cases. In this paper, we\nexamine the effects of unknown bias in small dataset regimes and present a\nnovel approach to mitigate this issue. The proposed approach directly addresses\nthe issue of the extremely low occurrence of bias-conflicting samples in\nlimited data settings through the synthesis of hybrid samples that can be used\nto reduce the effect of bias. We perform extensive experiments on several\nbenchmark datasets and experimentally demonstrate the effectiveness of our\nproposed approach in addressing any unknown bias in the presence of limited\ndata. Specifically, our approach outperforms the vanilla, LfF, LDD, and DebiAN\ndebiasing methods by absolute margins of 10.39%, 9.08%, 8.07%, and 9.67% when\nonly 10% of the Corrupted CIFAR-10 Type 1 dataset is available with a\nbias-conflicting sample ratio of 0.05.\n","authors":["Piyush Arora","Pratik Mazumder"],"pdf_url":"https://arxiv.org/pdf/2312.08288v2.pdf","comment":"Accepted in WACV 2024"},{"id":"http://arxiv.org/abs/2312.12908v1","updated":"2023-12-20T10:45:22Z","published":"2023-12-20T10:45:22Z","title":"The Common Optical Music Recognition Evaluation Framework","summary":" The quality of Optical Music Recognition (OMR) systems is a rather difficult\nmagnitude to measure. There is no lingua franca shared among OMR datasets that\nallows to compare systems' performance on equal grounds, since most of them are\nspecialised on certain approaches. As a result, most state-of-the-art works\ncurrently report metrics that cannot be compared directly. In this paper we\nidentify the need of a common music representation language and propose the\nMusic Tree Notation (MTN) format, thanks to which the definition of standard\nmetrics is possible. This format represents music as a set of primitives that\ngroup together into higher-abstraction nodes, a compromise between the\nexpression of fully graph-based and sequential notation formats. We have also\ndeveloped a specific set of OMR metrics and a typeset score dataset as a proof\nof concept of this idea.\n","authors":["Pau Torras","Sanket Biswas","Alicia Fornés"],"pdf_url":"https://arxiv.org/pdf/2312.12908v1.pdf","comment":"18 pages, 4 figures, 3 tables, submitted (under review) for the\n International Journal in Document Analysis and Recognition"},{"id":"http://arxiv.org/abs/2312.12880v1","updated":"2023-12-20T09:45:21Z","published":"2023-12-20T09:45:21Z","title":"Testing the Segment Anything Model on radiology data","summary":" Deep learning models trained with large amounts of data have become a recent\nand effective approach to predictive problem solving -- these have become known\nas \"foundation models\" as they can be used as fundamental tools for other\napplications. While the paramount examples of image classification (earlier)\nand large language models (more recently) led the way, the Segment Anything\nModel (SAM) was recently proposed and stands as the first foundation model for\nimage segmentation, trained on over 10 million images and with recourse to over\n1 billion masks. However, the question remains -- what are the limits of this\nfoundation? Given that magnetic resonance imaging (MRI) stands as an important\nmethod of diagnosis, we sought to understand whether SAM could be used for a\nfew tasks of zero-shot segmentation using MRI data. Particularly, we wanted to\nknow if selecting masks from the pool of SAM predictions could lead to good\nsegmentations.\n Here, we provide a critical assessment of the performance of SAM on magnetic\nresonance imaging data. We show that, while acceptable in a very limited set of\ncases, the overall trend implies that these models are insufficient for MRI\nsegmentation across the whole volume, but can provide good segmentations in a\nfew, specific slices. More importantly, we note that while foundation models\ntrained on natural images are set to become key aspects of predictive\nmodelling, they may prove ineffective when used on other imaging modalities.\n","authors":["José Guilherme de Almeida","Nuno M. Rodrigues","Sara Silva","Nickolas Papanikolaou"],"pdf_url":"https://arxiv.org/pdf/2312.12880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12877v1","updated":"2023-12-20T09:39:55Z","published":"2023-12-20T09:39:55Z","title":"Relightable and Animatable Neural Avatars from Videos","summary":" Lightweight creation of 3D digital avatars is a highly desirable but\nchallenging task. With only sparse videos of a person under unknown\nillumination, we propose a method to create relightable and animatable neural\navatars, which can be used to synthesize photorealistic images of humans under\nnovel viewpoints, body poses, and lighting. The key challenge here is to\ndisentangle the geometry, material of the clothed body, and lighting, which\nbecomes more difficult due to the complex geometry and shadow changes caused by\nbody motions. To solve this ill-posed problem, we propose novel techniques to\nbetter model the geometry and shadow changes. For geometry change modeling, we\npropose an invertible deformation field, which helps to solve the inverse\nskinning problem and leads to better geometry quality. To model the spatial and\ntemporal varying shading cues, we propose a pose-aware part-wise light\nvisibility network to estimate light occlusion. Extensive experiments on\nsynthetic and real datasets show that our approach reconstructs high-quality\ngeometry and generates realistic shadows under different body poses. Code and\ndata are available at\n\\url{https://wenbin-lin.github.io/RelightableAvatar-page/}.\n","authors":["Wenbin Lin","Chengwei Zheng","Jun-Hai Yong","Feng Xu"],"pdf_url":"https://arxiv.org/pdf/2312.12877v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12876v1","updated":"2023-12-20T09:39:53Z","published":"2023-12-20T09:39:53Z","title":"COVID-19 Diagnosis: ULGFBP-ResNet51 approach on the CT and the Chest\n X-ray Images Classification","summary":" The contagious and pandemic COVID-19 disease is currently considered as the\nmain health concern and posed widespread panic across human-beings. It affects\nthe human respiratory tract and lungs intensely. So that it has imposed\nsignificant threats for premature death. Although, its early diagnosis can play\na vital role in revival phase, the radiography tests with the manual\nintervention are a time-consuming process. Time is also limited for such manual\ninspecting of numerous patients in the hospitals. Thus, the necessity of\nautomatic diagnosis on the chest X-ray or the CT images with a high efficient\nperformance is urgent. Toward this end, we propose a novel method, named as the\nULGFBP-ResNet51 to tackle with the COVID-19 diagnosis in the images. In fact,\nthis method includes Uniform Local Binary Pattern (ULBP), Gabor Filter (GF),\nand ResNet51. According to our results, this method could offer superior\nperformance in comparison with the other methods, and attain maximum accuracy.\n","authors":["Vida Esmaeili","Mahmood Mohassel Feghhi","Seyed Omid Shahdi"],"pdf_url":"https://arxiv.org/pdf/2312.12876v1.pdf","comment":"16 pages, 8 figures, submitted for possible journal publication"},{"id":"http://arxiv.org/abs/2312.12872v1","updated":"2023-12-20T09:37:06Z","published":"2023-12-20T09:37:06Z","title":"Integration and Performance Analysis of Artificial Intelligence and\n Computer Vision Based on Deep Learning Algorithms","summary":" This paper focuses on the analysis of the application effectiveness of the\nintegration of deep learning and computer vision technologies. Deep learning\nachieves a historic breakthrough by constructing hierarchical neural networks,\nenabling end-to-end feature learning and semantic understanding of images. The\nsuccessful experiences in the field of computer vision provide strong support\nfor training deep learning algorithms. The tight integration of these two\nfields has given rise to a new generation of advanced computer vision systems,\nsignificantly surpassing traditional methods in tasks such as machine vision\nimage classification and object detection. In this paper, typical image\nclassification cases are combined to analyze the superior performance of deep\nneural network models while also pointing out their limitations in\ngeneralization and interpretability, proposing directions for future\nimprovements. Overall, the efficient integration and development trend of deep\nlearning with massive visual data will continue to drive technological\nbreakthroughs and application expansion in the field of computer vision, making\nit possible to build truly intelligent machine vision systems. This deepening\nfusion paradigm will powerfully promote unprecedented tasks and functions in\ncomputer vision, providing stronger development momentum for related\ndisciplines and industries.\n","authors":["Bo Liu","Liqiang Yu","Chang Che","Qunwei Lin","Hao Hu","Xinyu Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.12872v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12870v1","updated":"2023-12-20T09:34:22Z","published":"2023-12-20T09:34:22Z","title":"The Audio-Visual Conversational Graph: From an Egocentric-Exocentric\n Perspective","summary":" In recent years, the thriving development of research related to egocentric\nvideos has provided a unique perspective for the study of conversational\ninteractions, where both visual and audio signals play a crucial role. While\nmost prior work focus on learning about behaviors that directly involve the\ncamera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction\nproblem, marking the first attempt to infer exocentric conversational\ninteractions from egocentric videos. We propose a unified multi-modal,\nmulti-task framework -- Audio-Visual Conversational Attention (Av-CONV), for\nthe joint prediction of conversation behaviors -- speaking and listening -- for\nboth the camera wearer as well as all other social partners present in the\negocentric video. Specifically, we customize the self-attention mechanism to\nmodel the representations across-time, across-subjects, and across-modalities.\nTo validate our method, we conduct experiments on a challenging egocentric\nvideo dataset that includes first-person perspective, multi-speaker, and\nmulti-conversation scenarios. Our results demonstrate the superior performance\nof our method compared to a series of baselines. We also present detailed\nablation studies to assess the contribution of each component in our model.\nProject page: https://vjwq.github.io/AV-CONV/.\n","authors":["Wenqi Jia","Miao Liu","Hao Jiang","Ishwarya Ananthabhotla","James M. Rehg","Vamsi Krishna Ithapu","Ruohan Gao"],"pdf_url":"https://arxiv.org/pdf/2312.12870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12865v1","updated":"2023-12-20T09:27:41Z","published":"2023-12-20T09:27:41Z","title":"RadEdit: stress-testing biomedical vision models via diffusion image\n editing","summary":" Biomedical imaging datasets are often small and biased, meaning that\nreal-world performance of predictive models can be substantially lower than\nexpected from internal testing. This work proposes using generative image\nediting to simulate dataset shifts and diagnose failure modes of biomedical\nvision models; this can be used in advance of deployment to assess readiness,\npotentially reducing cost and patient harm. Existing editing methods can\nproduce undesirable changes, with spurious correlations learned due to the\nco-occurrence of disease and treatment interventions, limiting practical\napplicability. To address this, we train a text-to-image diffusion model on\nmultiple chest X-ray datasets and introduce a new editing method RadEdit that\nuses multiple masks, if present, to constrain changes and ensure consistency in\nthe edited images. We consider three types of dataset shifts: acquisition\nshift, manifestation shift, and population shift, and demonstrate that our\napproach can diagnose failures and quantify model robustness without additional\ndata collection, complementing more qualitative tools for explainable AI.\n","authors":["Fernando Pérez-García","Sam Bond-Taylor","Pedro P. Sanchez","Boris van Breugel","Daniel C. Castro","Harshita Sharma","Valentina Salvatelli","Maria T. A. Wetscherek","Hannah Richardson","Matthew P. Lungren","Aditya Nori","Javier Alvarez-Valle","Ozan Oktay","Maximilian Ilse"],"pdf_url":"https://arxiv.org/pdf/2312.12865v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12856v1","updated":"2023-12-20T09:19:48Z","published":"2023-12-20T09:19:48Z","title":"SkyScript: A Large and Semantically Diverse Vision-Language Dataset for\n Remote Sensing","summary":" Remote sensing imagery, despite its broad applications in helping achieve\nSustainable Development Goals and tackle climate change, has not yet benefited\nfrom the recent advancements of versatile, task-agnostic vision language models\n(VLMs). A key reason is that the large-scale, semantically diverse image-text\ndataset required for developing VLMs is still absent for remote sensing images.\nUnlike natural images, remote sensing images and their associated text\ndescriptions cannot be efficiently collected from the public Internet at scale.\nIn this work, we bridge this gap by using geo-coordinates to automatically\nconnect open, unlabeled remote sensing images with rich semantics covered in\nOpenStreetMap, and thus construct SkyScript, a comprehensive vision-language\ndataset for remote sensing images, comprising 2.6 million image-text pairs\ncovering 29K distinct semantic tags. With continual pre-training on this\ndataset, we obtain a VLM that surpasses baseline models with a 6.2% average\naccuracy gain in zero-shot scene classification across seven benchmark\ndatasets. It also demonstrates the ability of zero-shot transfer for\nfine-grained object attribute classification and cross-modal retrieval. We hope\nthis dataset can support the advancement of VLMs for various multi-modal tasks\nin remote sensing, such as open-vocabulary classification, retrieval,\ncaptioning, and text-to-image synthesis.\n","authors":["Zhecheng Wang","Rajanie Prabha","Tianyuan Huang","Jiajun Wu","Ram Rajagopal"],"pdf_url":"https://arxiv.org/pdf/2312.12856v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2311.15803v2","updated":"2023-12-20T09:15:57Z","published":"2023-11-27T13:25:47Z","title":"SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using\n Neural Radiance Fields","summary":" In rapidly-evolving domains such as autonomous driving, the use of multiple\nsensors with different modalities is crucial to ensure high operational\nprecision and stability. To correctly exploit the provided information by each\nsensor in a single common frame, it is essential for these sensors to be\naccurately calibrated. In this paper, we leverage the ability of Neural\nRadiance Fields (NeRF) to represent different sensors modalities in a common\nvolumetric representation to achieve robust and accurate spatio-temporal sensor\ncalibration. By designing a partitioning approach based on the visible part of\nthe scene for each sensor, we formulate the calibration problem using only the\noverlapping areas. This strategy results in a more robust and accurate\ncalibration that is less prone to failure. We demonstrate that our approach\nworks on outdoor urban scenes by validating it on multiple established driving\ndatasets. Results show that our method is able to get better accuracy and\nrobustness compared to existing methods.\n","authors":["Quentin Herau","Nathan Piasco","Moussab Bennehar","Luis Roldão","Dzmitry Tsishkou","Cyrille Migniot","Pascal Vasseur","Cédric Demonceaux"],"pdf_url":"https://arxiv.org/pdf/2311.15803v2.pdf","comment":"Paper + Supplementary, under review. Project page:\n https://qherau.github.io/SOAC/"},{"id":"http://arxiv.org/abs/2310.14958v2","updated":"2023-12-20T09:10:00Z","published":"2023-10-23T14:02:57Z","title":"Learning Real-World Image De-Weathering with Imperfect Supervision","summary":" Real-world image de-weathering aims at removing various undesirable\nweather-related artifacts. Owing to the impossibility of capturing image pairs\nconcurrently, existing real-world de-weathering datasets often exhibit\ninconsistent illumination, position, and textures between the ground-truth\nimages and the input degraded images, resulting in imperfect supervision. Such\nnon-ideal supervision negatively affects the training process of learning-based\nde-weathering methods. In this work, we attempt to address the problem with a\nunified solution for various inconsistencies. Specifically, inspired by\ninformation bottleneck theory, we first develop a Consistent Label Constructor\n(CLC) to generate a pseudo-label as consistent as possible with the input\ndegraded image while removing most weather-related degradations. In particular,\nmultiple adjacent frames of the current input are also fed into CLC to enhance\nthe pseudo-label. Then we combine the original imperfect labels and\npseudo-labels to jointly supervise the de-weathering model by the proposed\nInformation Allocation Strategy (IAS). During testing, only the de-weathering\nmodel is used for inference. Experiments on two real-world de-weathering\ndatasets show that our method helps existing de-weathering models achieve\nbetter performance. Codes are available at\nhttps://github.com/1180300419/imperfect-deweathering.\n","authors":["Xiaohui Liu","Zhilu Zhang","Xiaohe Wu","Chaoyu Feng","Xiaotao Wang","LEI LEI","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2310.14958v2.pdf","comment":"17 pages, 14 figures"},{"id":"http://arxiv.org/abs/2308.14078v2","updated":"2023-12-20T09:04:05Z","published":"2023-08-27T11:52:00Z","title":"Sparse3D: Distilling Multiview-Consistent Diffusion for Object\n Reconstruction from Sparse Views","summary":" Reconstructing 3D objects from extremely sparse views is a long-standing and\nchallenging problem. While recent techniques employ image diffusion models for\ngenerating plausible images at novel viewpoints or for distilling pre-trained\ndiffusion priors into 3D representations using score distillation sampling\n(SDS), these methods often struggle to simultaneously achieve high-quality,\nconsistent, and detailed results for both novel-view synthesis (NVS) and\ngeometry. In this work, we present Sparse3D, a novel 3D reconstruction method\ntailored for sparse view inputs. Our approach distills robust priors from a\nmultiview-consistent diffusion model to refine a neural radiance field.\nSpecifically, we employ a controller that harnesses epipolar features from\ninput views, guiding a pre-trained diffusion model, such as Stable Diffusion,\nto produce novel-view images that maintain 3D consistency with the input. By\ntapping into 2D priors from powerful image diffusion models, our integrated\nmodel consistently delivers high-quality results, even when faced with\nopen-world objects. To address the blurriness introduced by conventional SDS,\nwe introduce the category-score distillation sampling (C-SDS) to enhance\ndetail. We conduct experiments on CO3DV2 which is a multi-view dataset of\nreal-world objects. Both quantitative and qualitative evaluations demonstrate\nthat our approach outperforms previous state-of-the-art works on the metrics\nregarding NVS and geometry reconstruction.\n","authors":["Zi-Xin Zou","Weihao Cheng","Yan-Pei Cao","Shi-Sheng Huang","Ying Shan","Song-Hai Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.14078v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.09169v3","updated":"2023-12-20T08:58:03Z","published":"2022-01-23T03:39:31Z","title":"Rich Action-semantic Consistent Knowledge for Early Action Prediction","summary":" Early action prediction (EAP) aims to recognize human actions from a part of\naction execution in ongoing videos, which is an important task for many\npractical applications. Most prior works treat partial or full videos as a\nwhole, ignoring rich action knowledge hidden in videos, i.e., semantic\nconsistencies among different partial videos. In contrast, we partition\noriginal partial or full videos to form a new series of partial videos and mine\nthe Action-Semantic Consistent Knowledge (ASCK) among these new partial videos\nevolving in arbitrary progress levels. Moreover, a novel Rich Action-semantic\nConsistent Knowledge network (RACK) under the teacher-student framework is\nproposed for EAP. Firstly, we use a two-stream pre-trained model to extract\nfeatures of videos. Secondly, we treat the RGB or flow features of the partial\nvideos as nodes and their action semantic consistencies as edges. Next, we\nbuild a bi-directional semantic graph for the teacher network and a\nsingle-directional semantic graph for the student network to model rich ASCK\namong partial videos. The MSE and MMD losses are incorporated as our\ndistillation loss to enrich the ASCK of partial videos from the teacher to the\nstudent network. Finally, we obtain the final prediction by summering the\nlogits of different subnetworks and applying a softmax layer. Extensive\nexperiments and ablative studies have been conducted, demonstrating the\neffectiveness of modeling rich ASCK for EAP. With the proposed RACK, we have\nachieved state-of-the-art performance on three benchmarks. The code is\navailable at https://github.com/lily2lab/RACK.git.\n","authors":["Xiaoli Liu","Jianqin Yin","Di Guo","Huaping Liu"],"pdf_url":"https://arxiv.org/pdf/2201.09169v3.pdf","comment":"Accepted by IEEE TIP,15pages"},{"id":"http://arxiv.org/abs/2312.12848v1","updated":"2023-12-20T08:56:35Z","published":"2023-12-20T08:56:35Z","title":"Quantum Annealing for Computer Vision Minimization Problems","summary":" Computer Vision (CV) labelling algorithms play a pivotal role in the domain\nof low-level vision. For decades, it has been known that these problems can be\nelegantly formulated as discrete energy minimization problems derived from\nprobabilistic graphical models (such as Markov Random Fields). Despite recent\nadvances in inference algorithms (such as graph-cut and message-passing\nalgorithms), the resulting energy minimization problems are generally viewed as\nintractable. The emergence of quantum computations, which offer the potential\nfor faster solutions to certain problems than classical methods, has led to an\nincreased interest in utilizing quantum properties to overcome intractable\nproblems. Recently, there has also been a growing interest in Quantum Computer\nVision (QCV), with the hope of providing a credible alternative or assistant to\ndeep learning solutions in the field. This study investigates a new Quantum\nAnnealing based inference algorithm for CV discrete energy minimization\nproblems. Our contribution is focused on Stereo Matching as a significant CV\nlabeling problem. As a proof of concept, we also use a hybrid quantum-classical\nsolver provided by D-Wave System to compare our results with the best classical\ninference algorithms in the literature.\n","authors":["Shahrokh Heidari","Michael J. Dinneen","Patrice Delmas"],"pdf_url":"https://arxiv.org/pdf/2312.12848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07879v2","updated":"2023-12-20T08:53:40Z","published":"2023-12-13T03:48:45Z","title":"CoIE: Chain-of-Instruct Editing for Multi-Attribute Face Manipulation","summary":" Current text-to-image editing models often encounter challenges with smoothly\nmanipulating multiple attributes using a single instruction. Taking inspiration\nfrom the Chain-of-Thought prompting technique utilized in language models, we\npresent an innovative concept known as Chain-of-Instruct Editing (CoIE), which\nenhances the capabilities of these models through step-by-step editing using a\nseries of instructions. In particular, in the context of face manipulation, we\nleverage the contextual learning abilities of a pretrained Large Language Model\n(LLM), such as GPT-4, to generate a sequence of instructions from the original\ninput, utilizing a purpose-designed 1-shot template. To further improve the\nprecision of each editing step, we conduct fine-tuning on the editing models\nusing our self-constructed instruction-guided face editing dataset,\nInstruct-CelebA. And additionally, we incorporate a super-resolution module to\nmitigate the adverse effects of editability and quality degradation.\nExperimental results across various challenging cases confirm the significant\nboost in multi-attribute facial image manipulation using chain-of-instruct\nediting. This is evident in enhanced editing success rates, measured by CLIPSim\nand Coverage metrics, improved by 17.86% and 85.45% respectively, and\nheightened controllability indicated by Preserve L1 and Quality metrics,\nimproved by 11.58% and 4.93% respectively.\n","authors":["Zhenduo Zhang","Bo-Wen Zhang","Guang Liu"],"pdf_url":"https://arxiv.org/pdf/2312.07879v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10079v3","updated":"2023-12-20T08:49:59Z","published":"2023-08-19T17:59:12Z","title":"MeDM: Mediating Image Diffusion Models for Video-to-Video Translation\n with Temporal Correspondence Guidance","summary":" This study introduces an efficient and effective method, MeDM, that utilizes\npre-trained image Diffusion Models for video-to-video translation with\nconsistent temporal flow. The proposed framework can render videos from scene\nposition information, such as a normal G-buffer, or perform text-guided editing\non videos captured in real-world scenarios. We employ explicit optical flows to\nconstruct a practical coding that enforces physical constraints on generated\nframes and mediates independent frame-wise scores. By leveraging this coding,\nmaintaining temporal consistency in the generated videos can be framed as an\noptimization problem with a closed-form solution. To ensure compatibility with\nStable Diffusion, we also suggest a workaround for modifying observation-space\nscores in latent Diffusion Models. Notably, MeDM does not require fine-tuning\nor test-time optimization of the Diffusion Models. Through extensive\nqualitative, quantitative, and subjective experiments on various benchmarks,\nthe study demonstrates the effectiveness and superiority of the proposed\napproach. Our project page can be found at https://medm2023.github.io\n","authors":["Ernie Chu","Tzuhsuan Huang","Shuo-Yen Lin","Jun-Cheng Chen"],"pdf_url":"https://arxiv.org/pdf/2308.10079v3.pdf","comment":"Accepted as a conference paper in AAAI 2024. Project page:\n https://medm2023.github.io"},{"id":"http://arxiv.org/abs/2312.12838v1","updated":"2023-12-20T08:42:57Z","published":"2023-12-20T08:42:57Z","title":"FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image\n Segmentation Against Heterogeneous Annotation Noise","summary":" Federated learning (FL) has emerged as a promising paradigm for training\nsegmentation models on decentralized medical data, owing to its\nprivacy-preserving property. However, existing research overlooks the prevalent\nannotation noise encountered in real-world medical datasets, which limits the\nperformance ceilings of FL. In this paper, we, for the first time, identify and\ntackle this problem. For problem formulation, we propose a contour evolution\nfor modeling non-independent and identically distributed (Non-IID) noise across\npixels within each client and then extend it to the case of multi-source data\nto form a heterogeneous noise model (\\textit{i.e.}, Non-IID annotation noise\nacross clients). For robust learning from annotations with such two-level\nNon-IID noise, we emphasize the importance of data quality in model\naggregation, allowing high-quality clients to have a greater impact on FL. To\nachieve this, we propose \\textbf{Fed}erated learning with \\textbf{A}nnotation\nqu\\textbf{A}lity-aware \\textbf{A}ggregat\\textbf{I}on, named \\textbf{FedA$^3$I},\nby introducing a quality factor based on client-wise noise estimation.\nSpecifically, noise estimation at each client is accomplished through the\nGaussian mixture model and then incorporated into model aggregation in a\nlayer-wise manner to up-weight high-quality clients. Extensive experiments on\ntwo real-world medical image segmentation datasets demonstrate the superior\nperformance of FedA$^3$I against the state-of-the-art approaches in dealing\nwith cross-client annotation noise. The code is available at\n\\color{blue}{https://github.com/wnn2000/FedAAAI}.\n","authors":["Nannan Wu","Zhaobin Sun","Zengqiang Yan","Li Yu"],"pdf_url":"https://arxiv.org/pdf/2312.12838v1.pdf","comment":"Accepted at AAAI'24"},{"id":"http://arxiv.org/abs/2312.12833v1","updated":"2023-12-20T08:30:07Z","published":"2023-12-20T08:30:07Z","title":"Learning Exhaustive Correlation for Spectral Super-Resolution: Where\n Unified Spatial-Spectral Attention Meets Mutual Linear Dependence","summary":" Spectral super-resolution from the easily obtainable RGB image to\nhyperspectral image (HSI) has drawn increasing interest in the field of\ncomputational photography. The crucial aspect of spectral super-resolution lies\nin exploiting the correlation within HSIs. However, two types of bottlenecks in\nexisting Transformers limit performance improvement and practical applications.\nFirst, existing Transformers often separately emphasize either spatial-wise or\nspectral-wise correlation, disrupting the 3D features of HSI and hindering the\nexploitation of unified spatial-spectral correlation. Second, the existing\nself-attention mechanism learns the correlation between pairs of tokens and\ncaptures the full-rank correlation matrix, leading to its inability to\nestablish mutual linear dependence among multiple tokens. To address these\nissues, we propose a novel Exhaustive Correlation Transformer (ECT) for\nspectral super-resolution. First, we propose a Spectral-wise Discontinuous 3D\n(SD3D) splitting strategy, which models unified spatial-spectral correlation by\nsimultaneously utilizing spatial-wise continuous splitting and spectral-wise\ndiscontinuous splitting. Second, we propose a Dynamic Low-Rank Mapping (DLRM)\nmodel, which captures mutual linear dependence among multiple tokens through a\ndynamically calculated low-rank dependence map. By integrating unified\nspatial-spectral attention with mutual linear dependence, our ECT can establish\nexhaustive correlation within HSI. The experimental results on both simulated\nand real data indicate that our method achieves state-of-the-art performance.\nCodes and pretrained models will be available later.\n","authors":["Hongyuan Wang","Lizhi Wang","Jiang Xu","Chang Chen","Xue Hu","Fenglong Song","Youliang Yan"],"pdf_url":"https://arxiv.org/pdf/2312.12833v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12340v2","updated":"2023-12-20T08:27:37Z","published":"2023-12-19T17:13:51Z","title":"Scalable Geometric Fracture Assembly via Co-creation Space among\n Assemblers","summary":" Geometric fracture assembly presents a challenging practical task in\narchaeology and 3D computer vision. Previous methods have focused solely on\nassembling fragments based on semantic information, which has limited the\nquantity of objects that can be effectively assembled. Therefore, there is a\nneed to develop a scalable framework for geometric fracture assembly without\nrelying on semantic information. To improve the effectiveness of assembling\ngeometric fractures without semantic information, we propose a co-creation\nspace comprising several assemblers capable of gradually and unambiguously\nassembling fractures. Additionally, we introduce a novel loss function, i.e.,\nthe geometric-based collision loss, to address collision issues during the\nfracture assembly process and enhance the results. Our framework exhibits\nbetter performance on both PartNet and Breaking Bad datasets compared to\nexisting state-of-the-art frameworks. Extensive experiments and quantitative\ncomparisons demonstrate the effectiveness of our proposed framework, which\nfeatures linear computational complexity, enhanced abstraction, and improved\ngeneralization. Our code is publicly available at\nhttps://github.com/Ruiyuan-Zhang/CCS.\n","authors":["Ruiyuan Zhang","Jiaxiang Liu","Zexi Li","Hao Dong","Jie Fu","Chao Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12340v2.pdf","comment":"AAAI2024"},{"id":"http://arxiv.org/abs/2304.03693v2","updated":"2023-12-20T08:19:09Z","published":"2023-04-07T15:30:49Z","title":"Model-Agnostic Gender Debiased Image Captioning","summary":" Image captioning models are known to perpetuate and amplify harmful societal\nbias in the training set. In this work, we aim to mitigate such gender bias in\nimage captioning models. While prior work has addressed this problem by forcing\nmodels to focus on people to reduce gender misclassification, it conversely\ngenerates gender-stereotypical words at the expense of predicting the correct\ngender. From this observation, we hypothesize that there are two types of\ngender bias affecting image captioning models: 1) bias that exploits context to\npredict gender, and 2) bias in the probability of generating certain (often\nstereotypical) words because of gender. To mitigate both types of gender\nbiases, we propose a framework, called LIBRA, that learns from synthetically\nbiased samples to decrease both types of biases, correcting gender\nmisclassification and changing gender-stereotypical words to more neutral ones.\nCode is available at https://github.com/rebnej/LIBRA.\n","authors":["Yusuke Hirota","Yuta Nakashima","Noa Garcia"],"pdf_url":"https://arxiv.org/pdf/2304.03693v2.pdf","comment":"CVPR 2023"},{"id":"http://arxiv.org/abs/2304.03483v3","updated":"2023-12-20T08:18:10Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first, are\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. The second is the recent\n\\textit{Regularization by Denoising (RED)}, which provides a flexible framework\nto exploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. Although the main focus is on dynamic tomography, we also show\nperformance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12828v1","updated":"2023-12-20T08:15:40Z","published":"2023-12-20T08:15:40Z","title":"TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary\n Multi-Label Classification of CLIP Without Training","summary":" Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive\ncapabilities in open-vocabulary classification. The class token in the image\nencoder is trained to capture the global features to distinguish different text\ndescriptions supervised by contrastive loss, making it highly effective for\nsingle-label classification. However, it shows poor performance on multi-label\ndatasets because the global feature tends to be dominated by the most prominent\nclass and the contrastive nature of softmax operation aggravates it. In this\nstudy, we observe that the multi-label classification results heavily rely on\ndiscriminative local features but are overlooked by CLIP. As a result, we\ndissect the preservation of patch-wise spatial information in CLIP and proposed\na local-to-global framework to obtain image tags. It comprises three steps: (1)\npatch-level classification to obtain coarse scores; (2) dual-masking attention\nrefinement (DMAR) module to refine the coarse scores; (3) class-wise\nreidentification (CWR) module to remedy predictions from a global perspective.\nThis framework is solely based on frozen CLIP and significantly enhances its\nmulti-label classification performance on various benchmarks without\ndataset-specific training. Besides, to comprehensively assess the quality and\npracticality of generated tags, we extend their application to the downstream\ntask, i.e., weakly supervised semantic segmentation (WSSS) with generated tags\nas image-level pseudo labels. Experiments demonstrate that this\nclassify-then-segment paradigm dramatically outperforms other annotation-free\nsegmentation methods and validates the effectiveness of generated tags. Our\ncode is available at https://github.com/linyq2117/TagCLIP.\n","authors":["Yuqi Lin","Minghao Chen","Kaipeng Zhang","Hengjia Li","Mingming Li","Zheng Yang","Dongqin Lv","Binbin Lin","Haifeng Liu","Deng Cai"],"pdf_url":"https://arxiv.org/pdf/2312.12828v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12826v1","updated":"2023-12-20T08:05:57Z","published":"2023-12-20T08:05:57Z","title":"ReCo-Diff: Explore Retinex-Based Condition Strategy in Diffusion Model\n for Low-Light Image Enhancement","summary":" Low-light image enhancement (LLIE) has achieved promising performance by\nemploying conditional diffusion models. In this study, we propose ReCo-Diff, a\nnovel approach that incorporates Retinex-based prior as an additional\npre-processing condition to regulate the generating capabilities of the\ndiffusion model. ReCo-Diff first leverages a pre-trained decomposition network\nto produce initial reflectance and illumination maps of the low-light image.\nThen, an adjustment network is introduced to suppress the noise in the\nreflectance map and brighten the illumination map, thus forming the learned\nRetinex-based condition. The condition is integrated into a refinement network,\nimplementing Retinex-based conditional modules that offer sufficient guidance\nat both feature- and image-levels. By treating Retinex theory as a condition,\nReCo-Diff presents a unique perspective for establishing an LLIE-specific\ndiffusion model. Extensive experiments validate the rationality and superiority\nof our ReCo-Diff approach. The code will be made publicly available.\n","authors":["Yuhui Wu","Guoqing Wang","Zhiwen Wang","Yang Yang","Tianyu Li","Peng Wang","Chongyi Li","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2312.12826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12824v1","updated":"2023-12-20T07:58:41Z","published":"2023-12-20T07:58:41Z","title":"FedSODA: Federated Cross-assessment and Dynamic Aggregation for\n Histopathology Segmentation","summary":" Federated learning (FL) for histopathology image segmentation involving\nmultiple medical sites plays a crucial role in advancing the field of accurate\ndisease diagnosis and treatment. However, it is still a task of great\nchallenges due to the sample imbalance across clients and large data\nheterogeneity from disparate organs, variable segmentation tasks, and diverse\ndistribution. Thus, we propose a novel FL approach for histopathology nuclei\nand tissue segmentation, FedSODA, via synthetic-driven cross-assessment\noperation (SO) and dynamic stratified-layer aggregation (DA). Our SO constructs\na cross-assessment strategy to connect clients and mitigate the representation\nbias under sample imbalance. Our DA utilizes layer-wise interaction and dynamic\naggregation to diminish heterogeneity and enhance generalization. The\neffectiveness of our FedSODA has been evaluated on the most extensive\nhistopathology image segmentation dataset from 7 independent datasets. The code\nis available at https://github.com/yuanzhang7/FedSODA.\n","authors":["Yuan Zhang","Yaolei Qi","Xiaoming Qi","Lotfi Senhadji","Yongyue Wei","Feng Chen","Guanyu Yang"],"pdf_url":"https://arxiv.org/pdf/2312.12824v1.pdf","comment":"Accepted by ICASSP2024"},{"id":"http://arxiv.org/abs/2312.03795v2","updated":"2023-12-20T07:52:24Z","published":"2023-12-06T14:13:54Z","title":"AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and\n Reconstruction with Canonical Score Distillation","summary":" Text-to-3D model adaptations have advanced static 3D model quality, but\nsequential 3D model generation, particularly for animatable objects with large\nmotions, is still scarce. Our work proposes AnimatableDreamer, a text-to-4D\ngeneration framework capable of generating diverse categories of non-rigid\nobjects while adhering to the object motions extracted from a monocular video.\nAt its core, AnimatableDreamer is equipped with our novel optimization design\ndubbed Canonical Score Distillation (CSD), which simplifies the generation\ndimension from 4D to 3D by denoising over different frames in the time-varying\ncamera spaces while conducting the distillation process in a unique canonical\nspace shared per video. Concretely, CSD ensures that score gradients\nback-propagate to the canonical space through differentiable warping, hence\nguaranteeing the time-consistent generation and maintaining morphological\nplausibility across different poses. By lifting the 3D generator to 4D with\nwarping functions, AnimatableDreamer offers a novel perspective on non-rigid 3D\nmodel generation and reconstruction. Besides, with inductive knowledge from a\nmulti-view consistent diffusion model, CSD regularizes reconstruction from\nnovel views, thus cyclically enhancing the generation process. Extensive\nexperiments demonstrate the capability of our method in generating\nhigh-flexibility text-guided 3D models from the monocular video, while also\nshowing improved reconstruction performance over typical non-rigid\nreconstruction methods. Project page https://AnimatableDreamer.github.io.\n","authors":["Xinzhou Wang","Yikai Wang","Junliang Ye","Zhengyi Wang","Fuchun Sun","Pengkun Liu","Ling Wang","Kai Sun","Xintong Wang","Bin He"],"pdf_url":"https://arxiv.org/pdf/2312.03795v2.pdf","comment":"Project page: https://animatabledreamer.github.io/"},{"id":"http://arxiv.org/abs/2312.12816v1","updated":"2023-12-20T07:36:38Z","published":"2023-12-20T07:36:38Z","title":"Object-aware Adaptive-Positivity Learning for Audio-Visual Question\n Answering","summary":" This paper focuses on the Audio-Visual Question Answering (AVQA) task that\naims to answer questions derived from untrimmed audible videos. To generate\naccurate answers, an AVQA model is expected to find the most informative\naudio-visual clues relevant to the given questions. In this paper, we propose\nto explicitly consider fine-grained visual objects in video frames\n(object-level clues) and explore the multi-modal relations(i.e., the object,\naudio, and question) in terms of feature interaction and model optimization.\nFor the former, we present an end-to-end object-oriented network that adopts a\nquestion-conditioned clue discovery module to concentrate audio/visual\nmodalities on respective keywords of the question and designs a\nmodality-conditioned clue collection module to highlight closely associated\naudio segments or visual objects. For model optimization, we propose an\nobject-aware adaptive-positivity learning strategy that selects the highly\nsemantic-matched multi-modal pair as positivity. Specifically, we design two\nobject-aware contrastive loss functions to identify the highly relevant\nquestion-object pairs and audio-object pairs, respectively. These selected\npairs are constrained to have larger similarity values than the mismatched\npairs. The positivity-selecting process is adaptive as the positivity pairs\nselected in each video frame may be different. These two object-aware\nobjectives help the model understand which objects are exactly relevant to the\nquestion and which are making sounds. Extensive experiments on the MUSIC-AVQA\ndataset demonstrate the proposed method is effective in finding favorable\naudio-visual clues and also achieves new state-of-the-art question-answering\nperformance.\n","authors":["Zhangbin Li","Dan Guo","Jinxing Zhou","Jing Zhang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12816v1.pdf","comment":"Accepted by AAAI-2024"},{"id":"http://arxiv.org/abs/2312.12815v1","updated":"2023-12-20T07:34:20Z","published":"2023-12-20T07:34:20Z","title":"OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using\n Semantic Understanding in Mixed Reality","summary":" One key challenge in augmented reality is the placement of virtual content in\nnatural locations. Existing automated techniques are only able to work with a\nclosed-vocabulary, fixed set of objects. In this paper, we introduce a new\nopen-vocabulary method for object placement. Our eight-stage pipeline leverages\nrecent advances in segmentation models, vision-language models, and LLMs to\nplace any virtual object in any AR camera frame or scene. In a preliminary user\nstudy, we show that our method performs at least as well as human experts 57%\nof the time.\n","authors":["Luke Yoffe","Aditya Sharma","Tobias Höllerer"],"pdf_url":"https://arxiv.org/pdf/2312.12815v1.pdf","comment":"IEEE International Symposium on Mixed and Augmented Reality (ISMAR)\n 2023"},{"id":"http://arxiv.org/abs/2308.03108v2","updated":"2023-12-20T07:32:44Z","published":"2023-08-06T13:29:42Z","title":"SAAM: Stealthy Adversarial Attack on Monocular Depth Estimation","summary":" In this paper, we investigate the vulnerability of MDE to adversarial\npatches. We propose a novel \\underline{S}tealthy \\underline{A}dversarial\n\\underline{A}ttacks on \\underline{M}DE (SAAM) that compromises MDE by either\ncorrupting the estimated distance or causing an object to seamlessly blend into\nits surroundings. Our experiments, demonstrate that the designed stealthy patch\nsuccessfully causes a DNN-based MDE to misestimate the depth of objects. In\nfact, our proposed adversarial patch achieves a significant 60\\% depth error\nwith 99\\% ratio of the affected region. Importantly, despite its adversarial\nnature, the patch maintains a naturalistic appearance, making it inconspicuous\nto human observers. We believe that this work sheds light on the threat of\nadversarial attacks in the context of MDE on edge devices. We hope it raises\nawareness within the community about the potential real-life harm of such\nattacks and encourages further research into developing more robust and\nadaptive defense mechanisms.\n","authors":["Amira Guesmi","Muhammad Abdullah Hanif","Bassem Ouni","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2308.03108v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10461v2","updated":"2023-12-20T07:27:27Z","published":"2023-12-16T14:27:06Z","title":"Rethinking the Up-Sampling Operations in CNN-based Generative Network\n for Generalizable Deepfake Detection","summary":" Recently, the proliferation of highly realistic synthetic images, facilitated\nthrough a variety of GANs and Diffusions, has significantly heightened the\nsusceptibility to misuse. While the primary focus of deepfake detection has\ntraditionally centered on the design of detection algorithms, an investigative\ninquiry into the generator architectures has remained conspicuously absent in\nrecent years. This paper contributes to this lacuna by rethinking the\narchitectures of CNN-based generators, thereby establishing a generalized\nrepresentation of synthetic artifacts. Our findings illuminate that the\nup-sampling operator can, beyond frequency-based artifacts, produce generalized\nforgery artifacts. In particular, the local interdependence among image pixels\ncaused by upsampling operators is significantly demonstrated in synthetic\nimages generated by GAN or diffusion. Building upon this observation, we\nintroduce the concept of Neighboring Pixel Relationships(NPR) as a means to\ncapture and characterize the generalized structural artifacts stemming from\nup-sampling operations. A comprehensive analysis is conducted on an open-world\ndataset, comprising samples generated by \\tft{28 distinct generative models}.\nThis analysis culminates in the establishment of a novel state-of-the-art\nperformance, showcasing a remarkable \\tft{11.6\\%} improvement over existing\nmethods. The code is available at\nhttps://github.com/chuangchuangtan/NPR-DeepfakeDetection.\n","authors":["Chuangchuang Tan","Huan Liu","Yao Zhao","Shikui Wei","Guanghua Gu","Ping Liu","Yunchao Wei"],"pdf_url":"https://arxiv.org/pdf/2312.10461v2.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.11562v2","updated":"2023-12-20T07:25:58Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v2.pdf","comment":"20 Figures, 159 Pages, 740 References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2303.11938v2","updated":"2023-12-20T07:12:06Z","published":"2023-03-21T15:38:26Z","title":"3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion","summary":" We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs\n(NeRFs that generate 3D objects given input latent code). Recent works such as\nDreamFusion and Magic3D have shown great success in generating 3D content using\nNeRFs and text prompts, but the current approach of optimizing a NeRF for every\ntext prompt is 1) extremely time-consuming and 2) often leads to low-resolution\noutputs. To address these challenges, we propose a novel method named\n3D-CLFusion which leverages the pre-trained latent-based NeRFs and performs\nfast 3D content creation in less than a minute. In particular, we introduce a\nlatent diffusion prior network for learning the w latent from the input CLIP\ntext/image embeddings. This pipeline allows us to produce the w latent without\nfurther optimization during inference and the pre-trained NeRF is able to\nperform multi-view high-resolution 3D synthesis based on the latent. We note\nthat the novelty of our model lies in that we introduce contrastive learning\nduring training the diffusion prior which enables the generation of the valid\nview-invariant latent code. We demonstrate through experiments the\neffectiveness of our proposed view-invariant diffusion process for fast\ntext-to-3D creation, e.g., 100 times faster than DreamFusion. We note that our\nmodel is able to serve as the role of a plug-and-play tool for text-to-3D with\npre-trained NeRFs.\n","authors":["Yu-Jhe Li","Tao Xu","Ji Hou","Bichen Wu","Xiaoliang Dai","Albert Pumarola","Peizhao Zhang","Peter Vajda","Kris Kitani"],"pdf_url":"https://arxiv.org/pdf/2303.11938v2.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2305.16172v2","updated":"2023-12-20T07:10:27Z","published":"2023-05-25T15:31:02Z","title":"Masked and Permuted Implicit Context Learning for Scene Text Recognition","summary":" Scene Text Recognition (STR) is difficult because of the variations in text\nstyles, shapes, and backgrounds. Though the integration of linguistic\ninformation enhances models' performance, existing methods based on either\npermuted language modeling (PLM) or masked language modeling (MLM) have their\npitfalls. PLM's autoregressive decoding lacks foresight into subsequent\ncharacters, while MLM overlooks inter-character dependencies. Addressing these\nproblems, we propose a masked and permuted implicit context learning network\nfor STR, which unifies PLM and MLM within a single decoder, inheriting the\nadvantages of both approaches. We utilize the training procedure of PLM, and to\nintegrate MLM, we incorporate word length information into the decoding process\nand replace the undetermined characters with mask tokens. Besides, perturbation\ntraining is employed to train a more robust model against potential length\nprediction errors. Our empirical evaluations demonstrate the performance of our\nmodel. It not only achieves superior performance on the common benchmarks but\nalso achieves a substantial improvement of $9.1\\%$ on the more challenging\nUnion14M-Benchmark.\n","authors":["Xiaomeng Yang","Zhi Qiao","Jin Wei","Dongbao Yang","Yu Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.16172v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12807v1","updated":"2023-12-20T07:04:33Z","published":"2023-12-20T07:04:33Z","title":"All but One: Surgical Concept Erasing with Model Preservation in\n Text-to-Image Diffusion Models","summary":" Text-to-Image models such as Stable Diffusion have shown impressive image\ngeneration synthesis, thanks to the utilization of large-scale datasets.\nHowever, these datasets may contain sexually explicit, copyrighted, or\nundesirable content, which allows the model to directly generate them. Given\nthat retraining these large models on individual concept deletion requests is\ninfeasible, fine-tuning algorithms have been developed to tackle concept\nerasing in diffusion models. While these algorithms yield good concept erasure,\nthey all present one of the following issues: 1) the corrupted feature space\nyields synthesis of disintegrated objects, 2) the initially synthesized content\nundergoes a divergence in both spatial structure and semantics in the generated\nimages, and 3) sub-optimal training updates heighten the model's susceptibility\nto utility harm. These issues severely degrade the original utility of\ngenerative models. In this work, we present a new approach that solves all of\nthese challenges. We take inspiration from the concept of classifier guidance\nand propose a surgical update on the classifier guidance term while\nconstraining the drift of the unconditional score term. Furthermore, our\nalgorithm empowers the user to select an alternative to the erasing concept,\nallowing for more controllability. Our experimental results show that our\nalgorithm not only erases the target concept effectively but also preserves the\nmodel's generation capability.\n","authors":["Seunghoo Hong","Juhun Lee","Simon S. Woo"],"pdf_url":"https://arxiv.org/pdf/2312.12807v1.pdf","comment":"Main paper with supplementary materials"},{"id":"http://arxiv.org/abs/2312.12804v1","updated":"2023-12-20T06:52:38Z","published":"2023-12-20T06:52:38Z","title":"Multi-stages attention Breast cancer classification based on nonlinear\n spiking neural P neurons with autapses","summary":" Breast cancer(BC) is a prevalent type of malignant tumor in women. Early\ndiagnosis and treatment are vital for enhancing the patients' survival rate.\nDownsampling in deep networks may lead to loss of information, so for\ncompensating the detail and edge information and allowing convolutional neural\nnetworks to pay more attention to seek the lesion region, we propose a\nmulti-stages attention architecture based on NSNP neurons with autapses. First,\nunlike the single-scale attention acquisition methods of existing methods, we\nset up spatial attention acquisition at each feature map scale of the\nconvolutional network to obtain an fusion global information on attention\nguidance. Then we introduce a new type of NSNP variants called NSNP neurons\nwith autapses. Specifically, NSNP systems are modularized as feature encoders,\nrecoding the features extracted from convolutional neural network as well as\nthe fusion of attention information and preserve the key characteristic\nelements in feature maps. This ensures the retention of valuable data while\ngradually transforming high-dimensional complicated info into low-dimensional\nones. The proposed method is evaluated on the public dataset BreakHis at\nvarious magnifications and classification tasks. It achieves a classification\naccuracy of 96.32% at all magnification cases, outperforming state-of-the-art\nmethods. Ablation studies are also performed, verifying the proposed model's\nefficacy. The source code is available at\nXhuBobYoung/Breast-cancer-Classification.\n","authors":["Bo Yang","Hong Peng","Xiaohui Luo","Jun Wang","Xianzhong Long"],"pdf_url":"https://arxiv.org/pdf/2312.12804v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12789v1","updated":"2023-12-20T06:22:21Z","published":"2023-12-20T06:22:21Z","title":"SLP-Net:An efficient lightweight network for segmentation of skin\n lesions","summary":" Prompt treatment for melanoma is crucial. To assist physicians in identifying\nlesion areas precisely in a quick manner, we propose a novel skin lesion\nsegmentation technique namely SLP-Net, an ultra-lightweight segmentation\nnetwork based on the spiking neural P(SNP) systems type mechanism. Most\nexisting convolutional neural networks achieve high segmentation accuracy while\nneglecting the high hardware cost. SLP-Net, on the contrary, has a very small\nnumber of parameters and a high computation speed. We design a lightweight\nmulti-scale feature extractor without the usual encoder-decoder structure.\nRather than a decoder, a feature adaptation module is designed to replace it\nand implement multi-scale information decoding. Experiments at the ISIC2018\nchallenge demonstrate that the proposed model has the highest Acc and DSC among\nthe state-of-the-art methods, while experiments on the PH2 dataset also\ndemonstrate a favorable generalization ability. Finally, we compare the\ncomputational complexity as well as the computational speed of the models in\nexperiments, where SLP-Net has the highest overall superiority\n","authors":["Bo Yang","Hong Peng","Chenggang Guo","Xiaohui Luo","Jun Wang","Xianzhong Long"],"pdf_url":"https://arxiv.org/pdf/2312.12789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.15563v2","updated":"2023-12-20T06:18:19Z","published":"2022-10-27T15:53:38Z","title":"Multimodal Transformer Distillation for Audio-Visual Synchronization","summary":" Audio-visual synchronization aims to determine whether the mouth movements\nand speech in the video are synchronized. VocaLiST reaches state-of-the-art\nperformance by incorporating multimodal Transformers to model audio-visual\ninteract information. However, it requires high computing resources, making it\nimpractical for real-world applications. This paper proposed an MTDVocaLiST\nmodel, which is trained by our proposed multimodal Transformer distillation\n(MTD) loss. MTD loss enables MTDVocaLiST model to deeply mimic the\ncross-attention distribution and value-relation in the Transformer of VocaLiST.\nAdditionally, we harness uncertainty weighting to fully exploit the interaction\ninformation across all layers. Our proposed method is effective in two aspects:\nFrom the distillation method perspective, MTD loss outperforms other strong\ndistillation baselines. From the distilled model's performance perspective: 1)\nMTDVocaLiST outperforms similar-size SOTA models, SyncNet, and Perfect Match\nmodels by 15.65% and 3.35%; 2) MTDVocaLiST reduces the model size of VocaLiST\nby 83.52%, yet still maintaining similar performance.\n","authors":["Xuanjun Chen","Haibin Wu","Chung-Che Wang","Hung-yi Lee","Jyh-Shing Roger Jang"],"pdf_url":"https://arxiv.org/pdf/2210.15563v2.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2308.13739v2","updated":"2023-12-20T05:55:10Z","published":"2023-08-26T02:55:12Z","title":"Devignet: High-Resolution Vignetting Removal via a Dual Aggregated\n Fusion Transformer With Adaptive Channel Expansion","summary":" Vignetting commonly occurs as a degradation in images resulting from factors\nsuch as lens design, improper lens hood usage, and limitations in camera\nsensors. This degradation affects image details, color accuracy, and presents\nchallenges in computational photography. Existing vignetting removal algorithms\npredominantly rely on ideal physics assumptions and hand-crafted parameters,\nresulting in the ineffective removal of irregular vignetting and suboptimal\nresults. Moreover, the substantial lack of real-world vignetting datasets\nhinders the objective and comprehensive evaluation of vignetting removal. To\naddress these challenges, we present Vigset, a pioneering dataset for\nvignetting removal. Vigset includes 983 pairs of both vignetting and\nvignetting-free high-resolution ($5340\\times3697$) real-world images under\nvarious conditions. In addition, We introduce DeVigNet, a novel frequency-aware\nTransformer architecture designed for vignetting removal. Through the Laplacian\nPyramid decomposition, we propose the Dual Aggregated Fusion Transformer to\nhandle global features and remove vignetting in the low-frequency domain.\nAdditionally, we propose the Adaptive Channel Expansion Module to enhance\ndetails in the high-frequency domain. The experiments demonstrate that the\nproposed model outperforms existing state-of-the-art methods. The code, models,\nand dataset are available at \\url{https://github.com/CXH-Research/DeVigNet}.\n","authors":["Shenghong Luo","Xuhang Chen","Weiwen Chen","Zinuo Li","Shuqiang Wang","Chi-Man Pun"],"pdf_url":"https://arxiv.org/pdf/2308.13739v2.pdf","comment":"Accepted by AAAI Conference on Artificial Intelligence 2024 (AAAI\n 2024)"},{"id":"http://arxiv.org/abs/2305.10701v3","updated":"2023-12-20T05:52:41Z","published":"2023-05-18T04:28:47Z","title":"Personalization as a Shortcut for Few-Shot Backdoor Attack against\n Text-to-Image Diffusion Models","summary":" Although recent personalization methods have democratized high-resolution\nimage synthesis by enabling swift concept acquisition with minimal examples and\nlightweight computation, they also present an exploitable avenue for high\naccessible backdoor attacks. This paper investigates a critical and unexplored\naspect of text-to-image (T2I) diffusion models - their potential vulnerability\nto backdoor attacks via personalization. Our study focuses on a zero-day\nbackdoor vulnerability prevalent in two families of personalization methods,\nepitomized by Textual Inversion and DreamBooth.Compared to traditional backdoor\nattacks, our proposed method can facilitate more precise, efficient, and easily\naccessible attacks with a lower barrier to entry. We provide a comprehensive\nreview of personalization in T2I diffusion models, highlighting the operation\nand exploitation potential of this backdoor vulnerability. To be specific, by\nstudying the prompt processing of Textual Inversion and DreamBooth, we have\ndevised dedicated backdoor attacks according to the different ways of dealing\nwith unseen tokens and analyzed the influence of triggers and concept images on\nthe attack effect. Through comprehensive empirical study, we endorse the\nutilization of the nouveau-token backdoor attack due to its impressive\neffectiveness, stealthiness, and integrity, markedly outperforming the\nlegacy-token backdoor attack.\n","authors":["Yihao Huang","Felix Juefei-Xu","Qing Guo","Jie Zhang","Yutong Wu","Ming Hu","Tianlin Li","Geguang Pu","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2305.10701v3.pdf","comment":"16 pages, accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12096v2","updated":"2023-12-20T05:21:26Z","published":"2023-12-19T12:19:20Z","title":"DLCA-Recon: Dynamic Loose Clothing Avatar Reconstruction from Monocular\n Videos","summary":" Reconstructing a dynamic human with loose clothing is an important but\ndifficult task. To address this challenge, we propose a method named DLCA-Recon\nto create human avatars from monocular videos. The distance from loose clothing\nto the underlying body rapidly changes in every frame when the human freely\nmoves and acts. Previous methods lack effective geometric initialization and\nconstraints for guiding the optimization of deformation to explain this\ndramatic change, resulting in the discontinuous and incomplete reconstruction\nsurface. To model the deformation more accurately, we propose to initialize an\nestimated 3D clothed human in the canonical space, as it is easier for\ndeformation fields to learn from the clothed human than from SMPL. With both\nrepresentations of explicit mesh and implicit SDF, we utilize the physical\nconnection information between consecutive frames and propose a dynamic\ndeformation field (DDF) to optimize deformation fields. DDF accounts for\ncontributive forces on loose clothing to enhance the interpretability of\ndeformations and effectively capture the free movement of loose clothing.\nMoreover, we propagate SMPL skinning weights to each individual and refine pose\nand skinning weights during the optimization to improve skinning\ntransformation. Based on more reasonable initialization and DDF, we can\nsimulate real-world physics more accurately. Extensive experiments on public\nand our own datasets validate that our method can produce superior results for\nhumans with loose clothing compared to the SOTA methods.\n","authors":["Chunjie Luo","Fei Luo","Yusen Wang","Enxu Zhao","Chunxia Xiao"],"pdf_url":"https://arxiv.org/pdf/2312.12096v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12773v1","updated":"2023-12-20T05:17:06Z","published":"2023-12-20T05:17:06Z","title":"Segmenting Messy Text: Detecting Boundaries in Text Derived from\n Historical Newspaper Images","summary":" Text segmentation, the task of dividing a document into sections, is often a\nprerequisite for performing additional natural language processing tasks.\nExisting text segmentation methods have typically been developed and tested\nusing clean, narrative-style text with segments containing distinct topics.\nHere we consider a challenging text segmentation task: dividing newspaper\nmarriage announcement lists into units of one announcement each. In many cases\nthe information is not structured into sentences, and adjacent segments are not\ntopically distinct from each other. In addition, the text of the announcements,\nwhich is derived from images of historical newspapers via optical character\nrecognition, contains many typographical errors. As a result, these\nannouncements are not amenable to segmentation with existing techniques. We\npresent a novel deep learning-based model for segmenting such text and show\nthat it significantly outperforms an existing state-of-the-art method on our\ntask.\n","authors":["Carol Anderson","Phil Crone"],"pdf_url":"https://arxiv.org/pdf/2312.12773v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.12768v1","updated":"2023-12-20T05:06:01Z","published":"2023-12-20T05:06:01Z","title":"Mutual-modality Adversarial Attack with Semantic Perturbation","summary":" Adversarial attacks constitute a notable threat to machine learning systems,\ngiven their potential to induce erroneous predictions and classifications.\nHowever, within real-world contexts, the essential specifics of the deployed\nmodel are frequently treated as a black box, consequently mitigating the\nvulnerability to such attacks. Thus, enhancing the transferability of the\nadversarial samples has become a crucial area of research, which heavily relies\non selecting appropriate surrogate models. To address this challenge, we\npropose a novel approach that generates adversarial attacks in a\nmutual-modality optimization scheme. Our approach is accomplished by leveraging\nthe pre-trained CLIP model. Firstly, we conduct a visual attack on the clean\nimage that causes semantic perturbations on the aligned embedding space with\nthe other textual modality. Then, we apply the corresponding defense on the\ntextual modality by updating the prompts, which forces the re-matching on the\nperturbed embedding space. Finally, to enhance the attack transferability, we\nutilize the iterative training strategy on the visual attack and the textual\ndefense, where the two processes optimize from each other. We evaluate our\napproach on several benchmark datasets and demonstrate that our mutual-modal\nattack strategy can effectively produce high-transferable attacks, which are\nstable regardless of the target networks. Our approach outperforms\nstate-of-the-art attack methods and can be readily deployed as a plug-and-play\nsolution.\n","authors":["Jingwen Ye","Ruonan Yu","Songhua Liu","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12768v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.07937v3","updated":"2023-12-20T04:50:14Z","published":"2023-12-13T07:30:19Z","title":"BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics","summary":" The recently emerging text-to-motion advances have spired numerous attempts\nfor convenient and interactive human motion generation. Yet, existing methods\nare largely limited to generating body motions only without considering the\nrich two-hand motions, let alone handling various conditions like body dynamics\nor texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal\ndataset for two-hand motion generation. Our dataset includes accurate motion\ntracking for the human body and hands and provides pair-wised finger-level hand\nannotations and body descriptions. We further provide a strong baseline method,\nBOTH2Hands, for the novel task: generating vivid two-hand motions from both\nimplicit body dynamics and explicit text prompts. We first warm up two parallel\nbody-to-hand and text-to-hand diffusion models and then utilize the\ncross-attention transformer for motion blending. Extensive experiments and\ncross-validations demonstrate the effectiveness of our approach and dataset for\ngenerating convincing two-hand motions from the hybrid body-and-textual\nconditions. Our dataset and code will be disseminated to the community for\nfuture research.\n","authors":["Wenqian Zhang","Molin Huang","Yuxuan Zhou","Juze Zhang","Jingyi Yu","Jingya Wang","Lan Xu"],"pdf_url":"https://arxiv.org/pdf/2312.07937v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12763v1","updated":"2023-12-20T04:49:45Z","published":"2023-12-20T04:49:45Z","title":"AMD:Anatomical Motion Diffusion with Interpretable Motion Decomposition\n and Fusion","summary":" Generating realistic human motion sequences from text descriptions is a\nchallenging task that requires capturing the rich expressiveness of both\nnatural language and human motion.Recent advances in diffusion models have\nenabled significant progress in human motion synthesis.However, existing\nmethods struggle to handle text inputs that describe complex or long motions.In\nthis paper, we propose the Adaptable Motion Diffusion (AMD) model, which\nleverages a Large Language Model (LLM) to parse the input text into a sequence\nof concise and interpretable anatomical scripts that correspond to the target\nmotion.This process exploits the LLM's ability to provide anatomical guidance\nfor complex motion synthesis.We then devise a two-branch fusion scheme that\nbalances the influence of the input text and the anatomical scripts on the\ninverse diffusion process, which adaptively ensures the semantic fidelity and\ndiversity of the synthesized motion.Our method can effectively handle texts\nwith complex or long motion descriptions, where existing methods often fail.\nExperiments on datasets with relatively more complex motions, such as CLCD1 and\nCLCD2, demonstrate that our AMD significantly outperforms existing\nstate-of-the-art models.\n","authors":["Beibei Jing","Youjia Zhang","Zikai Song","Junqing Yu","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2312.12763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08866v2","updated":"2023-12-20T04:31:00Z","published":"2023-12-14T12:41:08Z","title":"MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention","summary":" Efficiently capturing multi-scale information and building long-range\ndependencies among pixels are essential for medical image segmentation because\nof the various sizes and shapes of the lesion regions or organs. In this paper,\nwe present Multi-scale Cross-axis Attention (MCA) to solve the above\nchallenging issues based on the efficient axial attention. Instead of simply\nconnecting axial attention along the horizontal and vertical directions\nsequentially, we propose to calculate dual cross attentions between two\nparallel axial attentions to capture global information better. To process the\nsignificant variations of lesion regions or organs in individual sizes and\nshapes, we also use multiple convolutions of strip-shape kernels with different\nkernel sizes in each axial attention path to improve the efficiency of the\nproposed MCA in encoding spatial information. We build the proposed MCA upon\nthe MSCAN backbone, yielding our network, termed MCANet. Our MCANet with only\n4M+ parameters performs even better than most previous works with heavy\nbackbones (e.g., Swin Transformer) on four challenging tasks, including skin\nlesion segmentation, nuclei segmentation, abdominal multi-organ segmentation,\nand polyp segmentation. Code is available at\nhttps://github.com/haoshao-nku/medical_seg.\n","authors":["Hao Shao","Quansheng Zeng","Qibin Hou","Jufeng Yang"],"pdf_url":"https://arxiv.org/pdf/2312.08866v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12754v1","updated":"2023-12-20T04:27:13Z","published":"2023-12-20T04:27:13Z","title":"Spectral Prompt Tuning:Unveiling Unseen Classes for Zero-Shot Semantic\n Segmentation","summary":" Recently, CLIP has found practical utility in the domain of pixel-level\nzero-shot segmentation tasks. The present landscape features two-stage\nmethodologies beset by issues such as intricate pipelines and elevated\ncomputational costs. While current one-stage approaches alleviate these\nconcerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's\ngeneralization capacity, they still fall short in fully harnessing CLIP's\npotential for pixel-level unseen class demarcation and precise pixel\npredictions. To further stimulate CLIP's zero-shot dense prediction capability,\nwe propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from\nimage to pixel. Specifically, we initially introduce Spectral Prompt Tuning\n(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow\nlayers to capture structural intricacies of images, thereby enhancing\ncomprehension of unseen classes. Subsequently, we introduce the Spectral Guided\nDecoder (SGD), utilizing both high and low-frequency information to steer the\nnetwork's spatial focus towards more prominent classification features,\nenabling precise pixel-level prediction outcomes. Through extensive experiments\non two public datasets, we demonstrate the superiority of our method over\nstate-of-the-art approaches, performing well across all classes and\nparticularly excelling in handling unseen classes. Code is available\nat:https://github.com/clearxu/SPT.\n","authors":["Wenhao Xu","Rongtao Xu","Changwei Wang","Shibiao Xu","Li Guo","Man Zhang","Xiaopeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12754v1.pdf","comment":"AAAI2024 Accepted"},{"id":"http://arxiv.org/abs/2306.12045v6","updated":"2023-12-20T04:22:24Z","published":"2023-06-21T06:30:18Z","title":"Temporal Conditioning Spiking Latent Variable Models of the Neural\n Response to Natural Visual Scenes","summary":" Developing computational models of neural response is crucial for\nunderstanding sensory processing and neural computations. Current\nstate-of-the-art neural network methods use temporal filters to handle temporal\ndependencies, resulting in an unrealistic and inflexible processing paradigm.\nMeanwhile, these methods target trial-averaged firing rates and fail to capture\nimportant features in spike trains. This work presents the temporal\nconditioning spiking latent variable models (TeCoS-LVM) to simulate the neural\nresponse to natural visual stimuli. We use spiking neurons to produce spike\noutputs that directly match the recorded trains. This approach helps to avoid\nlosing information embedded in the original spike trains. We exclude the\ntemporal dimension from the model parameter space and introduce a temporal\nconditioning operation to allow the model to adaptively explore and exploit\ntemporal dependencies in stimuli sequences in a {\\it natural paradigm}. We show\nthat TeCoS-LVM models can produce more realistic spike activities and\naccurately fit spike statistics than powerful alternatives. Additionally,\nlearned TeCoS-LVM models can generalize well to longer time scales. Overall,\nwhile remaining computationally tractable, our model effectively captures key\nfeatures of neural coding systems. It thus provides a useful tool for building\naccurate predictive computational accounts for various sensory perception\ncircuits.\n","authors":["Gehua Ma","Runhao Jiang","Rui Yan","Huajin Tang"],"pdf_url":"https://arxiv.org/pdf/2306.12045v6.pdf","comment":"Accepted at NeurIPS 2023\n (https://openreview.net/forum?id=V4YeOvsQfu). 22 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2312.12263v2","updated":"2023-12-20T03:59:45Z","published":"2023-12-19T15:46:47Z","title":"FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy\n Labels","summary":" Federated learning with noisy labels (F-LNL) aims at seeking an optimal\nserver model via collaborative distributed learning by aggregating multiple\nclient models trained with local noisy or clean samples. On the basis of a\nfederated learning framework, recent advances primarily adopt label noise\nfiltering to separate clean samples from noisy ones on each client, thereby\nmitigating the negative impact of label noise. However, these prior methods do\nnot learn noise filters by exploiting knowledge across all clients, leading to\nsub-optimal and inferior noise filtering performance and thus damaging training\nstability. In this paper, we present FedDiv to tackle the challenges of F-LNL.\nSpecifically, we propose a global noise filter called Federated Noise Filter\nfor effectively identifying samples with noisy labels on every client, thereby\nraising stability during local training sessions. Without sacrificing data\nprivacy, this is achieved by modeling the global distribution of label noise\nacross all clients. Then, in an effort to make the global model achieve higher\nperformance, we introduce a Predictive Consistency based Sampler to identify\nmore credible local data for local model training, thus preventing noise\nmemorization and further boosting the training stability. Extensive experiments\non CIFAR-10, CIFAR-100, and Clothing1M demonstrate that \\texttt{FedDiv}\nachieves superior performance over state-of-the-art F-LNL methods under\ndifferent label noise settings for both IID and non-IID data partitions. Source\ncode is publicly available at https://github.com/lijichang/FLNL-FedDiv.\n","authors":["Jichang Li","Guanbin Li","Hui Cheng","Zicheng Liao","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2312.12263v2.pdf","comment":"To appear in AAAI-2024; correct minor typos"},{"id":"http://arxiv.org/abs/2308.12535v2","updated":"2023-12-20T03:46:13Z","published":"2023-08-24T03:44:05Z","title":"SCP: Spherical-Coordinate-based Learned Point Cloud Compression","summary":" In recent years, the task of learned point cloud compression has gained\nprominence. An important type of point cloud, the spinning LiDAR point cloud,\nis generated by spinning LiDAR on vehicles. This process results in numerous\ncircular shapes and azimuthal angle invariance features within the point\nclouds. However, these two features have been largely overlooked by previous\nmethodologies. In this paper, we introduce a model-agnostic method called\nSpherical-Coordinate-based learned Point cloud compression (SCP), designed to\nleverage the aforementioned features fully. Additionally, we propose a\nmulti-level Octree for SCP to mitigate the reconstruction error for distant\nareas within the Spherical-coordinate-based Octree. SCP exhibits excellent\nuniversality, making it applicable to various learned point cloud compression\ntechniques. Experimental results demonstrate that SCP surpasses previous\nstate-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.\n","authors":["Ao Luo","Linxin Song","Keisuke Nonaka","Kyohei Unno","Heming Sun","Masayuki Goto","Jiro Katto"],"pdf_url":"https://arxiv.org/pdf/2308.12535v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12743v1","updated":"2023-12-20T03:34:48Z","published":"2023-12-20T03:34:48Z","title":"PointeNet: A Lightweight Framework for Effective and Efficient Point\n Cloud Analysis","summary":" Current methodologies in point cloud analysis predominantly explore 3D\ngeometries, often achieved through the introduction of intricate learnable\ngeometric extractors in the encoder or by deepening networks with repeated\nblocks. However, these approaches inevitably lead to a significant number of\nlearnable parameters, resulting in substantial computational costs and imposing\nmemory burdens on CPU/GPU. Additionally, the existing strategies are primarily\ntailored for object-level point cloud classification and segmentation tasks,\nwith limited extensions to crucial scene-level applications, such as autonomous\ndriving. In response to these limitations, we introduce PointeNet, an efficient\nnetwork designed specifically for point cloud analysis. PointeNet distinguishes\nitself with its lightweight architecture, low training cost, and plug-and-play\ncapability, effectively capturing representative features. The network consists\nof a Multivariate Geometric Encoding (MGE) module and an optional\nDistance-aware Semantic Enhancement (DSE) module. The MGE module employs\noperations of sampling, grouping, and multivariate geometric aggregation to\nlightweightly capture and adaptively aggregate multivariate geometric features,\nproviding a comprehensive depiction of 3D geometries. The DSE module, designed\nfor real-world autonomous driving scenarios, enhances the semantic perception\nof point clouds, particularly for distant points. Our method demonstrates\nflexibility by seamlessly integrating with a classification/segmentation head\nor embedding into off-the-shelf 3D object detection networks, achieving notable\nperformance improvements at a minimal cost. Extensive experiments on\nobject-level datasets, including ModelNet40, ScanObjectNN, ShapeNetPart, and\nthe scene-level dataset KITTI, demonstrate the superior performance of\nPointeNet over state-of-the-art methods in point cloud analysis.\n","authors":["Lipeng Gu","Xuefeng Yan","Liangliang Nan","Dingkun Zhu","Honghua Chen","Weiming Wang","Mingqiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12743v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12742v1","updated":"2023-12-20T03:30:51Z","published":"2023-12-20T03:30:51Z","title":"Cached Transformers: Improving Transformers with Differentiable Memory\n Cache","summary":" This work introduces a new Transformer model called Cached Transformer, which\nuses Gated Recurrent Cached (GRC) attention to extend the self-attention\nmechanism with a differentiable memory cache of tokens. GRC attention enables\nattending to both past and current tokens, increasing the receptive field of\nattention and allowing for exploring long-range dependencies. By utilizing a\nrecurrent gating unit to continuously update the cache, our model achieves\nsignificant advancements in \\textbf{six} language and vision tasks, including\nlanguage modeling, machine translation, ListOPs, image classification, object\ndetection, and instance segmentation. Furthermore, our approach surpasses\nprevious memory-based techniques in tasks such as language modeling and\ndisplays the ability to be applied to a broader range of situations.\n","authors":["Zhaoyang Zhang","Wenqi Shao","Yixiao Ge","Xiaogang Wang","Jinwei Gu","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2312.12742v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2206.07207v3","updated":"2023-12-20T03:22:02Z","published":"2022-06-14T23:24:15Z","title":"Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across\n Modalities","summary":" Events describe happenings in our world that are of importance. Naturally,\nunderstanding events mentioned in multimedia content and how they are related\nforms an important way of comprehending our world. Existing literature can\ninfer if events across textual and visual (video) domains are identical (via\ngrounding) and thus, on the same semantic level. However, grounding fails to\ncapture the intricate cross-event relations that exist due to the same events\nbeing referred to on many semantic levels. For example, in Figure 1, the\nabstract event of \"war\" manifests at a lower semantic level through subevents\n\"tanks firing\" (in video) and airplane \"shot\" (in text), leading to a\nhierarchical, multimodal relationship between the events.\n In this paper, we propose the task of extracting event hierarchies from\nmultimodal (video and text) data to capture how the same event manifests itself\nin different modalities at different semantic levels. This reveals the\nstructure of events and is critical to understanding them. To support research\non this task, we introduce the Multimodal Hierarchical Events (MultiHiEve)\ndataset. Unlike prior video-language datasets, MultiHiEve is composed of news\nvideo-article pairs, which makes it rich in event hierarchies. We densely\nannotate a part of the dataset to construct the test benchmark. We show the\nlimitations of state-of-the-art unimodal and multimodal baselines on this task.\nFurther, we address these limitations via a new weakly supervised model,\nleveraging only unannotated video-article pairs from MultiHiEve. We perform a\nthorough evaluation of our proposed method which demonstrates improved\nperformance on this task and highlight opportunities for future research.\n","authors":["Hammad A. Ayyubi","Christopher Thomas","Lovish Chum","Rahul Lokesh","Long Chen","Yulei Niu","Xudong Lin","Xuande Feng","Jaywon Koo","Sounak Ray","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2206.07207v3.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12735v1","updated":"2023-12-20T03:16:34Z","published":"2023-12-20T03:16:34Z","title":"MetaSegNet: Metadata-collaborative Vision-Language Representation\n Learning for Semantic Segmentation of Remote Sensing Images","summary":" Semantic segmentation of remote sensing images plays a vital role in a wide\nrange of Earth Observation (EO) applications, such as land use land cover\nmapping, environment monitoring, and sustainable development. Driven by rapid\ndevelopments in Artificial Intelligence (AI), deep learning (DL) has emerged as\nthe mainstream tool for semantic segmentation and achieved many breakthroughs\nin the field of remote sensing. However, the existing DL-based methods mainly\nfocus on unimodal visual data while ignoring the rich multimodal information\ninvolved in the real world, usually demonstrating weak reliability and\ngenerlization. Inspired by the success of Vision Transformers and large\nlanguage models, we propose a novel metadata-collaborative multimodal\nsegmentation network (MetaSegNet) that applies vision-language representation\nlearning for semantic segmentation of remote sensing images. Unlike the common\nmodel structure that only uses unimodal visual data, we extract the key\ncharacteristic (i.e. the climate zone) from freely available remote sensing\nimage metadata and transfer it into knowledge-based text prompts via the\ngeneric ChatGPT. Then, we construct an image encoder, a text encoder and a\ncrossmodal attention fusion subnetwork to extract the image and text feature\nand apply image-text interaction. Benefiting from such a design, the proposed\nMetaSegNet demonstrates superior generalization and achieves competitive\naccuracy with state-of-the-art semantic segmentation methods on the large-scale\nOpenEarthMap dataset (68.6% mIoU) and Potsdam dataset (93.3% mean F1 score) as\nwell as LoveDA dataset (52.2% mIoU).\n","authors":["Libo Wang","Sijun Dong","Ying Chen","Xiaoliang Meng","Shenghui Fang"],"pdf_url":"https://arxiv.org/pdf/2312.12735v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11841v2","updated":"2023-12-20T03:14:40Z","published":"2023-12-19T04:14:11Z","title":"MixRT: Mixed Neural Representations For Real-Time NeRF Rendering","summary":" Neural Radiance Field (NeRF) has emerged as a leading technique for novel\nview synthesis, owing to its impressive photorealistic reconstruction and\nrendering capability. Nevertheless, achieving real-time NeRF rendering in\nlarge-scale scenes has presented challenges, often leading to the adoption of\neither intricate baked mesh representations with a substantial number of\ntriangles or resource-intensive ray marching in baked representations. We\nchallenge these conventions, observing that high-quality geometry, represented\nby meshes with substantial triangles, is not necessary for achieving\nphotorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF\nrepresentation that includes a low-quality mesh, a view-dependent displacement\nmap, and a compressed NeRF model. This design effectively harnesses the\ncapabilities of existing graphics hardware, thus enabling real-time NeRF\nrendering on edge devices. Leveraging a highly-optimized WebGL-based rendering\nframework, our proposed MixRT attains real-time rendering speeds on edge\ndevices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),\nbetter rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360\ndatasets), and a smaller storage size (less than 80% compared to\nstate-of-the-art methods).\n","authors":["Chaojian Li","Bichen Wu","Peter Vajda"," Yingyan"," Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11841v2.pdf","comment":"Accepted by 3DV'24. Project Page: https://licj15.github.io/MixRT/"},{"id":"http://arxiv.org/abs/2309.10689v2","updated":"2023-12-20T03:06:10Z","published":"2023-09-19T15:23:52Z","title":"ReShader: View-Dependent Highlights for Single Image View-Synthesis","summary":" In recent years, novel view synthesis from a single image has seen\nsignificant progress thanks to the rapid advancements in 3D scene\nrepresentation and image inpainting techniques. While the current approaches\nare able to synthesize geometrically consistent novel views, they often do not\nhandle the view-dependent effects properly. Specifically, the highlights in\ntheir synthesized images usually appear to be glued to the surfaces, making the\nnovel views unrealistic. To address this major problem, we make a key\nobservation that the process of synthesizing novel views requires changing the\nshading of the pixels based on the novel camera, and moving them to appropriate\nlocations. Therefore, we propose to split the view synthesis process into two\nindependent tasks of pixel reshading and relocation. During the reshading\nprocess, we take the single image as the input and adjust its shading based on\nthe novel camera. This reshaded image is then used as the input to an existing\nview synthesis method to relocate the pixels and produce the final novel view\nimage. We propose to use a neural network to perform reshading and generate a\nlarge set of synthetic input-reshaded pairs to train our network. We\ndemonstrate that our approach produces plausible novel view images with\nrealistic moving highlights on a variety of real world scenes.\n","authors":["Avinash Paliwal","Brandon Nguyen","Andrii Tsarov","Nima Khademi Kalantari"],"pdf_url":"https://arxiv.org/pdf/2309.10689v2.pdf","comment":"SIGGRAPH Asia 2023. Project page at\n https://people.engr.tamu.edu/nimak/Papers/SIGAsia2023_Reshader/index.html and\n video at https://www.youtube.com/watch?v=XW-tl48D3Ok"},{"id":"http://arxiv.org/abs/2312.12730v1","updated":"2023-12-20T02:58:25Z","published":"2023-12-20T02:58:25Z","title":"A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models","summary":" Efficient transfer learning (ETL) is receiving increasing attention to adapt\nlarge pre-trained language-vision models on downstream tasks with a few labeled\nsamples. While significant progress has been made, we reveal that\nstate-of-the-art ETL approaches exhibit strong performance only in\nnarrowly-defined experimental setups, and with a careful adjustment of\nhyperparameters based on a large corpus of labeled samples. In particular, we\nmake two interesting, and surprising empirical observations. First, to\noutperform a simple Linear Probing baseline, these methods require to optimize\ntheir hyper-parameters on each target task. And second, they typically\nunderperform -- sometimes dramatically -- standard zero-shot predictions in the\npresence of distributional drifts. Motivated by the unrealistic assumptions\nmade in the existing literature, i.e., access to a large validation set and\ncase-specific grid-search for optimal hyperparameters, we propose a novel\napproach that meets the requirements of real-world scenarios. More concretely,\nwe introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing\nterm is optimized via an adaptation of the general Augmented Lagrangian method\ntailored to this context. We comprehensively evaluate CLAP on a broad span of\ndatasets and scenarios, demonstrating that it consistently outperforms SoTA\napproaches, while yet being a much more efficient alternative.\n","authors":["Julio Silva-Rodriguez","Sina Hajimiri","Ismail Ben Ayed","Jose Dolz"],"pdf_url":"https://arxiv.org/pdf/2312.12730v1.pdf","comment":"Code available at https://github.com/jusiro/CLAP"},{"id":"http://arxiv.org/abs/2312.12729v1","updated":"2023-12-20T02:57:21Z","published":"2023-12-20T02:57:21Z","title":"Segment Anything Model Meets Image Harmonization","summary":" Image harmonization is a crucial technique in image composition that aims to\nseamlessly match the background by adjusting the foreground of composite\nimages. Current methods adopt either global-level or pixel-level feature\nmatching. Global-level feature matching ignores the proximity prior, treating\nforeground and background as separate entities. On the other hand, pixel-level\nfeature matching loses contextual information. Therefore, it is necessary to\nuse the information from semantic maps that describe different objects to guide\nharmonization. In this paper, we propose Semantic-guided Region-aware Instance\nNormalization (SRIN) that can utilize the semantic segmentation maps output by\na pre-trained Segment Anything Model (SAM) to guide the visual consistency\nlearning of foreground and background features. Abundant experiments\ndemonstrate the superiority of our method for image harmonization over\nstate-of-the-art methods.\n","authors":["Haoxing Chen","Yaohui Li","Zhangxuan Gu","Zhuoer Xu","Jun Lan","Huaxiong Li"],"pdf_url":"https://arxiv.org/pdf/2312.12729v1.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.12726v1","updated":"2023-12-20T02:50:03Z","published":"2023-12-20T02:50:03Z","title":"Reducing Shape-Radiance Ambiguity in Radiance Fields with a Closed-Form\n Color Estimation Method","summary":" Neural radiance field (NeRF) enables the synthesis of cutting-edge realistic\nnovel view images of a 3D scene. It includes density and color fields to model\nthe shape and radiance of a scene, respectively. Supervised by the photometric\nloss in an end-to-end training manner, NeRF inherently suffers from the\nshape-radiance ambiguity problem, i.e., it can perfectly fit training views but\ndoes not guarantee decoupling the two fields correctly. To deal with this\nissue, existing works have incorporated prior knowledge to provide an\nindependent supervision signal for the density field, including total variation\nloss, sparsity loss, distortion loss, etc. These losses are based on general\nassumptions about the density field, e.g., it should be smooth, sparse, or\ncompact, which are not adaptive to a specific scene. In this paper, we propose\na more adaptive method to reduce the shape-radiance ambiguity. The key is a\nrendering method that is only based on the density field. Specifically, we\nfirst estimate the color field based on the density field and posed images in a\nclosed form. Then NeRF's rendering process can proceed. We address the problems\nin estimating the color field, including occlusion and non-uniformly\ndistributed views. Afterward, it is applied to regularize NeRF's density field.\nAs our regularization is guided by photometric loss, it is more adaptive\ncompared to existing ones. Experimental results show that our method improves\nthe density field of NeRF both qualitatively and quantitatively. Our code is\navailable at https://github.com/qihangGH/Closed-form-color-field.\n","authors":["Qihang Fang","Yafei Song","Keqiang Li","Liefeng Bo"],"pdf_url":"https://arxiv.org/pdf/2312.12726v1.pdf","comment":"This work has been published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.03373v2","updated":"2023-12-20T02:42:13Z","published":"2023-06-06T03:22:22Z","title":"CiT-Net: Convolutional Neural Networks Hand in Hand with Vision\n Transformers for Medical Image Segmentation","summary":" The hybrid architecture of convolutional neural networks (CNNs) and\nTransformer are very popular for medical image segmentation. However, it\nsuffers from two challenges. First, although a CNNs branch can capture the\nlocal image features using vanilla convolution, it cannot achieve adaptive\nfeature learning. Second, although a Transformer branch can capture the global\nfeatures, it ignores the channel and cross-dimensional self-attention,\nresulting in a low segmentation accuracy on complex-content images. To address\nthese challenges, we propose a novel hybrid architecture of convolutional\nneural networks hand in hand with vision Transformers (CiT-Net) for medical\nimage segmentation. Our network has two advantages. First, we design a dynamic\ndeformable convolution and apply it to the CNNs branch, which overcomes the\nweak feature extraction ability due to fixed-size convolution kernels and the\nstiff design of sharing kernel parameters among different inputs. Second, we\ndesign a shifted-window adaptive complementary attention module and a compact\nconvolutional projection. We apply them to the Transformer branch to learn the\ncross-dimensional long-term dependency for medical images. Experimental results\nshow that our CiT-Net provides better medical image segmentation results than\npopular SOTA methods. Besides, our CiT-Net requires lower parameters and less\ncomputational costs and does not rely on pre-training. The code is publicly\navailable at https://github.com/SR0920/CiT-Net.\n","authors":["Tao Lei","Rui Sun","Xuan Wang","Yingbo Wang","Xi He","Asoke Nandi"],"pdf_url":"https://arxiv.org/pdf/2306.03373v2.pdf","comment":"9 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2312.12723v1","updated":"2023-12-20T02:35:18Z","published":"2023-12-20T02:35:18Z","title":"Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual\n Question Answering","summary":" Visual Question Answering (VQA) has emerged as one of the most challenging\ntasks in artificial intelligence due to its multi-modal nature. However, most\nexisting VQA methods are incapable of handling Knowledge-based Visual Question\nAnswering (KB-VQA), which requires external knowledge beyond visible contents\nto answer questions about a given image. To address this issue, we propose a\nnovel framework that endows the model with capabilities of answering more\ngeneral questions, and achieves a better exploitation of external knowledge\nthrough generating Multiple Clues for Reasoning with Memory Neural Networks\n(MCR-MemNN). Specifically, a well-defined detector is adopted to predict\nimage-question related relation phrases, each of which delivers two\ncomplementary clues to retrieve the supporting facts from external knowledge\nbase (KB), which are further encoded into a continuous embedding space using a\ncontent-addressable memory. Afterwards, mutual interactions between\nvisual-semantic representation and the supporting facts stored in memory are\ncaptured to distill the most relevant information in three modalities (i.e.,\nimage, question, and KB). Finally, the optimal answer is predicted by choosing\nthe supporting fact with the highest score. We conduct extensive experiments on\ntwo widely-used benchmarks. The experimental results well justify the\neffectiveness of MCR-MemNN, as well as its superiority over other KB-VQA\nmethods.\n","authors":["Chengxiang Yin","Zhengping Che","Kun Wu","Zhiyuan Xu","Jian Tang"],"pdf_url":"https://arxiv.org/pdf/2312.12723v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04086v3","updated":"2023-12-20T02:34:49Z","published":"2023-06-07T01:14:16Z","title":"TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for\n Medical Image Segmentation","summary":" The hybrid architecture of convolution neural networks (CNN) and Transformer\nhas been the most popular method for medical image segmentation. However, the\nexisting networks based on the hybrid architecture suffer from two problems.\nFirst, although the CNN branch can capture image local features by using\nconvolution operation, the vanilla convolution is unable to achieve adaptive\nextraction of image features. Second, although the Transformer branch can model\nthe global information of images, the conventional self-attention only focuses\non the spatial self-attention of images and ignores the channel and\ncross-dimensional self-attention leading to low segmentation accuracy for\nmedical images with complex backgrounds. To solve these problems, we propose\nvision Transformer embrace convolutional neural networks for medical image\nsegmentation (TEC-Net). Our network has two advantages. First, dynamic\ndeformable convolution (DDConv) is designed in the CNN branch, which not only\novercomes the difficulty of adaptive feature extraction using fixed-size\nconvolution kernels, but also solves the defect that different inputs share the\nsame convolution kernel parameters, effectively improving the feature\nexpression ability of CNN branch. Second, in the Transformer branch, a\n(shifted)-window adaptive complementary attention module ((S)W-ACAM) and\ncompact convolutional projection are designed to enable the network to fully\nlearn the cross-dimensional long-range dependency of medical images with few\nparameters and calculations. Experimental results show that the proposed\nTEC-Net provides better medical image segmentation results than SOTA methods\nincluding CNN and Transformer networks. In addition, our TEC-Net requires fewer\nparameters and computational costs and does not rely on pre-training. The code\nis publicly available at https://github.com/SR0920/TEC-Net.\n","authors":["Rui Sun","Tao Lei","Weichuan Zhang","Yong Wan","Yong Xia","Asoke K. Nandi"],"pdf_url":"https://arxiv.org/pdf/2306.04086v3.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2306.03373"},{"id":"http://arxiv.org/abs/2312.12722v1","updated":"2023-12-20T02:34:11Z","published":"2023-12-20T02:34:11Z","title":"Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class\n Incremental Learning","summary":" Non-exemplar class incremental learning aims to learn both the new and old\ntasks without accessing any training data from the past. This strict\nrestriction enlarges the difficulty of alleviating catastrophic forgetting\nsince all techniques can only be applied to current task data. Considering this\nchallenge, we propose a novel framework of fine-grained knowledge selection and\nrestoration. The conventional knowledge distillation-based methods place too\nstrict constraints on the network parameters and features to prevent\nforgetting, which limits the training of new tasks. To loose this constraint,\nwe proposed a novel fine-grained selective patch-level distillation to\nadaptively balance plasticity and stability. Some task-agnostic patches can be\nused to preserve the decision boundary of the old task. While some patches\ncontaining the important foreground are favorable for learning the new task.\n Moreover, we employ a task-agnostic mechanism to generate more realistic\nprototypes of old tasks with the current task sample for reducing classifier\nbias for fine-grained knowledge restoration. Extensive experiments on CIFAR100,\nTinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method.\nCode is available at https://github.com/scok30/vit-cil.\n","authors":["Jiang-Tian Zhai","Xialei Liu","Lu Yu","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2312.12722v1.pdf","comment":"to appear at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12721v1","updated":"2023-12-20T02:30:39Z","published":"2023-12-20T02:30:39Z","title":"Cross-Modal Reasoning with Event Correlation for Video Question\n Answering","summary":" Video Question Answering (VideoQA) is a very attractive and challenging\nresearch direction aiming to understand complex semantics of heterogeneous data\nfrom two domains, i.e., the spatio-temporal video content and the word sequence\nin question. Although various attention mechanisms have been utilized to manage\ncontextualized representations by modeling intra- and inter-modal relationships\nof the two modalities, one limitation of the predominant VideoQA methods is the\nlack of reasoning with event correlation, that is, sensing and analyzing\nrelationships among abundant and informative events contained in the video. In\nthis paper, we introduce the dense caption modality as a new auxiliary and\ndistill event-correlated information from it to infer the correct answer. To\nthis end, we propose a novel end-to-end trainable model, Event-Correlated Graph\nNeural Networks (EC-GNNs), to perform cross-modal reasoning over information\nfrom the three modalities (i.e., caption, video, and question). Besides the\nexploitation of a brand new modality, we employ cross-modal reasoning modules\nfor explicitly modeling inter-modal relationships and aggregating relevant\ninformation across different modalities, and we propose a question-guided\nself-adaptive multi-modal fusion module to collect the question-oriented and\nevent-correlated evidence through multi-step reasoning. We evaluate our model\non two widely-used benchmark datasets and conduct an ablation study to justify\nthe effectiveness of each proposed component.\n","authors":["Chengxiang Yin","Zhengping Che","Kun Wu","Zhiyuan Xu","Qinru Qiu","Jian Tang"],"pdf_url":"https://arxiv.org/pdf/2312.12721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12720v1","updated":"2023-12-20T02:29:31Z","published":"2023-12-20T02:29:31Z","title":"AdvST: Revisiting Data Augmentations for Single Domain Generalization","summary":" Single domain generalization (SDG) aims to train a robust model against\nunknown target domain shifts using data from a single source domain. Data\naugmentation has been proven an effective approach to SDG. However, the utility\nof standard augmentations, such as translate, or invert, has not been fully\nexploited in SDG; practically, these augmentations are used as a part of a data\npreprocessing procedure. Although it is intuitive to use many such\naugmentations to boost the robustness of a model to out-of-distribution domain\nshifts, we lack a principled approach to harvest the benefit brought from\nmultiple these augmentations. Here, we conceptualize standard data\naugmentations with learnable parameters as semantics transformations that can\nmanipulate certain semantics of a sample, such as the geometry or color of an\nimage. Then, we propose Adversarial learning with Semantics Transformations\n(AdvST) that augments the source domain data with semantics transformations and\nlearns a robust model with the augmented data. We theoretically show that AdvST\nessentially optimizes a distributionally robust optimization objective defined\non a set of semantics distributions induced by the parameters of semantics\ntransformations. We demonstrate that AdvST can produce samples that expand the\ncoverage on target domain data. Compared with the state-of-the-art methods,\nAdvST, despite being a simple method, is surprisingly competitive and achieves\nthe best average SDG performance on the Digits, PACS, and DomainNet datasets.\nOur code is available at https://github.com/gtzheng/AdvST.\n","authors":["Guangtao Zheng","Mengdi Huai","Aidong Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.12720v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12716v1","updated":"2023-12-20T02:22:49Z","published":"2023-12-20T02:22:49Z","title":"BloomVQA: Assessing Hierarchical Multi-modal Comprehension","summary":" We propose a novel VQA dataset, based on picture stories designed for\neducating young children, that aims to facilitate comprehensive evaluation and\ncharacterization of vision-language models on comprehension tasks. Unlike\ncurrent VQA datasets that often focus on fact-based memorization and simple\nreasoning tasks without principled scientific grounding, we collect data\ncontaining tasks reflecting different levels of comprehension and underlying\ncognitive processes, as laid out in Bloom's Taxonomy, a classic framework\nwidely adopted in education research. The proposed BloomVQA dataset can be\nmapped to a hierarchical graph-based representation of visual stories, enabling\nautomatic data augmentation and novel measures characterizing model consistency\nacross the underlying taxonomy. We demonstrate graded evaluation and\nreliability analysis based on our proposed consistency metrics on\nstate-of-the-art vision-language models. Our results suggest that, while\ncurrent models achieve the most gain on low-level comprehension tasks, they\ngenerally fall short on high-level tasks requiring more advanced comprehension\nand cognitive skills, as 38.0% drop in VQA accuracy is observed comparing\nlowest and highest level tasks. Furthermore, current models show consistency\npatterns misaligned with human comprehension in various scenarios, suggesting\nemergent structures of model behaviors.\n","authors":["Yunye Gong","Robik Shrestha","Jared Claypoole","Michael Cogswell","Arijit Ray","Christopher Kanan","Ajay Divakaran"],"pdf_url":"https://arxiv.org/pdf/2312.12716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08511v3","updated":"2023-12-20T02:21:20Z","published":"2023-08-16T17:07:40Z","title":"Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse\n Problems","summary":" Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial\ntechnologies in the field of medical imaging. Score-based models have proven to\nbe effective in addressing different inverse problems encountered in CT and\nMRI, such as sparse-view CT and fast MRI reconstruction. However, these models\nface challenges in achieving accurate three dimensional (3D) volumetric\nreconstruction. The existing score-based models primarily focus on\nreconstructing two dimensional (2D) data distribution, leading to\ninconsistencies between adjacent slices in the reconstructed 3D volumetric\nimages. To overcome this limitation, we propose a novel two-and-a-half order\nscore-based model (TOSM). During the training phase, our TOSM learns data\ndistributions in 2D space, which reduces the complexity of training compared to\ndirectly working on 3D volumes. However, in the reconstruction phase, the TOSM\nupdates the data distribution in 3D space, utilizing complementary scores along\nthree directions (sagittal, coronal, and transaxial) to achieve a more precise\nreconstruction. The development of TOSM is built on robust theoretical\nprinciples, ensuring its reliability and efficacy. Through extensive\nexperimentation on large-scale sparse-view CT and fast MRI datasets, our method\ndemonstrates remarkable advancements and attains state-of-the-art results in\nsolving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively\naddresses the inter-slice inconsistency issue, resulting in high-quality 3D\nvolumetric reconstruction.\n","authors":["Zirong Li","Yanyang Wang","Jianjia Zhang","Weiwen Wu","Hengyong Yu"],"pdf_url":"https://arxiv.org/pdf/2308.08511v3.pdf","comment":"10 pages, 13 figures"},{"id":"http://arxiv.org/abs/2303.12484v4","updated":"2023-12-20T02:14:25Z","published":"2023-03-22T11:51:49Z","title":"Label-Efficient Deep Learning in Medical Image Analysis: Challenges and\n Future Directions","summary":" Deep learning has seen rapid growth in recent years and achieved\nstate-of-the-art performance in a wide range of applications. However, training\nmodels typically requires expensive and time-consuming collection of large\nquantities of labeled data. This is particularly true within the scope of\nmedical imaging analysis (MIA), where data are limited and labels are expensive\nto be acquired. Thus, label-efficient deep learning methods are developed to\nmake comprehensive use of the labeled data as well as the abundance of\nunlabeled and weak-labeled data. In this survey, we extensively investigated\nover 300 recent papers to provide a comprehensive overview of recent progress\non label-efficient learning strategies in MIA. We first present the background\nof label-efficient learning and categorize the approaches into different\nschemes. Next, we examine the current state-of-the-art methods in detail\nthrough each scheme. Specifically, we provide an in-depth investigation,\ncovering not only canonical semi-supervised, self-supervised, and\nmulti-instance learning schemes, but also recently emerged active and\nannotation-efficient learning strategies. Moreover, as a comprehensive\ncontribution to the field, this survey not only elucidates the commonalities\nand unique features of the surveyed methods but also presents a detailed\nanalysis of the current challenges in the field and suggests potential avenues\nfor future research.\n","authors":["Cheng Jin","Zhengrui Guo","Yi Lin","Luyang Luo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2303.12484v4.pdf","comment":"Update Few-shot Methods"},{"id":"http://arxiv.org/abs/2312.11057v2","updated":"2023-12-20T01:40:15Z","published":"2023-12-18T09:40:38Z","title":"DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via\n Diffusion Models","summary":" Dataset sanitization is a widely adopted proactive defense against\npoisoning-based backdoor attacks, aimed at filtering out and removing poisoned\nsamples from training datasets. However, existing methods have shown limited\nefficacy in countering the ever-evolving trigger functions, and often leading\nto considerable degradation of benign accuracy. In this paper, we propose\nDataElixir, a novel sanitization approach tailored to purify poisoned datasets.\nWe leverage diffusion models to eliminate trigger features and restore benign\nfeatures, thereby turning the poisoned samples into benign ones. Specifically,\nwith multiple iterations of the forward and reverse process, we extract\nintermediary images and their predicted labels for each sample in the original\ndataset. Then, we identify anomalous samples in terms of the presence of label\ntransition of the intermediary images, detect the target label by quantifying\ndistribution discrepancy, select their purified images considering pixel and\nfeature distance, and determine their ground-truth labels by training a benign\nmodel. Experiments conducted on 9 popular attacks demonstrates that DataElixir\neffectively mitigates various complex attacks while exerting minimal impact on\nbenign accuracy, surpassing the performance of baseline defense methods.\n","authors":["Jiachen Zhou","Peizhuo Lv","Yibing Lan","Guozhu Meng","Kai Chen","Hualong Ma"],"pdf_url":"https://arxiv.org/pdf/2312.11057v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12691v1","updated":"2023-12-20T01:29:11Z","published":"2023-12-20T01:29:11Z","title":"How Good Are Deep Generative Models for Solving Inverse Problems?","summary":" Deep generative models, such as diffusion models, GANs, and IMLE, have shown\nimpressive capability in tackling inverse problems. However, the validity of\nmodel-generated solutions w.r.t. the forward problem and the reliability of\nassociated uncertainty estimates remain understudied. This study evaluates\nrecent diffusion-based, GAN-based, and IMLE-based methods on three inverse\nproblems, i.e., $16\\times$ super-resolution, colourization, and image\ndecompression. We assess the validity of these models' outputs as solutions to\nthe inverse problems and conduct a thorough analysis of the reliability of the\nmodels' estimates of uncertainty over the solution. Overall, we find that the\nIMLE-based CHIMLE method outperforms other methods in terms of producing valid\nsolutions and reliable uncertainty estimates.\n","authors":["Shichong Peng","Alireza Moazeni","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2312.12691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06385v5","updated":"2023-12-20T01:28:57Z","published":"2023-04-13T10:37:41Z","title":"TransHP: Image Classification with Hierarchical Prompting","summary":" This paper explores a hierarchical prompting mechanism for the hierarchical\nimage classification (HIC) task. Different from prior HIC methods, our\nhierarchical prompting is the first to explicitly inject ancestor-class\ninformation as a tokenized hint that benefits the descendant-class\ndiscrimination. We think it well imitates human visual recognition, i.e.,\nhumans may use the ancestor class as a prompt to draw focus on the subtle\ndifferences among descendant classes. We model this prompting mechanism into a\nTransformer with Hierarchical Prompting (TransHP). TransHP consists of three\nsteps: 1) learning a set of prompt tokens to represent the coarse (ancestor)\nclasses, 2) on-the-fly predicting the coarse class of the input image at an\nintermediate block, and 3) injecting the prompt token of the predicted coarse\nclass into the intermediate feature. Though the parameters of TransHP maintain\nthe same for all input images, the injected coarse-class prompt conditions\n(modifies) the subsequent feature extraction and encourages a dynamic focus on\nrelatively subtle differences among the descendant classes. Extensive\nexperiments show that TransHP improves image classification on accuracy (e.g.,\nimproving ViT-B/16 by +2.83% ImageNet classification accuracy), training data\nefficiency (e.g., +12.69% improvement under 10% ImageNet training data), and\nmodel explainability. Moreover, TransHP also performs favorably against prior\nHIC methods, showing that TransHP well exploits the hierarchical information.\nThe code is available at: https://github.com/WangWenhao0716/TransHP.\n","authors":["Wenhao Wang","Yifan Sun","Wei Li","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2304.06385v5.pdf","comment":"Accepted to NeurIPS 2023; Released code"},{"id":"http://arxiv.org/abs/2303.05122v2","updated":"2023-12-20T01:08:15Z","published":"2023-03-09T09:05:47Z","title":"M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios","summary":" In realistic open-set scenarios where labels of a part of testing data are\ntotally unknown, when vision-language (VL) prompt learning methods encounter\ninputs related to unknown classes (i.e., not seen during training), they always\npredict them as one of the training classes. The exhibited label bias causes\ndifficulty in open set recognition (OSR), in which an image should be correctly\npredicted as one of the known classes or the unknown one. To achieve this goal,\nwe propose a vision-language prompt tuning method with mitigated label bias\n(M-Tuning). It introduces open words from the WordNet to extend the range of\nwords forming the prompt texts from only closed-set label words to more, and\nthus prompts are tuned in a simulated open-set scenario. Besides, inspired by\nthe observation that classifying directly on large datasets causes a much\nhigher false positive rate than on small datasets, we propose a Combinatorial\nTuning and Testing (CTT) strategy for improving performance. CTT decomposes\nM-Tuning on large datasets as multiple independent group-wise tuning on fewer\nclasses, then makes accurate and comprehensive predictions by selecting the\noptimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the\nliterature, especially for prompt methods, we contribute new baselines for fair\ncomparisons. Our method achieves the best performance on datasets with various\nscales, and extensive ablation studies also validate its effectiveness.\n","authors":["Ning Liao","Xiaopeng Zhang","Min Cao","Junchi Yan","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2303.05122v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12680v1","updated":"2023-12-20T00:44:04Z","published":"2023-12-20T00:44:04Z","title":"Trajectory Approximation of Video Based on Phase Correlation for Forward\n Facing Camera","summary":" In this paper, we introduce an innovative approach for extracting\ntrajectories from a camera sensor in GPS-denied environments, leveraging visual\nodometry. The system takes video footage captured by a forward-facing camera\nmounted on a vehicle as input, with the output being a chain code representing\nthe camera's trajectory. The proposed methodology involves several key steps.\nFirstly, we employ phase correlation between consecutive frames of the video to\nextract essential information. Subsequently, we introduce a novel chain code\nmethod termed \"dynamic chain code,\" which is based on the x-shift values\nderived from the phase correlation. The third step involves determining\ndirectional changes (forward, left, right) by establishing thresholds and\nextracting the corresponding chain code. This extracted code is then stored in\na buffer for further processing. Notably, our system outperforms traditional\nmethods reliant on spatial features, exhibiting greater speed and robustness in\nnoisy environments. Importantly, our approach operates without external camera\ncalibration information. Moreover, by incorporating visual odometry, our system\nenhances its accuracy in estimating camera motion, providing a more\ncomprehensive understanding of trajectory dynamics. Finally, the system\nculminates in the visualization of the normalized camera motion trajectory.\n","authors":["Abdulkadhem A. Abdulkadhem"],"pdf_url":"https://arxiv.org/pdf/2312.12680v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13489v1","updated":"2023-12-20T23:52:53Z","published":"2023-12-20T23:52:53Z","title":"Embedded Shape Matching in Photogrammetry Data for Modeling Making\n Knowledge","summary":" In three-dimensional models obtained by photogrammetry of existing\nstructures, all of the shapes that the eye can select cannot always find their\nequivalents in the geometric components of the model. However, the matching of\nmeaningful parts and assemblages with the records acquired with rapid and\ndetailed documentation methods will provide an advantage for the creation of\ninformation models of existing structures. While aiming to produce answers to\nthis problem and in order to overcome the difficulties of pattern recognition\nin three-dimensional models, we used two-dimensional samples obtained by\nprojection. Processing techniques such as ambient occlusion, curvature and\nnormal maps are commonly used in modern computer graphics applications that\nenable the representation of three-dimensional surface properties in\ntwo-dimensional data sets. The method we propose is based on the recognition of\npatterns through these mappings instead of the usual light-based visualization.\nThe first stage of the application is photogrammetric capture of a few examples\nof Zeugma mosaics and three-dimensional digital modeling of a set of Seljuk era\nbrick walls based on knowledge obtained through architectural history\nliterature. The second stage covers the creation of digital models byprocessing\nthe surface representation obtained from this data using Alice Vision,\nOpenCV-Python, and Autodesk Maya to include information on aspects of the\nmaking of the walls. What is envisioned for the next stages is that the mapping\ndata contributes and supports the knowledge for rule-based design and making\nprocessesof cultural heritage.\n","authors":["Demircan Tas","Mine Özkar"],"pdf_url":"https://arxiv.org/pdf/2312.13489v1.pdf","comment":"9 pages, in Turkish language. 6 figures. In: MSTAS 2019 - (XIII.\n Computational Design in Architecture National Symposium) pp. 313-326.,\n Kocaeli, Turkey (2019)"},{"id":"http://arxiv.org/abs/2311.18260v3","updated":"2023-12-20T23:08:32Z","published":"2023-11-30T05:38:34Z","title":"Consensus, dissensus and synergy between clinicians and specialist\n foundation models in radiology report generation","summary":" Radiology reports are an instrumental part of modern medicine, informing key\nclinical decisions such as diagnosis and treatment. The worldwide shortage of\nradiologists, however, restricts access to expert care and imposes heavy\nworkloads, contributing to avoidable errors and delays in report delivery.\nWhile recent progress in automated report generation with vision-language\nmodels offer clear potential in ameliorating the situation, the path to\nreal-world adoption has been stymied by the challenge of evaluating the\nclinical quality of AI-generated reports. In this study, we build a\nstate-of-the-art report generation system for chest radiographs,\n$\\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation\nmodel on radiology data. To evaluate the quality of the AI-generated reports, a\ngroup of 16 certified radiologists provide detailed evaluations of AI-generated\nand human written reports for chest X-rays from an intensive care setting in\nthe United States and an inpatient setting in India. At least one radiologist\n(out of two per case) preferred the AI report to the ground truth report in\nover 60$\\%$ of cases for both datasets. Amongst the subset of AI-generated\nreports that contain errors, the most frequently cited reasons were related to\nthe location and finding, whereas for human written reports, most mistakes were\nrelated to severity and finding. This disparity suggested potential\ncomplementarity between our AI system and human experts, prompting us to\ndevelop an assistive scenario in which Flamingo-CXR generates a first-draft\nreport, which is subsequently revised by a clinician. This is the first\ndemonstration of clinician-AI collaboration for report writing, and the\nresultant reports are assessed to be equivalent or preferred by at least one\nradiologist to reports written by experts alone in 80$\\%$ of in-patient cases\nand 60$\\%$ of intensive care cases.\n","authors":["Ryutaro Tanno","David G. T. Barrett","Andrew Sellergren","Sumedh Ghaisas","Sumanth Dathathri","Abigail See","Johannes Welbl","Karan Singhal","Shekoofeh Azizi","Tao Tu","Mike Schaekermann","Rhys May","Roy Lee","SiWai Man","Zahra Ahmed","Sara Mahdavi","Yossi Matias","Joelle Barral","Ali Eslami","Danielle Belgrave","Vivek Natarajan","Shravya Shetty","Pushmeet Kohli","Po-Sen Huang","Alan Karthikesalingam","Ira Ktena"],"pdf_url":"https://arxiv.org/pdf/2311.18260v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05152v2","updated":"2023-12-20T23:06:09Z","published":"2023-11-09T05:24:20Z","title":"Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual\n Downstream Tasks","summary":" In recent years, the deployment of large-scale pre-trained models in\naudio-visual downstream tasks has yielded remarkable outcomes. However, these\nmodels, primarily trained on single-modality unconstrained datasets, still\nencounter challenges in feature extraction for multi-modal tasks, leading to\nsuboptimal performance. This limitation arises due to the introduction of\nirrelevant modality-specific information during encoding, which adversely\naffects the performance of downstream tasks. To address this challenge, this\npaper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention\nmechanism. This mechanism leverages audio and visual modalities as soft prompts\nto dynamically adjust the parameters of pre-trained models based on the current\nmulti-modal input features. Specifically, the DG-SCT module incorporates\ntrainable cross-modal interaction layers into pre-trained audio-visual\nencoders, allowing adaptive extraction of crucial information from the current\nmodality across spatial, channel, and temporal dimensions, while preserving the\nfrozen parameters of large-scale pre-trained models. Experimental evaluations\ndemonstrate that our proposed model achieves state-of-the-art results across\nmultiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our\nmodel exhibits promising performance in challenging few-shot and zero-shot\nscenarios. The source code and pre-trained models are available at\nhttps://github.com/haoyi-duan/DG-SCT.\n","authors":["Haoyi Duan","Yan Xia","Mingze Zhou","Li Tang","Jieming Zhu","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.05152v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.00586v2","updated":"2023-12-20T22:54:48Z","published":"2023-03-01T15:28:26Z","title":"FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling","summary":" Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way\nto improve top-line metrics and to outperform a larger single model. In this\nwork, we go beyond top-line metrics and instead explore the impact of\nensembling on subgroup performances. Surprisingly, we observe that even with a\nsimple homogeneous ensemble -- all the individual DNNs share the same training\nset, architecture, and design choices -- the minority group performance\ndisproportionately improves with the number of models compared to the majority\ngroup, i.e. fairness naturally emerges from ensembling. Even more surprising,\nwe find that this gain keeps occurring even when a large number of models is\nconsidered, e.g. $20$, despite the fact that the average performance of the\nensemble plateaus with fewer models. Our work establishes that simple DNN\nensembles can be a powerful tool for alleviating disparate impact from DNN\nclassifiers, thus curbing algorithmic harm. We also explore why this is the\ncase. We find that even in homogeneous ensembles, varying the sources of\nstochasticity through parameter initialization, mini-batch sampling, and\ndata-augmentation realizations, results in different fairness outcomes.\n","authors":["Wei-Yin Ko","Daniel D'souza","Karina Nguyen","Randall Balestriero","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2303.00586v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10600v2","updated":"2023-12-20T22:53:23Z","published":"2023-12-17T04:26:42Z","title":"How to Efficiently Annotate Images for Best-Performing Deep Learning\n Based Segmentation Models: An Empirical Study with Weak and Noisy Annotations\n and Segment Anything Model","summary":" Deep neural networks (DNNs) have been deployed for many image segmentation\ntasks and achieved outstanding performance. However, preparing a dataset for\ntraining segmentation DNNs is laborious and costly since typically pixel-level\nannotations are provided for each object of interest. To alleviate this issue,\none can provide only weak labels such as bounding boxes or scribbles, or less\naccurate (noisy) annotations of the objects. These are significantly faster to\ngenerate and thus result in more annotated images given the same time budget.\nHowever, the reduction in quality might negatively affect the segmentation\nperformance of the resulting model. In this study, we perform a thorough\ncost-effectiveness evaluation of several weak and noisy labels. We considered\n11 variants of annotation strategies and 4 datasets. We conclude that the\ncommon practice of accurately outlining the objects of interest is virtually\nnever the optimal approach when the annotation time is limited, even if notable\nannotation time is available (10s of hours). Annotation approaches that stood\nout in such scenarios were (1) contour-based annotation with rough continuous\ntraces, (2) polygon-based annotation with few vertices, and (3) box annotations\ncombined with the Segment Anything Model (SAM). In situations where unlimited\nannotation time was available, precise annotations still lead to the highest\nsegmentation model performance.\n","authors":["Yixin Zhang","Shen Zhao","Hanxue Gu","Maciej A. Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2312.10600v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13471v1","updated":"2023-12-20T22:42:17Z","published":"2023-12-20T22:42:17Z","title":"NeRF-VO: Real-Time Sparse Visual Odometry with Neural Radiance Fields","summary":" We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that\nintegrates learning-based sparse visual odometry for low-latency camera\ntracking and a neural radiance scene representation for sophisticated dense\nreconstruction and novel view synthesis. Our system initializes camera poses\nusing sparse visual odometry and obtains view-dependent dense geometry priors\nfrom a monocular depth prediction network. We harmonize the scale of poses and\ndense geometry, treating them as supervisory cues to train a neural implicit\nscene representation. NeRF-VO demonstrates exceptional performance in both\nphotometric and geometric fidelity of the scene representation by jointly\noptimizing a sliding window of keyframed poses and the underlying dense\ngeometry, which is accomplished through training the radiance field with volume\nrendering. We surpass state-of-the-art methods in pose estimation accuracy,\nnovel view synthesis fidelity, and dense reconstruction quality across a\nvariety of synthetic and real-world datasets, while achieving a higher camera\ntracking frequency and consuming less GPU memory.\n","authors":["Jens Naumann","Binbin Xu","Stefan Leutenegger","Xingxing Zuo"],"pdf_url":"https://arxiv.org/pdf/2312.13471v1.pdf","comment":"10 tables, 4 figures"},{"id":"http://arxiv.org/abs/2312.13469v1","updated":"2023-12-20T22:36:37Z","published":"2023-12-20T22:36:37Z","title":"Neural feels with neural fields: Visuo-tactile perception for in-hand\n manipulation","summary":" To achieve human-level dexterity, robots must infer spatial awareness from\nmultimodal sensing to reason over contact interactions. During in-hand\nmanipulation of novel objects, such spatial awareness involves estimating the\nobject's pose and shape. The status quo for in-hand perception primarily\nemploys vision, and restricts to tracking a priori known objects. Moreover,\nvisual occlusion of objects in-hand is imminent during manipulation, preventing\ncurrent systems to push beyond tasks without occlusion. We combine vision and\ntouch sensing on a multi-fingered hand to estimate an object's pose and shape\nduring in-hand manipulation. Our method, NeuralFeels, encodes object geometry\nby learning a neural field online and jointly tracks it by optimizing a pose\ngraph problem. We study multimodal in-hand perception in simulation and the\nreal-world, interacting with different objects via a proprioception-driven\npolicy. Our experiments show final reconstruction F-scores of $81$% and average\npose drifts of $4.7\\,\\text{mm}$, further reduced to $2.3\\,\\text{mm}$ with known\nCAD models. Additionally, we observe that under heavy visual occlusion we can\nachieve up to $94$% improvements in tracking compared to vision-only methods.\nOur results demonstrate that touch, at the very least, refines and, at the very\nbest, disambiguates visual estimates during in-hand manipulation. We release\nour evaluation dataset of 70 experiments, FeelSight, as a step towards\nbenchmarking in this domain. Our neural representation driven by multimodal\nsensing can serve as a perception backbone towards advancing robot dexterity.\nVideos can be found on our project website\nhttps://suddhu.github.io/neural-feels/\n","authors":["Sudharshan Suresh","Haozhi Qi","Tingfan Wu","Taosha Fan","Luis Pineda","Mike Lambeta","Jitendra Malik","Mrinal Kalakrishnan","Roberto Calandra","Michael Kaess","Joseph Ortiz","Mustafa Mukadam"],"pdf_url":"https://arxiv.org/pdf/2312.13469v1.pdf","comment":"43 pages, 20 figures, 1 table; https://suddhu.github.io/neural-feels/"},{"id":"http://arxiv.org/abs/2309.08738v2","updated":"2023-12-20T22:20:46Z","published":"2023-09-15T19:56:15Z","title":"AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual\n Masked Autoencoder","summary":" Learning high-quality video representation has shown significant applications\nin computer vision and remains challenging. Previous work based on mask\nautoencoders such as ImageMAE and VideoMAE has proven the effectiveness of\nlearning representations in images and videos through reconstruction strategy\nin the visual modality. However, these models exhibit inherent limitations,\nparticularly in scenarios where extracting features solely from the visual\nmodality proves challenging, such as when dealing with low-resolution and\nblurry original videos. Based on this, we propose AV-MaskEnhancer for learning\nhigh-quality video representation by combining visual and audio information.\nOur approach addresses the challenge by demonstrating the complementary nature\nof audio and video features in cross-modality content. Moreover, our result of\nthe video classification task on the UCF101 dataset outperforms the existing\nwork and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a\ntop-5 accuracy of 99.9%.\n","authors":["Xingjian Diao","Ming Cheng","Shitong Cheng"],"pdf_url":"https://arxiv.org/pdf/2309.08738v2.pdf","comment":"2023 IEEE 35th International Conference on Tools with Artificial\n Intelligence (ICTAI)"},{"id":"http://arxiv.org/abs/2312.13449v1","updated":"2023-12-20T21:58:45Z","published":"2023-12-20T21:58:45Z","title":"Building Lane-Level Maps from Aerial Images","summary":" Detecting lane lines from sensors is becoming an increasingly significant\npart of autonomous driving systems. However, less development has been made on\nhigh-definition lane-level mapping based on aerial images, which could\nautomatically build and update offline maps for auto-driving systems. To this\nend, our work focuses on extracting fine-level detailed lane lines together\nwith their topological structures. This task is challenging since it requires\nlarge amounts of data covering different lane types, terrain and regions. In\nthis paper, we introduce for the first time a large-scale aerial image dataset\nbuilt for lane detection, with high-quality polyline lane annotations on\nhigh-resolution images of around 80 kilometers of road. Moreover, we developed\na baseline deep learning lane detection method from aerial images, called\nAerialLaneNet, consisting of two stages. The first stage is to produce\ncoarse-grained results at point level, and the second stage exploits the\ncoarse-grained results and feature to perform the vertex-matching task,\nproducing fine-grained lanes with topology. The experiments show our approach\nachieves significant improvement compared with the state-of-the-art methods on\nour new dataset. Our code and new dataset are available at\nhttps://github.com/Jiawei-Yao0812/AerialLaneNet.\n","authors":["Jiawei Yao","Xiaochao Pan","Tong Wu","Xiaofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13449v1.pdf","comment":"Accepted at ICASSP 2024. Project page:\n https://github.com/Jiawei-Yao0812/AerialLaneNet"},{"id":"http://arxiv.org/abs/2308.07528v2","updated":"2023-12-20T21:40:02Z","published":"2023-08-15T01:54:59Z","title":"Confidence Contours: Uncertainty-Aware Annotation for Medical Semantic\n Segmentation","summary":" Medical image segmentation modeling is a high-stakes task where understanding\nof uncertainty is crucial for addressing visual ambiguity. Prior work has\ndeveloped segmentation models utilizing probabilistic or generative mechanisms\nto infer uncertainty from labels where annotators draw a singular boundary.\nHowever, as these annotations cannot represent an individual annotator's\nuncertainty, models trained on them produce uncertainty maps that are difficult\nto interpret. We propose a novel segmentation representation, Confidence\nContours, which uses high- and low-confidence ``contours'' to capture\nuncertainty directly, and develop a novel annotation system for collecting\ncontours. We conduct an evaluation on the Lung Image Dataset Consortium (LIDC)\nand a synthetic dataset. From an annotation study with 30 participants, results\nshow that Confidence Contours provide high representative capacity without\nconsiderably higher annotator effort. We also find that general-purpose\nsegmentation models can learn Confidence Contours at the same performance level\nas standard singular annotations. Finally, from interviews with 5 medical\nexperts, we find that Confidence Contour maps are more interpretable than\nBayesian maps due to representation of structural uncertainty.\n","authors":["Andre Ye","Quan Ze Chen","Amy Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.07528v2.pdf","comment":"10 pages content, 12 pages total. Accepted to HCOMP '23"},{"id":"http://arxiv.org/abs/2312.13440v1","updated":"2023-12-20T21:30:55Z","published":"2023-12-20T21:30:55Z","title":"MGAug: Multimodal Geometric Augmentation in Latent Spaces of Image\n Deformations","summary":" Geometric transformations have been widely used to augment the size of\ntraining images. Existing methods often assume a unimodal distribution of the\nunderlying transformations between images, which limits their power when data\nwith multimodal distributions occur. In this paper, we propose a novel model,\nMultimodal Geometric Augmentation (MGAug), that for the first time generates\naugmenting transformations in a multimodal latent space of geometric\ndeformations. To achieve this, we first develop a deep network that embeds the\nlearning of latent geometric spaces of diffeomorphic transformations (a.k.a.\ndiffeomorphisms) in a variational autoencoder (VAE). A mixture of multivariate\nGaussians is formulated in the tangent space of diffeomorphisms and serves as a\nprior to approximate the hidden distribution of image transformations. We then\naugment the original training dataset by deforming images using randomly\nsampled transformations from the learned multimodal latent space of VAE. To\nvalidate the efficiency of our model, we jointly learn the augmentation\nstrategy with two distinct domain-specific tasks: multi-class classification on\n2D synthetic datasets and segmentation on real 3D brain magnetic resonance\nimages (MRIs). We also compare MGAug with state-of-the-art transformation-based\nimage augmentation algorithms. Experimental results show that our proposed\napproach outperforms all baselines by significantly improved prediction\naccuracy. Our code is publicly available at\nhttps://github.com/tonmoy-hossain/MGAug.\n","authors":["Tonmoy Hossain","Jian Wang","Miaomiao Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.14404v5","updated":"2023-12-20T21:30:03Z","published":"2022-10-26T01:00:57Z","title":"Adversarial Purification with the Manifold Hypothesis","summary":" In this work, we formulate a novel framework for adversarial robustness using\nthe manifold hypothesis. This framework provides sufficient conditions for\ndefending against adversarial examples. We develop an adversarial purification\nmethod with this framework. Our method combines manifold learning with\nvariational inference to provide adversarial robustness without the need for\nexpensive adversarial training. Experimentally, our approach can provide\nadversarial robustness even if attackers are aware of the existence of the\ndefense. In addition, our method can also serve as a test-time defense\nmechanism for variational autoencoders.\n","authors":["Zhaoyuan Yang","Zhiwei Xu","Jing Zhang","Richard Hartley","Peter Tu"],"pdf_url":"https://arxiv.org/pdf/2210.14404v5.pdf","comment":"Extended version of paper accepted at AAAI 2024 with supplementary\n materials"},{"id":"http://arxiv.org/abs/2312.13422v1","updated":"2023-12-20T20:52:01Z","published":"2023-12-20T20:52:01Z","title":"Texture Matching GAN for CT Image Enhancement","summary":" Deep neural networks (DNN) are commonly used to denoise and sharpen X-ray\ncomputed tomography (CT) images with the goal of reducing patient X-ray dosage\nwhile maintaining reconstruction quality. However, naive application of\nDNN-based methods can result in image texture that is undesirable in clinical\napplications. Alternatively, generative adversarial network (GAN) based methods\ncan produce appropriate texture, but naive application of GANs can introduce\ninaccurate or even unreal image detail. In this paper, we propose a texture\nmatching generative adversarial network (TMGAN) that enhances CT images while\ngenerating an image texture that can be matched to a target texture. We use\nparallel generators to separate anatomical features from the generated texture,\nwhich allows the GAN to be trained to match the desired texture without\ndirectly affecting the underlying CT image. We demonstrate that TMGAN generates\nenhanced image quality while also producing image texture that is desirable for\nclinical application.\n","authors":["Madhuri Nagare","Gregery T. Buzzard","Charles A. Bouman"],"pdf_url":"https://arxiv.org/pdf/2312.13422v1.pdf","comment":"Submitted to IEEE Transactions on Medical Imaging"},{"id":"http://arxiv.org/abs/2312.13396v1","updated":"2023-12-20T19:56:53Z","published":"2023-12-20T19:56:53Z","title":"EPNet: An Efficient Pyramid Network for Enhanced Single-Image\n Super-Resolution with Reduced Computational Requirements","summary":" Single-image super-resolution (SISR) has seen significant advancements\nthrough the integration of deep learning. However, the substantial\ncomputational and memory requirements of existing methods often limit their\npractical application. This paper introduces a new Efficient Pyramid Network\n(EPNet) that harmoniously merges an Edge Split Pyramid Module (ESPM) with a\nPanoramic Feature Extraction Module (PFEM) to overcome the limitations of\nexisting methods, particularly in terms of computational efficiency. The ESPM\napplies a pyramid-based channel separation strategy, boosting feature\nextraction while maintaining computational efficiency. The PFEM, a novel fusion\nof CNN and Transformer structures, enables the concurrent extraction of local\nand global features, thereby providing a panoramic view of the image landscape.\nOur architecture integrates the PFEM in a manner that facilitates the\nstreamlined exchange of feature information and allows for the further\nrefinement of image texture details. Experimental results indicate that our\nmodel outperforms existing state-of-the-art methods in image resolution\nquality, while considerably decreasing computational and memory costs. This\nresearch contributes to the ongoing evolution of efficient and practical SISR\nmethodologies, bearing broader implications for the field of computer vision.\n","authors":["Xin Xu","Jinman Park","Paul Fieguth"],"pdf_url":"https://arxiv.org/pdf/2312.13396v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.08965v3","updated":"2023-12-20T19:47:36Z","published":"2022-06-17T18:40:11Z","title":"KitBit: A New AI Model for Solving Intelligence Tests and Numerical\n Series","summary":" The resolution of intelligence tests, in particular numerical sequences, has\nbeen of great interest in the evaluation of AI systems. We present a new\ncomputational model called KitBit that uses a reduced set of algorithms and\ntheir combinations to build a predictive model that finds the underlying\npattern in numerical sequences, such as those included in IQ tests and others\nof much greater complexity. We present the fundamentals of the model and its\napplication in different cases. First, the system is tested on a set of number\nseries used in IQ tests collected from various sources. Next, our model is\nsuccessfully applied on the sequences used to evaluate the models reported in\nthe literature. In both cases, the system is capable of solving these types of\nproblems in less than a second using standard computing power. Finally,\nKitBit's algorithms have been applied for the first time to the complete set of\nentire sequences of the well-known OEIS database. We find a pattern in the form\nof a list of algorithms and predict the following terms in the largest number\nof series to date. These results demonstrate the potential of KitBit to solve\ncomplex problems that could be represented numerically.\n","authors":["Víctor Corsino","José Manuel Gilpérez","Luis Herrera"],"pdf_url":"https://arxiv.org/pdf/2206.08965v3.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2301.00114v3","updated":"2023-12-20T19:41:24Z","published":"2022-12-31T04:11:25Z","title":"Skeletal Video Anomaly Detection using Deep Learning: Survey, Challenges\n and Future Directions","summary":" The existing methods for video anomaly detection mostly utilize videos\ncontaining identifiable facial and appearance-based features. The use of videos\nwith identifiable faces raises privacy concerns, especially when used in a\nhospital or community-based setting. Appearance-based features can also be\nsensitive to pixel-based noise, straining the anomaly detection methods to\nmodel the changes in the background and making it difficult to focus on the\nactions of humans in the foreground. Structural information in the form of\nskeletons describing the human motion in the videos is privacy-protecting and\ncan overcome some of the problems posed by appearance-based features. In this\npaper, we present a survey of privacy-protecting deep learning anomaly\ndetection methods using skeletons extracted from videos. We present a novel\ntaxonomy of algorithms based on the various learning approaches. We conclude\nthat skeleton-based approaches for anomaly detection can be a plausible\nprivacy-protecting alternative for video anomaly detection. Lastly, we identify\nmajor open research questions and provide guidelines to address them.\n","authors":["Pratik K. Mishra","Alex Mihailidis","Shehroz S. Khan"],"pdf_url":"https://arxiv.org/pdf/2301.00114v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2108.02893v2","updated":"2023-12-20T19:32:36Z","published":"2021-08-06T00:04:02Z","title":"Basis Scaling and Double Pruning for Efficient Inference in\n Network-Based Transfer Learning","summary":" Network-based transfer learning allows the reuse of deep learning features\nwith limited data, but the resulting models can be unnecessarily large.\nAlthough network pruning can improve inference efficiency, existing algorithms\nusually require fine-tuning that may not be suitable for small datasets. In\nthis paper, using the singular value decomposition, we decompose a\nconvolutional layer into two layers: a convolutional layer with the orthonormal\nbasis vectors as the filters, and a \"BasisScalingConv\" layer which is\nresponsible for rescaling the features and transforming them back to the\noriginal space. As the filters in each decomposed layer are linearly\nindependent, when using the proposed basis scaling factors with the Taylor\napproximation of importance, pruning can be more effective and fine-tuning\nindividual weights is unnecessary. Furthermore, as the numbers of input and\noutput channels of the original convolutional layer remain unchanged after\nbasis pruning, it is applicable to virtually all architectures and can be\ncombined with existing pruning algorithms for double pruning to further\nincrease the pruning capability. When transferring knowledge from ImageNet\npre-trained models to different target domains, with less than 1% reduction in\nclassification accuracies, we can achieve pruning ratios up to 74.6% for\nCIFAR-10 and 98.9% for MNIST in model parameters.\n","authors":["Ken C. L. Wong","Satyananda Kashyap","Mehdi Moradi"],"pdf_url":"https://arxiv.org/pdf/2108.02893v2.pdf","comment":"This paper was accepted by Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2312.13377v1","updated":"2023-12-20T19:08:49Z","published":"2023-12-20T19:08:49Z","title":"SADA: Semantic adversarial unsupervised domain adaptation for Temporal\n Action Localization","summary":" Temporal Action Localization (TAL) is a complex task that poses relevant\nchallenges, particularly when attempting to generalize on new -- unseen --\ndomains in real-world applications. These scenarios, despite realistic, are\noften neglected in the literature, exposing these solutions to important\nperformance degradation. In this work, we tackle this issue by introducing, for\nthe first time, an approach for Unsupervised Domain Adaptation (UDA) in sparse\nTAL, which we refer to as Semantic Adversarial unsupervised Domain Adaptation\n(SADA). Our contribution is threefold: (1) we pioneer the development of a\ndomain adaptation model that operates on realistic sparse action detection\nbenchmarks; (2) we tackle the limitations of global-distribution alignment\ntechniques by introducing a novel adversarial loss that is sensitive to local\nclass distributions, ensuring finer-grained adaptation; and (3) we present a\nnovel experimental setup, based on EpicKitchens100, that evaluates multiple\ntypes of domain shifts in a comprehensive manner. Our experimental results\nindicate that SADA improves the adaptation across domains when compared to\nfully supervised state-of-the-art and alternative UDA methods, attaining a\nrelative performance boost of up to 14%.\n","authors":["David Pujol-Perich","Albert Clapés","Sergio Escalera"],"pdf_url":"https://arxiv.org/pdf/2312.13377v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2312.13264v1","updated":"2023-12-20T18:41:44Z","published":"2023-12-20T18:41:44Z","title":"dIR -- Discrete Information Retrieval: Conversational Search over\n Unstructured (and Structured) Data with Large Language Models","summary":" Data is stored in both structured and unstructured form. Querying both, to\npower natural language conversations, is a challenge. This paper introduces\ndIR, Discrete Information Retrieval, providing a unified interface to query\nboth free text and structured knowledge. Specifically, a Large Language Model\n(LLM) transforms text into expressive representation. After the text is\nextracted into columnar form, it can then be queried via a text-to-SQL Semantic\nParser, with an LLM converting natural language into SQL. Where desired, such\nconversation may be effected by a multi-step reasoning conversational agent. We\nvalidate our approach via a proprietary question/answer data set, concluding\nthat dIR makes a whole new class of queries on free text possible when compared\nto traditionally fine-tuned dense-embedding-model-based Information Retrieval\n(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR\ncan succeed where no other method stands a chance.\n","authors":["Pablo M. Rodriguez Bertorello","Jean Rodmond Junior Laguerre"],"pdf_url":"https://arxiv.org/pdf/2312.13264v1.pdf","comment":"8 pages, 5 figures, Association for Computational Linguistics"},{"id":"http://arxiv.org/abs/2306.01266v2","updated":"2023-12-20T17:01:04Z","published":"2023-06-02T04:43:21Z","title":"Self Contrastive Learning for Session-based Recommendation","summary":" Session-based recommendation, which aims to predict the next item of users'\ninterest as per an existing sequence interaction of items, has attracted\ngrowing applications of Contrastive Learning (CL) with improved user and item\nrepresentations. However, these contrastive objectives: (1) serve a similar\nrole as the cross-entropy loss while ignoring the item representation space\noptimisation; and (2) commonly require complicated modelling, including complex\npositive/negative sample constructions and extra data augmentation. In this\nwork, we introduce Self-Contrastive Learning (SCL), which simplifies the\napplication of CL and enhances the performance of state-of-the-art CL-based\nrecommendation techniques. Specifically, SCL is formulated as an objective\nfunction that directly promotes a uniform distribution among item\nrepresentations and efficiently replaces all the existing contrastive objective\ncomponents of state-of-the-art models. Unlike previous works, SCL eliminates\nthe need for any positive/negative sample construction or data augmentation,\nleading to enhanced interpretability of the item representation space and\nfacilitating its extensibility to existing recommender systems. Through\nexperiments on three benchmark datasets, we demonstrate that SCL consistently\nimproves the performance of state-of-the-art models with statistical\nsignificance. Notably, our experiments show that SCL improves the performance\nof two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and\n11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks.\nAdditionally, our analysis elucidates the improvement in terms of alignment and\nuniformity of representations, as well as the effectiveness of SCL with a low\ncomputational cost.\n","authors":["Zhengxiang Shi","Xi Wang","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2306.01266v2.pdf","comment":"ECIR 2024 (Full Paper) Camera-ready Version. Code is available at\n https://github.com/ZhengxiangShi/SelfContrastiveLearningRecSys"},{"id":"http://arxiv.org/abs/2312.10743v2","updated":"2023-12-20T16:11:14Z","published":"2023-12-17T15:28:06Z","title":"A Unified Framework for Multi-Domain CTR Prediction via Large Language\n Models","summary":" Click-Through Rate (CTR) prediction is a crucial task in online\nrecommendation platforms as it involves estimating the probability of user\nengagement with advertisements or items by clicking on them. Given the\navailability of various services like online shopping, ride-sharing, food\ndelivery, and professional services on commercial platforms, recommendation\nsystems in these platforms are required to make CTR predictions across multiple\ndomains rather than just a single domain. However, multi-domain click-through\nrate (MDCTR) prediction remains a challenging task in online recommendation due\nto the complex mutual influence between domains. Traditional MDCTR models\ntypically encode domains as discrete identifiers, ignoring rich semantic\ninformation underlying. Consequently, they can hardly generalize to new\ndomains. Besides, existing models can be easily dominated by some specific\ndomains, which results in significant performance drops in the other domains\n(\\ie the ``seesaw phenomenon``). In this paper, we propose a novel solution\nUni-CTR to address the above challenges. Uni-CTR leverages a backbone Large\nLanguage Model (LLM) to learn layer-wise semantic representations that capture\ncommonalities between domains. Uni-CTR also uses several domain-specific\nnetworks to capture the characteristics of each domain. Note that we design a\nmasked loss strategy so that these domain-specific networks are decoupled from\nbackbone LLM. This allows domain-specific networks to remain unchanged when\nincorporating new or removing domains, thereby enhancing the flexibility and\nscalability of the system significantly. Experimental results on three public\ndatasets show that Uni-CTR outperforms the state-of-the-art (SOTA) MDCTR models\nsignificantly. Furthermore, Uni-CTR demonstrates remarkable effectiveness in\nzero-shot prediction. We have applied Uni-CTR in industrial scenarios,\nconfirming its efficiency.\n","authors":["Zichuan Fu","Xiangyang Li","Chuhan Wu","Yichao Wang","Kuicai Dong","Xiangyu Zhao","Mengchen Zhao","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2312.10743v2.pdf","comment":"Still being revised"},{"id":"http://arxiv.org/abs/2312.10080v2","updated":"2023-12-20T12:01:45Z","published":"2023-12-10T18:33:45Z","title":"No prejudice! Fair Federated Graph Neural Networks for Personalized\n Recommendation","summary":" Ensuring fairness in Recommendation Systems (RSs) across demographic groups\nis critical due to the increased integration of RSs in applications such as\npersonalized healthcare, finance, and e-commerce. Graph-based RSs play a\ncrucial role in capturing intricate higher-order interactions among entities.\nHowever, integrating these graph models into the Federated Learning (FL)\nparadigm with fairness constraints poses formidable challenges as this requires\naccess to the entire interaction graph and sensitive user information (such as\ngender, age, etc.) at the central server. This paper addresses the pervasive\nissue of inherent bias within RSs for different demographic groups without\ncompromising the privacy of sensitive user attributes in FL environment with\nthe graph-based model. To address the group bias, we propose F2PGNN (Fair\nFederated Personalized Graph Neural Network), a novel framework that leverages\nthe power of Personalized Graph Neural Network (GNN) coupled with fairness\nconsiderations. Additionally, we use differential privacy techniques to fortify\nprivacy protection. Experimental evaluation on three publicly available\ndatasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47%\n- 99% compared to the state-of-the-art while preserving privacy and maintaining\nthe utility. The results validate the significance of our framework in\nachieving equitable and personalized recommendations using GNN within the FL\nlandscape.\n","authors":["Nimesh Agrawal","Anuj Kumar Sirohi"," Jayadeva","Sandeep Kumar"],"pdf_url":"https://arxiv.org/pdf/2312.10080v2.pdf","comment":"To appear as a full paper in AAAI 2024"},{"id":"http://arxiv.org/abs/2311.04590v2","updated":"2023-12-20T11:22:05Z","published":"2023-11-08T10:44:20Z","title":"Rethinking Cross-Domain Sequential Recommendation under Open-World\n Assumptions","summary":" Cross-Domain Sequential Recommendation (CDSR) methods aim to tackle the data\nsparsity and cold-start problems present in Single-Domain Sequential\nRecommendation (SDSR). Existing CDSR works design their elaborate structures\nrelying on overlapping users to propagate the cross-domain information.\nHowever, current CDSR methods make closed-world assumptions, assuming fully\noverlapping users across multiple domains and that the data distribution\nremains unchanged from the training environment to the test environment. As a\nresult, these methods typically result in lower performance on online\nreal-world platforms due to the data distribution shifts. To address these\nchallenges under open-world assumptions, we design an \\textbf{A}daptive\n\\textbf{M}ulti-\\textbf{I}nterest \\textbf{D}ebiasing framework for cross-domain\nsequential recommendation (\\textbf{AMID}), which consists of a multi-interest\ninformation module (\\textbf{MIM}) and a doubly robust estimator (\\textbf{DRE}).\nOur framework is adaptive for open-world environments and can improve the model\nof most off-the-shelf single-domain sequential backbone models for CDSR. Our\nMIM establishes interest groups that consider both overlapping and\nnon-overlapping users, allowing us to effectively explore user intent and\nexplicit interest. To alleviate biases across multiple domains, we developed\nthe DRE for the CDSR methods. We also provide a theoretical analysis that\ndemonstrates the superiority of our proposed estimator in terms of bias and\ntail bound, compared to the IPS estimator used in previous work.\n","authors":["Wujiang Xu","Qitian Wu","Runzhong Wang","Mingming Ha","Qiongxu Ma","Linxun Chen","Bing Han","Junchi Yan"],"pdf_url":"https://arxiv.org/pdf/2311.04590v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10885v2","updated":"2023-12-20T10:05:44Z","published":"2023-12-18T02:18:33Z","title":"A novel diffusion recommendation algorithm based on multi-scale cnn and\n residual lstm","summary":" Sequential recommendation aims to infer user preferences from historical\ninteraction sequences and predict the next item that users may be interested in\nthe future. The current mainstream design approach is to represent items as\nfixed vectors, capturing the underlying relationships between items and user\npreferences based on the order of interactions. However, relying on a single\nfixed-item embedding may weaken the modeling capability of the system, and the\nglobal dynamics and local saliency exhibited by user preferences need to be\ndistinguished. To address these issues, this paper proposes a novel diffusion\nrecommendation algorithm based on multi-scale cnn and residual lstm (AREAL). We\nintroduce diffusion models into the recommend system, representing items as\nprobability distributions instead of fixed vectors. This approach enables\nadaptive reflection of multiple aspects of the items and generates item\ndistributions in a denoising manner. We use multi-scale cnn and residual lstm\nmethods to extract the local and global dependency features of user history\ninteractions, and use attention mechanism to distinguish weights as the guide\nfeatures of reverse diffusion recovery. The effectiveness of the proposed\nmethod is validated through experiments conducted on two real-world datasets.\nSpecifically, AREAL obtains improvements over the best baselines by 2.63% and\n4.25% in terms of HR@20 and 5.05% and 3.94% in terms of NDCG@20 on all\ndatasets.\n","authors":["Yong Niu","Xing Xing","Zhichun Jia","Ruidi Liu","Mindong Xin"],"pdf_url":"https://arxiv.org/pdf/2312.10885v2.pdf","comment":"This paper needs to be further modified, including the ablation\n experiment, model framework and other information in Chapter 5. There are\n some inaccuracies in the presentation of this paper. Two datasets are used\n instead of three, and there are many inaccuracies in the presentation, which\n need to be further corrected"},{"id":"http://arxiv.org/abs/2312.12882v1","updated":"2023-12-20T09:46:42Z","published":"2023-12-20T09:46:42Z","title":"BSL: Understanding and Improving Softmax Loss for Recommendation","summary":" Loss functions steer the optimization direction of recommendation models and\nare critical to model performance, but have received relatively little\nattention in recent recommendation research. Among various losses, we find\nSoftmax loss (SL) stands out for not only achieving remarkable accuracy but\nalso better robustness and fairness. Nevertheless, the current literature lacks\na comprehensive explanation for the efficacy of SL. Toward addressing this\nresearch gap, we conduct theoretical analyses on SL and uncover three insights:\n1) Optimizing SL is equivalent to performing Distributionally Robust\nOptimization (DRO) on the negative data, thereby learning against perturbations\non the negative distribution and yielding robustness to noisy negatives. 2)\nComparing with other loss functions, SL implicitly penalizes the prediction\nvariance, resulting in a smaller gap between predicted values and and thus\nproducing fairer results. Building on these insights, we further propose a\nnovel loss function Bilateral SoftMax Loss (BSL) that extends the advantage of\nSL to both positive and negative sides. BSL augments SL by applying the same\nLog-Expectation-Exp structure to positive examples as is used for negatives,\nmaking the model robust to the noisy positives as well. Remarkably, BSL is\nsimple and easy-to-implement -- requiring just one additional line of code\ncompared to SL. Experiments on four real-world datasets and three\nrepresentative backbones demonstrate the effectiveness of our proposal. The\ncode is available at https://github.com/junkangwu/BSL\n","authors":["Junkang Wu","Jiawei Chen","Jiancan Wu","Wentao Shi","Jizhi Zhang","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16716v2","updated":"2023-12-20T07:23:14Z","published":"2023-11-28T12:00:06Z","title":"GraphPro: Graph Pre-training and Prompt Learning for Recommendation","summary":" GNN-based recommenders have excelled in modeling intricate user-item\ninteractions through multi-hop message passing. However, existing methods often\noverlook the dynamic nature of evolving user-item interactions, which impedes\nthe adaption to changing user preferences and distribution shifts in newly\narriving data. Thus, their scalability and performances in real-world dynamic\nenvironments are limited. In this study, we propose GraphPro, a framework that\nincorporates parameter-efficient and dynamic graph pre-training with prompt\nlearning. This novel combination empowers GNNs to effectively capture both\nlong-term user preferences and short-term behavior dynamics, enabling the\ndelivery of accurate and timely recommendations. Our GraphPro framework\naddresses the challenge of evolving user preferences by seamlessly integrating\na temporal prompt mechanism and a graph-structural prompt learning mechanism\ninto the pre-trained GNN model. The temporal prompt mechanism encodes time\ninformation on user-item interaction, allowing the model to naturally capture\ntemporal context, while the graph-structural prompt learning mechanism enables\nthe transfer of pre-trained knowledge to adapt to behavior dynamics without the\nneed for continuous incremental training. We further bring in a dynamic\nevaluation setting for recommendation to mimic real-world dynamic scenarios and\nbridge the offline-online gap to a better level. Our extensive experiments\nincluding a large-scale industrial deployment showcases the lightweight plug-in\nscalability of our GraphPro when integrated with various state-of-the-art\nrecommenders, emphasizing the advantages of GraphPro in terms of effectiveness,\nrobustness and efficiency.\n","authors":["Yuhao Yang","Lianghao Xia","Da Luo","Kangyi Lin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2311.16716v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.15563v2","updated":"2023-12-20T06:18:19Z","published":"2022-10-27T15:53:38Z","title":"Multimodal Transformer Distillation for Audio-Visual Synchronization","summary":" Audio-visual synchronization aims to determine whether the mouth movements\nand speech in the video are synchronized. VocaLiST reaches state-of-the-art\nperformance by incorporating multimodal Transformers to model audio-visual\ninteract information. However, it requires high computing resources, making it\nimpractical for real-world applications. This paper proposed an MTDVocaLiST\nmodel, which is trained by our proposed multimodal Transformer distillation\n(MTD) loss. MTD loss enables MTDVocaLiST model to deeply mimic the\ncross-attention distribution and value-relation in the Transformer of VocaLiST.\nAdditionally, we harness uncertainty weighting to fully exploit the interaction\ninformation across all layers. Our proposed method is effective in two aspects:\nFrom the distillation method perspective, MTD loss outperforms other strong\ndistillation baselines. From the distilled model's performance perspective: 1)\nMTDVocaLiST outperforms similar-size SOTA models, SyncNet, and Perfect Match\nmodels by 15.65% and 3.35%; 2) MTDVocaLiST reduces the model size of VocaLiST\nby 83.52%, yet still maintaining similar performance.\n","authors":["Xuanjun Chen","Haibin Wu","Chung-Che Wang","Hung-yi Lee","Jyh-Shing Roger Jang"],"pdf_url":"https://arxiv.org/pdf/2210.15563v2.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2310.14884v3","updated":"2023-12-20T05:30:06Z","published":"2023-10-23T12:53:22Z","title":"Budgeted Embedding Table For Recommender Systems","summary":" At the heart of contemporary recommender systems (RSs) are latent factor\nmodels that provide quality recommendation experience to users. These models\nuse embedding vectors, which are typically of a uniform and fixed size, to\nrepresent users and items. As the number of users and items continues to grow,\nthis design becomes inefficient and hard to scale. Recent lightweight embedding\nmethods have enabled different users and items to have diverse embedding sizes,\nbut are commonly subject to two major drawbacks. Firstly, they limit the\nembedding size search to optimizing a heuristic balancing the recommendation\nquality and the memory complexity, where the trade-off coefficient needs to be\nmanually tuned for every memory budget requested. The implicitly enforced\nmemory complexity term can even fail to cap the parameter usage, making the\nresultant embedding table fail to meet the memory budget strictly. Secondly,\nmost solutions, especially reinforcement learning based ones derive and\noptimize the embedding size for each each user/item on an instance-by-instance\nbasis, which impedes the search efficiency. In this paper, we propose Budgeted\nEmbedding Table (BET), a novel method that generates table-level actions (i.e.,\nembedding sizes for all users and items) that is guaranteed to meet\npre-specified memory budgets. Furthermore, by leveraging a set-based action\nformulation and engaging set representation learning, we present an innovative\naction search strategy powered by an action fitness predictor that efficiently\nevaluates each table-level action. Experiments have shown state-of-the-art\nperformance on two real-world datasets when BET is paired with three popular\nrecommender models under different memory budgets.\n","authors":["Yunke Qu","Tong Chen","Quoc Viet Hung Nguyen","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2310.14884v3.pdf","comment":"Accepted by WSDM 2024"},{"id":"http://arxiv.org/abs/2312.12750v1","updated":"2023-12-20T04:05:21Z","published":"2023-12-20T04:05:21Z","title":"Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems","summary":" \"Creativity is the heart and soul of advertising services\". Effective\ncreatives can create a win-win scenario: advertisers can reach target users and\nachieve marketing objectives more effectively, users can more quickly find\nproducts of interest, and platforms can generate more advertising revenue. With\nthe advent of AI-Generated Content, advertisers now can produce vast amounts of\ncreative content at a minimal cost. The current challenge lies in how\nadvertising systems can select the most pertinent creative in real-time for\neach user personally. Existing methods typically perform serial ranking of ads\nor creatives, limiting the creative module in terms of both effectiveness and\nefficiency. In this paper, we propose for the first time a novel architecture\nfor online parallel estimation of ads and creatives ranking, as well as the\ncorresponding offline joint optimization model. The online architecture enables\nsophisticated personalized creative modeling while reducing overall latency.\nThe offline joint model for CTR estimation allows mutual awareness and\ncollaborative optimization between ads and creatives. Additionally, we optimize\nthe offline evaluation metrics for the implicit feedback sorting task involved\nin ad creative ranking. We conduct extensive experiments to compare ours with\ntwo state-of-the-art approaches. The results demonstrate the effectiveness of\nour approach in both offline evaluations and real-world advertising platforms\nonline in terms of response time, CTR, and CPM.\n","authors":["Zhiguang Yang","Lu Wang","Chun Gan","Liufang Sang","Haoran Wang","Wenlong Chen","Jie He","Changping Peng","Zhangang Lin","Jingping Shao"],"pdf_url":"https://arxiv.org/pdf/2312.12750v1.pdf","comment":"9 pages, 4 figures, AAAI2024"},{"id":"http://arxiv.org/abs/2312.12430v2","updated":"2023-12-20T03:33:54Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Jun Yang","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12740v1","updated":"2023-12-20T03:21:48Z","published":"2023-12-20T03:21:48Z","title":"Fine-tuning Large Language Models for Adaptive Machine Translation","summary":" This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose\nlarge language model (LLM), for adaptive machine translation (MT). The\nfine-tuning process involves utilising a combination of zero-shot and one-shot\ntranslation prompts within the medical domain. The primary objective is to\nenhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt\ntranslations to the required domain at inference time. The results,\nparticularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned\nmodel, demonstrating quality improvements in both zero-shot and one-shot\ntranslation scenarios, surpassing Mistral 7B's baseline performance. Notably,\nthe fine-tuned Mistral outperforms ChatGPT \"gpt-3.5-turbo\" in zero-shot\ntranslation while achieving comparable one-shot translation quality. Moreover,\nthe zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's\nperformance, and its one-shot translation quality surpasses that of NLLB 3.3B.\nThese findings emphasise the significance of fine-tuning efficient LLMs like\nMistral 7B to yield high-quality zero-shot translations comparable to\ntask-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved\nin one-shot translation are comparable to those of commercial LLMs such as\nChatGPT. Our experiments demonstrate that, with a relatively small dataset of\n20,000 segments that incorporate a mix of zero-shot and one-shot prompts,\nfine-tuning significantly enhances Mistral's in-context learning ability,\nespecially for real-time adaptive MT.\n","authors":["Yasmin Moslem","Rejwanul Haque","Andy Way"],"pdf_url":"https://arxiv.org/pdf/2312.12740v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12728v1","updated":"2023-12-20T02:55:15Z","published":"2023-12-20T02:55:15Z","title":"Lookahead: An Inference Acceleration Framework for Large Language Model\n with Lossless Generation Accuracy","summary":" As Large Language Models (LLMs) have made significant advancements across\nvarious tasks, such as question answering, translation, text summarization, and\ndialogue systems, the need for accuracy in information becomes crucial,\nespecially for serious financial products serving billions of users like\nAlipay. To address this, Alipay has developed a Retrieval-Augmented Generation\n(RAG) system that grounds LLMs on the most accurate and up-to-date information.\nHowever, for a real-world product serving millions of users, the inference\nspeed of LLMs becomes a critical factor compared to a mere experimental model.\n Hence, this paper presents a generic framework for accelerating the inference\nprocess, resulting in a substantial increase in speed and cost reduction for\nour RAG system, with lossless generation accuracy. In the traditional inference\nprocess, each token is generated sequentially by the LLM, leading to a time\nconsumption proportional to the number of generated tokens. To enhance this\nprocess, our framework, named \\textit{lookahead}, introduces a\n\\textit{multi-branch} strategy. Instead of generating a single token at a time,\nwe propose a \\textit{Trie-based Retrieval} (TR) process that enables the\ngeneration of multiple branches simultaneously, each of which is a sequence of\ntokens. Subsequently, for each branch, a \\textit{Verification and Accept} (VA)\nprocess is performed to identify the longest correct sub-sequence as the final\noutput. Our strategy offers two distinct advantages: (1) it guarantees absolute\ncorrectness of the output, avoiding any approximation algorithms, and (2) the\nworst-case performance of our approach is equivalent to the conventional\nprocess. We conduct extensive experiments to demonstrate the significant\nimprovements achieved by applying our inference acceleration framework.\n","authors":["Yao Zhao","Zhitian Xie","Chenyi Zhuang","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2312.12728v1.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2312.12672v1","updated":"2023-12-20T00:07:43Z","published":"2023-12-20T00:07:43Z","title":"Categorical, Ratio, and Professorial Data: The Case for Reciprocal Rank","summary":" Search engine results pages are usually abstracted as binary relevance\nvectors and hence are categorical data, meaning that only a limited set of\noperations is permitted, most notably tabulation of occurrence frequencies,\nwith determination of medians and averages not possible. To compare retrieval\nsystems it is thus usual to make use of a categorical-to-numeric effectiveness\nmapping. A previous paper has argued that any desired categorical-to-numeric\nmapping may be used, provided only that there is an argued connection between\neach category of SERP and the score that is assigned to that category by the\nmapping. Further, once that plausible connection has been established, then the\nmapped values can be treated as real-valued observations on a ratio scale,\nallowing the computation of averages. This article is written in support of\nthat point of view, and to respond to ongoing claims that SERP scores may only\nbe averaged if very restrictive conditions are imposed on the effectiveness\nmapping.\n","authors":["Alistair Moffat"],"pdf_url":"https://arxiv.org/pdf/2312.12672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13473v1","updated":"2023-12-20T22:48:38Z","published":"2023-12-20T22:48:38Z","title":"Accuracy vs Memory Advantage in the Quantum Simulation of Stochastic\n Processes","summary":" Many inference scenarios rely on extracting relevant information from known\ndata in order to make future predictions. When the underlying stochastic\nprocess satisfies certain assumptions, there is a direct mapping between its\nexact classical and quantum simulators, with the latter asymptotically using\nless memory. Here we focus on studying whether such quantum advantage persists\nwhen those assumptions are not satisfied, and the model is doomed to have\nimperfect accuracy. By studying the trade-off between accuracy and memory\nrequirements, we show that quantum models can reach the same accuracy with less\nmemory, or alternatively, better accuracy with the same memory. Finally, we\ndiscuss the implications of this result for learning tasks.\n","authors":["Leonardo Banchi"],"pdf_url":"https://arxiv.org/pdf/2312.13473v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13434v1","updated":"2023-12-20T21:20:23Z","published":"2023-12-20T21:20:23Z","title":"Zero-1-to-3: Domain-level Zero-shot Cognitive Diagnosis via One Batch of\n Early-bird Students towards Three Diagnostic Objectives","summary":" Cognitive diagnosis seeks to estimate the cognitive states of students by\nexploring their logged practice quiz data. It plays a pivotal role in\npersonalized learning guidance within intelligent education systems. In this\npaper, we focus on an important, practical, yet often underexplored task:\ndomain-level zero-shot cognitive diagnosis (DZCD), which arises due to the\nabsence of student practice logs in newly launched domains. Recent cross-domain\ndiagnostic models have been demonstrated to be a promising strategy for DZCD.\nThese methods primarily focus on how to transfer student states across domains.\nHowever, they might inadvertently incorporate non-transferable information into\nstudent representations, thereby limiting the efficacy of knowledge transfer.\nTo tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive\ndiagnosis framework via one batch of early-bird students towards three\ndiagnostic objectives. Our approach initiates with pre-training a diagnosis\nmodel with dual regularizers, which decouples student states into domain-shared\nand domain-specific parts. The shared cognitive signals can be transferred to\nthe target domain, enriching the cognitive priors for the new domain, which\nensures the cognitive state propagation objective. Subsequently, we devise a\nstrategy to generate simulated practice logs for cold-start students through\nanalyzing the behavioral patterns from early-bird students, fulfilling the\ndomain-adaption goal. Consequently, we refine the cognitive states of\ncold-start students as diagnostic outcomes via virtual data, aligning with the\ndiagnosis-oriented goal. Finally, extensive experiments on six real-world\ndatasets highlight the efficacy of our model for DZCD and its practical\napplication in question recommendation.\n","authors":["Weibo Gao","Qi Liu","Hao Wang","Linan Yue","Haoyang Bi","Yin Gu","Fangzhou Yao","Zheng Zhangm Xin Li","Yuanjing He"],"pdf_url":"https://arxiv.org/pdf/2312.13434v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.13423v1","updated":"2023-12-20T21:02:09Z","published":"2023-12-20T21:02:09Z","title":"VADIS -- a VAriable Detection, Interlinking and Summarization system","summary":" The VADIS system addresses the demand of providing enhanced information\naccess in the domain of the social sciences. This is achieved by allowing users\nto search and use survey variables in context of their underlying research data\nand scholarly publications which have been interlinked with each other.\n","authors":["Yavuz Selim Kartal","Muhammad Ahsan Shahid","Sotaro Takeshita","Tornike Tsereteli","Andrea Zielinski","Benjamin Zapilko","Philipp Mayr"],"pdf_url":"https://arxiv.org/pdf/2312.13423v1.pdf","comment":"It is 4 pages and 2 figures. This paper has recently been accepted by\n ECIR 2024 Demo Track and this version is the camera-ready version of the\n paper"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2303.16521v2","updated":"2023-12-20T18:56:55Z","published":"2023-03-29T08:23:26Z","title":"Hard Regularization to Prevent Deep Online Clustering Collapse without\n Data Augmentation","summary":" Online deep clustering refers to the joint use of a feature extraction\nnetwork and a clustering model to assign cluster labels to each new data point\nor batch as it is processed. While faster and more versatile than offline\nmethods, online clustering can easily reach the collapsed solution where the\nencoder maps all inputs to the same point and all are put into a single\ncluster. Successful existing models have employed various techniques to avoid\nthis problem, most of which require data augmentation or which aim to make the\naverage soft assignment across the dataset the same for each cluster. We\npropose a method that does not require data augmentation, and that, differently\nfrom existing methods, regularizes the hard assignments. Using a Bayesian\nframework, we derive an intuitive optimization objective that can be\nstraightforwardly included in the training of the encoder network. Tested on\nfour image datasets and one human-activity recognition dataset, it consistently\navoids collapse more robustly than other methods and leads to more accurate\nclustering. We also conduct further experiments and analyses justifying our\nchoice to regularize the hard cluster assignments. Code is available at\nhttps://github.com/Lou1sM/online_hard_clustering.\n","authors":["Louis Mahon","Thomas Lukasiewicz"],"pdf_url":"https://arxiv.org/pdf/2303.16521v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15296v3","updated":"2023-12-20T18:52:00Z","published":"2023-05-24T16:22:18Z","title":"MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal\n Image Generation","summary":" The recent popularity of text-to-image diffusion models (DM) can largely be\nattributed to the intuitive interface they provide to users. The intended\ngeneration can be expressed in natural language, with the model producing\nfaithful interpretations of text prompts. However, expressing complex or\nnuanced ideas in text alone can be difficult. To ease image generation, we\npropose MultiFusion that allows one to express complex and nuanced concepts\nwith arbitrarily interleaved inputs of multiple modalities and languages.\nMutliFusion leverages pre-trained models and aligns them for integration into a\ncohesive system, thereby avoiding the need for extensive training from scratch.\nOur experimental results demonstrate the efficient transfer of capabilities\nfrom individual modules to the downstream model. Specifically, the fusion of\nall independent components allows the image generation module to utilize\nmultilingual, interleaved multimodal inputs despite being trained solely on\nmonomodal data in a single language.\n","authors":["Marco Bellagente","Manuel Brack","Hannah Teufel","Felix Friedrich","Björn Deiseroth","Constantin Eichenberg","Andrew Dai","Robert Baldock","Souradeep Nanda","Koen Oostermeijer","Andres Felipe Cruz-Salinas","Patrick Schramowski","Kristian Kersting","Samuel Weinbach"],"pdf_url":"https://arxiv.org/pdf/2305.15296v3.pdf","comment":"Proceedings of Advances in Neural Information Processing Systems:\n Annual Conference on Neural Information Processing Systems (NeurIPS)"},{"id":"http://arxiv.org/abs/2312.13264v1","updated":"2023-12-20T18:41:44Z","published":"2023-12-20T18:41:44Z","title":"dIR -- Discrete Information Retrieval: Conversational Search over\n Unstructured (and Structured) Data with Large Language Models","summary":" Data is stored in both structured and unstructured form. Querying both, to\npower natural language conversations, is a challenge. This paper introduces\ndIR, Discrete Information Retrieval, providing a unified interface to query\nboth free text and structured knowledge. Specifically, a Large Language Model\n(LLM) transforms text into expressive representation. After the text is\nextracted into columnar form, it can then be queried via a text-to-SQL Semantic\nParser, with an LLM converting natural language into SQL. Where desired, such\nconversation may be effected by a multi-step reasoning conversational agent. We\nvalidate our approach via a proprietary question/answer data set, concluding\nthat dIR makes a whole new class of queries on free text possible when compared\nto traditionally fine-tuned dense-embedding-model-based Information Retrieval\n(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR\ncan succeed where no other method stands a chance.\n","authors":["Pablo M. Rodriguez Bertorello","Jean Rodmond Junior Laguerre"],"pdf_url":"https://arxiv.org/pdf/2312.13264v1.pdf","comment":"8 pages, 5 figures, Association for Computational Linguistics"},{"id":"http://arxiv.org/abs/2312.13259v1","updated":"2023-12-20T18:36:05Z","published":"2023-12-20T18:36:05Z","title":"A note on regularised NTK dynamics with an application to PAC-Bayesian\n training","summary":" We establish explicit dynamics for neural networks whose training objective\nhas a regularising term that constrains the parameters to remain close to their\ninitial value. This keeps the network in a lazy training regime, where the\ndynamics can be linearised around the initialisation. The standard neural\ntangent kernel (NTK) governs the evolution during the training in the\ninfinite-width limit, although the regularisation yields an additional term\nappears in the differential equation describing the dynamics. This setting\nprovides an appropriate framework to study the evolution of wide networks\ntrained to optimise generalisation objectives such as PAC-Bayes bounds, and\nhence potentially contribute to a deeper theoretical understanding of such\nnetworks.\n","authors":["Eugenio Clerico","Benjamin Guedj"],"pdf_url":"https://arxiv.org/pdf/2312.13259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13253v1","updated":"2023-12-20T18:27:53Z","published":"2023-12-20T18:27:53Z","title":"Conditional Image Generation with Pretrained Generative Model","summary":" In recent years, diffusion models have gained popularity for their ability to\ngenerate higher-quality images in comparison to GAN models. However, like any\nother large generative models, these models require a huge amount of data,\ncomputational resources, and meticulous tuning for successful training. This\nposes a significant challenge, rendering it infeasible for most individuals. As\na result, the research community has devised methods to leverage pre-trained\nunconditional diffusion models with additional guidance for the purpose of\nconditional image generative. These methods enable conditional image\ngenerations on diverse inputs and, most importantly, circumvent the need for\ntraining the diffusion model. In this paper, our objective is to reduce the\ntime-required and computational overhead introduced by the addition of guidance\nin diffusion models -- while maintaining comparable image quality. We propose a\nset of methods based on our empirical analysis, demonstrating a reduction in\ncomputation time by approximately threefold.\n","authors":["Rajesh Shrestha","Bowen Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13253v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13250v1","updated":"2023-12-20T18:25:15Z","published":"2023-12-20T18:25:15Z","title":"The role of data embedding in equivariant quantum convolutional neural\n networks","summary":" Geometric deep learning refers to the scenario in which the symmetries of a\ndataset are used to constrain the parameter space of a neural network and thus,\nimprove their trainability and generalization. Recently this idea has been\nincorporated into the field of quantum machine learning, which has given rise\nto equivariant quantum neural networks (EQNNs). In this work, we investigate\nthe role of classical-to-quantum embedding on the performance of equivariant\nquantum convolutional neural networks (EQCNNs) for the classification of\nimages. We discuss the connection between the data embedding method and the\nresulting representation of a symmetry group and analyze how changing\nrepresentation affects the expressibility of an EQCNN. We numerically compare\nthe classification accuracy of EQCNNs with three different basis-permuted\namplitude embeddings to the one obtained from a non-equivariant quantum\nconvolutional neural network (QCNN). Our results show that all the EQCNNs\nachieve higher classification accuracy than the non-equivariant QCNN for small\nnumbers of training iterations, while for large iterations this improvement\ncrucially depends on the used embedding. It is expected that the results of\nthis work can be useful to the community for a better understanding of the\nimportance of data embedding choice in the context of geometric quantum machine\nlearning.\n","authors":["Sreetama Das","Stefano Martina","Filippo Caruso"],"pdf_url":"https://arxiv.org/pdf/2312.13250v1.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.13247v1","updated":"2023-12-20T18:22:49Z","published":"2023-12-20T18:22:49Z","title":"Enhancing Neural Training via a Correlated Dynamics Model","summary":" As neural networks grow in scale, their training becomes both computationally\ndemanding and rich in dynamics. Amidst the flourishing interest in these\ntraining dynamics, we present a novel observation: Parameters during training\nexhibit intrinsic correlations over time. Capitalizing on this, we introduce\nCorrelation Mode Decomposition (CMD). This algorithm clusters the parameter\nspace into groups, termed modes, that display synchronized behavior across\nepochs. This enables CMD to efficiently represent the training dynamics of\ncomplex networks, like ResNets and Transformers, using only a few modes.\nMoreover, test set generalization is enhanced. We introduce an efficient CMD\nvariant, designed to run concurrently with training. Our experiments indicate\nthat CMD surpasses the state-of-the-art method for compactly modeled dynamics\non image classification. Our modeling can improve training efficiency and lower\ncommunication overhead, as shown by our preliminary experiments in the context\nof federated learning.\n","authors":["Jonathan Brokman","Roy Betser","Rotem Turjeman","Tom Berkov","Ido Cohen","Guy Gilboa"],"pdf_url":"https://arxiv.org/pdf/2312.13247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07811v2","updated":"2023-12-20T18:09:29Z","published":"2023-10-11T18:50:25Z","title":"Online RL in Linearly $q^π$-Realizable MDPs Is as Easy as in Linear\n MDPs If You Learn What to Ignore","summary":" We consider online reinforcement learning (RL) in episodic Markov decision\nprocesses (MDPs) under the linear $q^\\pi$-realizability assumption, where it is\nassumed that the action-values of all policies can be expressed as linear\nfunctions of state-action features. This class is known to be more general than\nlinear MDPs, where the transition kernel and the reward function are assumed to\nbe linear functions of the feature vectors. As our first contribution, we show\nthat the difference between the two classes is the presence of states in\nlinearly $q^\\pi$-realizable MDPs where for any policy, all the actions have\napproximately equal values, and skipping over these states by following an\narbitrarily fixed policy in those states transforms the problem to a linear\nMDP. Based on this observation, we derive a novel (computationally inefficient)\nlearning algorithm for linearly $q^\\pi$-realizable MDPs that simultaneously\nlearns what states should be skipped over and runs another learning algorithm\non the linear MDP hidden in the problem. The method returns an\n$\\epsilon$-optimal policy after $\\text{polylog}(H, d)/\\epsilon^2$ interactions\nwith the MDP, where $H$ is the time horizon and $d$ is the dimension of the\nfeature vectors, giving the first polynomial-sample-complexity online RL\nalgorithm for this setting. The results are proved for the misspecified case,\nwhere the sample complexity is shown to degrade gracefully with the\nmisspecification error.\n","authors":["Gellért Weisz","András György","Csaba Szepesvári"],"pdf_url":"https://arxiv.org/pdf/2310.07811v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13236v1","updated":"2023-12-20T18:00:16Z","published":"2023-12-20T18:00:16Z","title":"Diffusion Models With Learned Adaptive Noise","summary":" Diffusion models have gained traction as powerful algorithms for synthesizing\nhigh-quality images. Central to these algorithms is the diffusion process,\nwhich maps data to noise according to equations inspired by thermodynamics and\ncan significantly impact performance. A widely held assumption is that the ELBO\nobjective of a diffusion model is invariant to the noise process (Kingma et\nal.,2021). In this work, we dispel this assumption -- we propose multivariate\nlearned adaptive noise (MuLAN), a learned diffusion process that applies\nGaussian noise at different rates across an image. Our method consists of three\ncomponents -- a multivariate noise schedule, instance-conditional diffusion,\nand auxiliary variables -- which ensure that the learning objective is no\nlonger invariant to the choice of the noise schedule as in previous works. Our\nwork is grounded in Bayesian inference and casts the learned diffusion process\nas an approximate variational posterior that yields a tighter lower bound on\nmarginal likelihood. Empirically, MuLAN sets a new state-of-the-art in density\nestimation on CIFAR-10 and ImageNet compared to classical diffusion. Code is\navailable at https://github.com/s-sahoo/MuLAN\n","authors":["Subham Sekhar Sahoo","Aaron Gokaslan","Chris De Sa","Volodymyr Kuleshov"],"pdf_url":"https://arxiv.org/pdf/2312.13236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13234v1","updated":"2023-12-20T17:59:11Z","published":"2023-12-20T17:59:11Z","title":"Position Paper: Bridging the Gap Between Machine Learning and\n Sensitivity Analysis","summary":" We argue that interpretations of machine learning (ML) models or the\nmodel-building process can bee seen as a form of sensitivity analysis (SA), a\ngeneral methodology used to explain complex systems in many fields such as\nenvironmental modeling, engineering, or economics. We address both researchers\nand practitioners, calling attention to the benefits of a unified SA-based view\nof explanations in ML and the necessity to fully credit related work. We bridge\nthe gap between both fields by formally describing how (a) the ML process is a\nsystem suitable for SA, (b) how existing ML interpretation methods relate to\nthis perspective, and (c) how other SA techniques could be applied to ML.\n","authors":["Christian A. Scholbeck","Julia Moosbauer","Giuseppe Casalicchio","Hoshin Gupta","Bernd Bischl","Christian Heumann"],"pdf_url":"https://arxiv.org/pdf/2312.13234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16984v2","updated":"2023-12-20T17:44:40Z","published":"2023-11-28T17:35:38Z","title":"FedECA: A Federated External Control Arm Method for Causal Inference\n with Time-To-Event Data in Distributed Settings","summary":" External control arms (ECA) can inform the early clinical development of\nexperimental drugs and provide efficacy evidence for regulatory approval in\nnon-randomized settings. However, the main challenge of implementing ECA lies\nin accessing real-world data or historical clinical trials. Indeed, data\nsharing is often not feasible due to privacy considerations related to data\nleaving the original collection centers, along with pharmaceutical companies'\ncompetitive motives. In this paper, we leverage a privacy-enhancing technology\ncalled federated learning (FL) to remove some of the barriers to data sharing.\nWe introduce a federated learning inverse probability of treatment weighted\n(IPTW) method for time-to-event outcomes called FedECA which eases the\nimplementation of ECA by limiting patients' data exposure. We show with\nextensive experiments that FedECA outperforms its closest competitor,\nmatching-adjusted indirect comparison (MAIC), in terms of statistical power and\nability to balance the treatment and control groups. To encourage the use of\nsuch methods, we publicly release our code which relies on Substra, an\nopen-source FL software with proven experience in privacy-sensitive contexts.\n","authors":["Jean Ogier du Terrail","Quentin Klopfenstein","Honghao Li","Imke Mayer","Nicolas Loiseau","Mohammad Hallal","Félix Balazard","Mathieu Andreux"],"pdf_url":"https://arxiv.org/pdf/2311.16984v2.pdf","comment":"code available at: https://github.com/owkin/fedeca, fixed some typos,\n figures and acknowledgments in v2"},{"id":"http://arxiv.org/abs/2312.13218v1","updated":"2023-12-20T17:36:36Z","published":"2023-12-20T17:36:36Z","title":"FiFAR: A Fraud Detection Dataset for Learning to Defer","summary":" Public dataset limitations have significantly hindered the development and\nbenchmarking of learning to defer (L2D) algorithms, which aim to optimally\ncombine human and AI capabilities in hybrid decision-making systems. In such\nsystems, human availability and domain-specific concerns introduce\ndifficulties, while obtaining human predictions for training and evaluation is\ncostly. Financial fraud detection is a high-stakes setting where algorithms and\nhuman experts often work in tandem; however, there are no publicly available\ndatasets for L2D concerning this important application of human-AI teaming. To\nfill this gap in L2D research, we introduce the Financial Fraud Alert Review\nDataset (FiFAR), a synthetic bank account fraud detection dataset, containing\nthe predictions of a team of 50 highly complex and varied synthetic fraud\nanalysts, with varied bias and feature dependence. We also provide a realistic\ndefinition of human work capacity constraints, an aspect of L2D systems that is\noften overlooked, allowing for extensive testing of assignment systems under\nreal-world conditions. We use our dataset to develop a capacity-aware L2D\nmethod and rejection learning approach under realistic data availability\nconditions, and benchmark these baselines under an array of 300 distinct\ntesting scenarios. We believe that this dataset will serve as a pivotal\ninstrument in facilitating a systematic, rigorous, reproducible, and\ntransparent evaluation and comparison of L2D methods, thereby fostering the\ndevelopment of more synergistic human-AI collaboration in decision-making\nsystems. The public dataset and detailed synthetic expert information are\navailable at: https://github.com/feedzai/fifar-dataset\n","authors":["Jean V. Alves","Diogo Leitão","Sérgio Jesus","Marco O. P. Sampaio","Pedro Saleiro","Mário A. T. Figueiredo","Pedro Bizarro"],"pdf_url":"https://arxiv.org/pdf/2312.13218v1.pdf","comment":"The public dataset and detailed synthetic expert information are\n available at: https://github.com/feedzai/fifar-dataset"},{"id":"http://arxiv.org/abs/2312.13212v1","updated":"2023-12-20T17:28:21Z","published":"2023-12-20T17:28:21Z","title":"A 3D super-resolution of wind fields via physics-informed pixel-wise\n self-attention generative adversarial network","summary":" To mitigate global warming, greenhouse gas sources need to be resolved at a\nhigh spatial resolution and monitored in time to ensure the reduction and\nultimately elimination of the pollution source. However, the complexity of\ncomputation in resolving high-resolution wind fields left the simulations\nimpractical to test different time lengths and model configurations. This study\npresents a preliminary development of a physics-informed super-resolution (SR)\ngenerative adversarial network (GAN) that super-resolves the three-dimensional\n(3D) low-resolution wind fields by upscaling x9 times. We develop a pixel-wise\nself-attention (PWA) module that learns 3D weather dynamics via a\nself-attention computation followed by a 2D convolution. We also employ a loss\nterm that regularizes the self-attention map during pretraining, capturing the\nvertical convection process from input wind data. The new PWA SR-GAN shows the\nhigh-fidelity super-resolved 3D wind data, learns a wind structure at the\nhigh-frequency domain, and reduces the computational cost of a high-resolution\nwind simulation by x89.7 times.\n","authors":["Takuya Kurihana","Kyongmin Yeo","Daniela Szwarcman","Bruce Elmegreen","Karthik Mukkavilli","Johannes Schmude","Levente Klein"],"pdf_url":"https://arxiv.org/pdf/2312.13212v1.pdf","comment":"7 pages, 4 figures, NeurIPS 2023 Workshop: Tackling Climate Change\n with Machine Learning"},{"id":"http://arxiv.org/abs/2306.01266v2","updated":"2023-12-20T17:01:04Z","published":"2023-06-02T04:43:21Z","title":"Self Contrastive Learning for Session-based Recommendation","summary":" Session-based recommendation, which aims to predict the next item of users'\ninterest as per an existing sequence interaction of items, has attracted\ngrowing applications of Contrastive Learning (CL) with improved user and item\nrepresentations. However, these contrastive objectives: (1) serve a similar\nrole as the cross-entropy loss while ignoring the item representation space\noptimisation; and (2) commonly require complicated modelling, including complex\npositive/negative sample constructions and extra data augmentation. In this\nwork, we introduce Self-Contrastive Learning (SCL), which simplifies the\napplication of CL and enhances the performance of state-of-the-art CL-based\nrecommendation techniques. Specifically, SCL is formulated as an objective\nfunction that directly promotes a uniform distribution among item\nrepresentations and efficiently replaces all the existing contrastive objective\ncomponents of state-of-the-art models. Unlike previous works, SCL eliminates\nthe need for any positive/negative sample construction or data augmentation,\nleading to enhanced interpretability of the item representation space and\nfacilitating its extensibility to existing recommender systems. Through\nexperiments on three benchmark datasets, we demonstrate that SCL consistently\nimproves the performance of state-of-the-art models with statistical\nsignificance. Notably, our experiments show that SCL improves the performance\nof two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and\n11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks.\nAdditionally, our analysis elucidates the improvement in terms of alignment and\nuniformity of representations, as well as the effectiveness of SCL with a low\ncomputational cost.\n","authors":["Zhengxiang Shi","Xi Wang","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2306.01266v2.pdf","comment":"ECIR 2024 (Full Paper) Camera-ready Version. Code is available at\n https://github.com/ZhengxiangShi/SelfContrastiveLearningRecSys"},{"id":"http://arxiv.org/abs/2312.13185v1","updated":"2023-12-20T16:54:05Z","published":"2023-12-20T16:54:05Z","title":"Measurement-based quantum computation from Clifford quantum cellular\n automata","summary":" Measurement-based quantum computation (MBQC) is a paradigm for quantum\ncomputation where computation is driven by local measurements on a suitably\nentangled resource state. In this work we show that MBQC is related to a model\nof quantum computation based on Clifford quantum cellular automata (CQCA).\nSpecifically, we show that certain MBQCs can be directly constructed from CQCAs\nwhich yields a simple and intuitive circuit model representation of MBQC in\nterms of quantum computation based on CQCA. We apply this description to\nconstruct various MBQC-based Ans\\\"atze for parameterized quantum circuits,\ndemonstrating that the different Ans\\\"atze may lead to significantly different\nperformances on different learning tasks. In this way, MBQC yields a family of\nHardware-efficient Ans\\\"atze that may be adapted to specific problem settings\nand is particularly well suited for architectures with translationally\ninvariant gates such as neutral atoms.\n","authors":["Hendrik Poulsen Nautrup","Hans J. Briegel"],"pdf_url":"https://arxiv.org/pdf/2312.13185v1.pdf","comment":"16 pages, 12 figures"},{"id":"http://arxiv.org/abs/2206.08615v2","updated":"2023-12-20T16:47:57Z","published":"2022-06-17T08:17:28Z","title":"On the Number of Regions of Piecewise Linear Neural Networks","summary":" Many feedforward neural networks (NNs) generate continuous and\npiecewise-linear (CPWL) mappings. Specifically, they partition the input domain\ninto regions on which the mapping is affine. The number of these so-called\nlinear regions offers a natural metric to characterize the expressiveness of\nCPWL NNs. The precise determination of this quantity is often out of reach in\npractice, and bounds have been proposed for specific architectures, including\nfor ReLU and Maxout NNs. In this work, we generalize these bounds to NNs with\narbitrary and possibly multivariate CPWL activation functions. We first provide\nupper and lower bounds on the maximal number of linear regions of a CPWL NN\ngiven its depth, width, and the number of linear regions of its activation\nfunctions. Our results rely on the combinatorial structure of convex partitions\nand confirm the distinctive role of depth which, on its own, is able to\nexponentially increase the number of regions. We then introduce a complementary\nstochastic framework to estimate the average number of linear regions produced\nby a CPWL NN. Under reasonable assumptions, the expected density of linear\nregions along any 1D path is bounded by the product of depth, width, and a\nmeasure of activation complexity (up to a scaling factor). This yields an\nidentical role to the three sources of expressiveness: no exponential growth\nwith depth is observed anymore.\n","authors":["Alexis Goujon","Arian Etemadi","Michael Unser"],"pdf_url":"https://arxiv.org/pdf/2206.08615v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11517v2","updated":"2023-12-20T16:43:54Z","published":"2023-12-12T19:34:23Z","title":"Unlocking Musculoskeletal Disorder Risk Factors: NLP-Based\n Classification and Mode-Based Ranking","summary":" This research delves into the intricate landscape of Musculoskeletal Disorder\n(MSD) risk factors, employing a novel fusion of Natural Language Processing\n(NLP) techniques and mode-based ranking methodologies. The primary objective is\nto advance the comprehension of MSD risk factors, their classification, and\ntheir relative severity, facilitating more targeted preventive and management\ninterventions. The study utilizes eight diverse models, integrating pre-trained\ntransformers, cosine similarity, and various distance metrics to classify risk\nfactors into personal, biomechanical, workplace, psychological, and\norganizational classes. Key findings reveal that the BERT model with cosine\nsimilarity attains an overall accuracy of 28%, while the sentence transformer,\ncoupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a\nflawless accuracy score of 100%. In tandem with the classification efforts, the\nresearch employs a mode-based ranking approach on survey data to discern the\nseverity hierarchy of MSD risk factors. Intriguingly, the rankings align\nprecisely with the previous literature, reaffirming the consistency and\nreliability of the approach. ``Working posture\" emerges as the most severe risk\nfactor, emphasizing the critical role of proper posture in preventing MSDs. The\ncollective perceptions of survey participants underscore the significance of\nfactors like \"Job insecurity,\" \"Effort reward imbalance,\" and \"Poor employee\nfacility\" in contributing to MSD risks. The convergence of rankings provides\nactionable insights for organizations aiming to reduce the prevalence of MSDs.\nThe study concludes with implications for targeted interventions,\nrecommendations for improving workplace conditions, and avenues for future\nresearch.\n","authors":["Md Abrar Jahin","Subrata Talapatra"],"pdf_url":"https://arxiv.org/pdf/2312.11517v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11534v2","updated":"2023-12-20T16:39:15Z","published":"2023-12-15T17:59:16Z","title":"Improved Differentially Private and Lazy Online Convex Optimization","summary":" We study the task of $(\\epsilon, \\delta)$-differentially private online\nconvex optimization (OCO). In the online setting, the release of each distinct\ndecision or iterate carries with it the potential for privacy loss. This\nproblem has a long history of research starting with Jain et al. [2012] and the\nbest known results for the regime of {\\epsilon} not being very small are\npresented in Agarwal et al. [2023]. In this paper we improve upon the results\nof Agarwal et al. [2023] in terms of the dimension factors as well as removing\nthe requirement of smoothness. Our results are now the best known rates for\nDP-OCO in this regime.\n Our algorithms builds upon the work of [Asi et al., 2023] which introduced\nthe idea of explicitly limiting the number of switches via rejection sampling.\nThe main innovation in our algorithm is the use of sampling from a strongly\nlog-concave density which allows us to trade-off the dimension factors better\nleading to improved results.\n","authors":["Naman Agarwal","Satyen Kale","Karan Singh","Abhradeep Guha Thakurta"],"pdf_url":"https://arxiv.org/pdf/2312.11534v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13173v1","updated":"2023-12-20T16:33:15Z","published":"2023-12-20T16:33:15Z","title":"Learning Fair Policies for Multi-stage Selection Problems from\n Observational Data","summary":" We consider the problem of learning fair policies for multi-stage selection\nproblems from observational data. This problem arises in several high-stakes\ndomains such as company hiring, loan approval, or bail decisions where outcomes\n(e.g., career success, loan repayment, recidivism) are only observed for those\nselected. We propose a multi-stage framework that can be augmented with various\nfairness constraints, such as demographic parity or equal opportunity. This\nproblem is a highly intractable infinite chance-constrained program involving\nthe unknown joint distribution of covariates and outcomes. Motivated by the\npotential impact of selection decisions on people's lives and livelihoods, we\npropose to focus on interpretable linear selection rules. Leveraging tools from\ncausal inference and sample average approximation, we obtain an asymptotically\nconsistent solution to this selection problem by solving a mixed binary conic\noptimization problem, which can be solved using standard off-the-shelf solvers.\nWe conduct extensive computational experiments on a variety of datasets adapted\nfrom the UCI repository on which we show that our proposed approaches can\nachieve an 11.6% improvement in precision and a 38% reduction in the measure of\nunfairness compared to the existing selection policy.\n","authors":["Zhuangzhuang Jia","Grani A. Hanasusanto","Phebe Vayanos","Weijun Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13173v1.pdf","comment":"38th Annual AAAI Conference on Artificial Intelligence, 2024"},{"id":"http://arxiv.org/abs/2209.11144v2","updated":"2023-12-20T16:30:34Z","published":"2022-09-22T16:42:14Z","title":"Automatic and effective discovery of quantum kernels","summary":" Quantum computing can empower machine learning models by enabling kernel\nmachines to leverage quantum kernels for representing similarity measures\nbetween data. Quantum kernels are able to capture relationships in the data\nthat are not efficiently computable on classical devices. However, there is no\nstraightforward method to engineer the optimal quantum kernel for each specific\nuse case. While recent literature has focused on exploiting the potential\noffered by the presence of symmetries in the data to guide the construction of\nquantum kernels, we adopt here a different approach, which employs optimization\ntechniques, similar to those used in neural architecture search and AutoML, to\nautomatically find an optimal kernel in a heuristic manner. The algorithm we\npresent constructs a quantum circuit implementing the similarity measure as a\ncombinatorial object, which is evaluated based on a cost function and is then\niteratively modified using a meta-heuristic optimization technique. The cost\nfunction can encode many criteria ensuring favorable statistical properties of\nthe candidate solution, such as the rank of the Dynamical Lie Algebra.\nImportantly, our approach is independent of the optimization technique\nemployed. The results obtained by testing our approach on a high-energy physics\nproblem demonstrate that, in the best-case scenario, we can either match or\nimprove testing accuracy with respect to the manual design approach, showing\nthe potential of our technique to deliver superior results with reduced effort.\n","authors":["Massimiliano Incudini","Daniele Lizzio Bosco","Francesco Martini","Michele Grossi","Giuseppe Serra","Alessandra Di Pierro"],"pdf_url":"https://arxiv.org/pdf/2209.11144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13155v1","updated":"2023-12-20T16:18:51Z","published":"2023-12-20T16:18:51Z","title":"Gappy local conformal auto-encoders for heterogeneous data fusion: in\n praise of rigidity","summary":" Fusing measurements from multiple, heterogeneous, partial sources, observing\na common object or process, poses challenges due to the increasing availability\nof numbers and types of sensors. In this work we propose, implement and\nvalidate an end-to-end computational pipeline in the form of a\nmultiple-auto-encoder neural network architecture for this task. The inputs to\nthe pipeline are several sets of partial observations, and the result is a\nglobally consistent latent space, harmonizing (rigidifying, fusing) all\nmeasurements. The key enabler is the availability of multiple slightly\nperturbed measurements of each instance:, local measurement, \"bursts\", that\nallows us to estimate the local distortion induced by each instrument. We\ndemonstrate the approach in a sequence of examples, starting with simple\ntwo-dimensional data sets and proceeding to a Wi-Fi localization problem and to\nthe solution of a \"dynamical puzzle\" arising in spatio-temporal observations of\nthe solutions of Partial Differential Equations.\n","authors":["Erez Peterfreund","Iryna Burak","Ofir Lindenbaum","Jim Gimlett","Felix Dietrich","Ronald R. Coifman","Ioannis G. Kevrekidis"],"pdf_url":"https://arxiv.org/pdf/2312.13155v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13152v1","updated":"2023-12-20T16:16:29Z","published":"2023-12-20T16:16:29Z","title":"Neural Stochastic Differential Equations with Change Points: A\n Generative Adversarial Approach","summary":" Stochastic differential equations (SDEs) have been widely used to model real\nworld random phenomena. Existing works mainly focus on the case where the time\nseries is modeled by a single SDE, which might be restrictive for modeling time\nseries with distributional shift. In this work, we propose a change point\ndetection algorithm for time series modeled as neural SDEs. Given a time series\ndataset, the proposed method jointly learns the unknown change points and the\nparameters of distinct neural SDE models corresponding to each change point.\nSpecifically, the SDEs are learned under the framework of generative\nadversarial networks (GANs) and the change points are detected based on the\noutput of the GAN discriminator in a forward pass. At each step of the proposed\nalgorithm, the change points and the SDE model parameters are updated in an\nalternating fashion. Numerical results on both synthetic and real datasets are\nprovided to validate the performance of our algorithm in comparison to\nclassical change point detection benchmarks, standard GAN-based neural SDEs,\nand other state-of-the-art deep generative models for time series data.\n","authors":["Zhongchang Sun","Yousef El-Laham","Svitlana Vyetrenko"],"pdf_url":"https://arxiv.org/pdf/2312.13152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13143v1","updated":"2023-12-20T16:04:02Z","published":"2023-12-20T16:04:02Z","title":"Underwater Acoustic Signal Recognition Based on Salient Features","summary":" With the rapid advancement of technology, the recognition of underwater\nacoustic signals in complex environments has become increasingly crucial.\nCurrently, mainstream underwater acoustic signal recognition relies primarily\non time-frequency analysis to extract spectral features, finding widespread\napplications in the field. However, existing recognition methods heavily depend\non expert systems, facing limitations such as restricted knowledge bases and\nchallenges in handling complex relationships. These limitations stem from the\ncomplexity and maintenance difficulties associated with rules or inference\nengines. Recognizing the potential advantages of deep learning in handling\nintricate relationships, this paper proposes a method utilizing neural networks\nfor underwater acoustic signal recognition. The proposed approach involves\ncontinual learning of features extracted from spectra for the classification of\nunderwater acoustic signals. Deep learning models can automatically learn\nabstract features from data and continually adjust weights during training to\nenhance classification performance.\n","authors":["Minghao Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13143v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10469v2","updated":"2023-12-20T16:02:32Z","published":"2023-12-16T14:59:11Z","title":"One step closer to unbiased aleatoric uncertainty estimation","summary":" Neural networks are powerful tools in various applications, and quantifying\ntheir uncertainty is crucial for reliable decision-making. In the deep learning\nfield, the uncertainties are usually categorized into aleatoric (data) and\nepistemic (model) uncertainty. In this paper, we point out that the existing\npopular variance attenuation method highly overestimates aleatoric uncertainty.\nTo address this issue, we propose a new estimation method by actively\nde-noising the observed data. By conducting a broad range of experiments, we\ndemonstrate that our proposed approach provides a much closer approximation to\nthe actual data uncertainty than the standard method.\n","authors":["Wang Zhang","Ziwen Ma","Subhro Das","Tsui-Wei Weng","Alexandre Megretski","Luca Daniel","Lam M. Nguyen"],"pdf_url":"https://arxiv.org/pdf/2312.10469v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13141v1","updated":"2023-12-20T16:02:25Z","published":"2023-12-20T16:02:25Z","title":"Augment on Manifold: Mixup Regularization with UMAP","summary":" Data augmentation techniques play an important role in enhancing the\nperformance of deep learning models. Despite their proven benefits in computer\nvision tasks, their application in the other domains remains limited. This\npaper proposes a Mixup regularization scheme, referred to as UMAP Mixup,\ndesigned for \"on-manifold\" automated data augmentation for deep learning\npredictive models. The proposed approach ensures that the Mixup operations\nresult in synthesized samples that lie on the data manifold of the features and\nlabels by utilizing a dimensionality reduction technique known as uniform\nmanifold approximation and projection. Evaluations across diverse regression\ntasks show that UMAP Mixup is competitive with or outperforms other Mixup\nvariants, show promise for its potential as an effective tool for enhancing the\ngeneralization performance of deep learning models.\n","authors":["Yousef El-Laham","Elizabeth Fons","Dillon Daudert","Svitlana Vyetrenko"],"pdf_url":"https://arxiv.org/pdf/2312.13141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.13073v2","updated":"2023-12-20T15:58:26Z","published":"2023-11-22T00:26:15Z","title":"FusionFrames: Efficient Architectural Aspects for Text-to-Video\n Generation Pipeline","summary":" Multimedia generation approaches occupy a prominent place in artificial\nintelligence research. Text-to-image models achieved high-quality results over\nthe last few years. However, video synthesis methods recently started to\ndevelop. This paper presents a new two-stage latent diffusion text-to-video\ngeneration architecture based on the text-to-image diffusion model. The first\nstage concerns keyframes synthesis to figure the storyline of a video, while\nthe second one is devoted to interpolation frames generation to make movements\nof the scene and objects smooth. We compare several temporal conditioning\napproaches for keyframes generation. The results show the advantage of using\nseparate temporal blocks over temporal layers in terms of metrics reflecting\nvideo generation quality aspects and human preference. The design of our\ninterpolation model significantly reduces computational costs compared to other\nmasked frame interpolation approaches. Furthermore, we evaluate different\nconfigurations of MoVQ-based video decoding scheme to improve consistency and\nachieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our\npipeline with existing solutions and achieve top-2 scores overall and top-1\namong open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:\nhttps://ai-forever.github.io/kandinsky-video/\n","authors":["Vladimir Arkhipkin","Zein Shaheen","Viacheslav Vasilev","Elizaveta Dakhova","Andrey Kuznetsov","Denis Dimitrov"],"pdf_url":"https://arxiv.org/pdf/2311.13073v2.pdf","comment":"Project page: https://ai-forever.github.io/kandinsky-video/"},{"id":"http://arxiv.org/abs/2312.13136v1","updated":"2023-12-20T15:56:40Z","published":"2023-12-20T15:56:40Z","title":"Molecular Hypergraph Neural Networks","summary":" Graph neural networks (GNNs) have demonstrated promising performance across\nvarious chemistry-related tasks. However, conventional graphs only model the\npairwise connectivity in molecules, failing to adequately represent\nhigher-order connections like multi-center bonds and conjugated structures. To\ntackle this challenge, we introduce molecular hypergraphs and propose Molecular\nHypergraph Neural Networks (MHNN) to predict the optoelectronic properties of\norganic semiconductors, where hyperedges represent conjugated structures. A\ngeneral algorithm is designed for irregular high-order connections, which can\nefficiently operate on molecular hypergraphs with hyperedges of various orders.\nThe results show that MHNN outperforms all baseline models on most tasks of\nOPV, OCELOTv1 and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D\ngeometric information, surpassing the baseline model that utilizes atom\npositions. Moreover, MHNN achieves better performance than pretrained GNNs\nunder limited training data, underscoring its excellent data efficiency. This\nwork provides a new strategy for more general molecular representations and\nproperty prediction tasks related to high-order connections.\n","authors":["Junwu Chen","Philippe Schwaller"],"pdf_url":"https://arxiv.org/pdf/2312.13136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13131v1","updated":"2023-12-20T15:51:46Z","published":"2023-12-20T15:51:46Z","title":"Scaling Compute Is Not All You Need for Adversarial Robustness","summary":" The last six years have witnessed significant progress in adversarially\nrobust deep learning. As evidenced by the CIFAR-10 dataset category in\nRobustBench benchmark, the accuracy under $\\ell_\\infty$ adversarial\nperturbations improved from 44\\% in \\citet{Madry2018Towards} to 71\\% in\n\\citet{peng2023robust}. Although impressive, existing state-of-the-art is still\nfar from satisfactory. It is further observed that best-performing models are\noften very large models adversarially trained by industrial labs with\nsignificant computational budgets. In this paper, we aim to understand: ``how\nmuch longer can computing power drive adversarial robustness advances?\" To\nanswer this question, we derive \\emph{scaling laws for adversarial robustness}\nwhich can be extrapolated in the future to provide an estimate of how much cost\nwe would need to pay to reach a desired level of robustness. We show that\nincreasing the FLOPs needed for adversarial training does not bring as much\nadvantage as it does for standard training in terms of performance\nimprovements. Moreover, we find that some of the top-performing techniques are\ndifficult to exactly reproduce, suggesting that they are not robust enough for\nminor changes in the training setup. Our analysis also uncovers potentially\nworthwhile directions to pursue in future research. Finally, we make our\nbenchmarking framework (built on top of \\texttt{timm}~\\citep{rw2019timm})\npublicly available to facilitate future analysis in efficient robust deep\nlearning.\n","authors":["Edoardo Debenedetti","Zishen Wan","Maksym Andriushchenko","Vikash Sehwag","Kshitij Bhardwaj","Bhavya Kailkhura"],"pdf_url":"https://arxiv.org/pdf/2312.13131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13130v1","updated":"2023-12-20T15:50:16Z","published":"2023-12-20T15:50:16Z","title":"Distribution-Dependent Rates for Multi-Distribution Learning","summary":" To address the needs of modeling uncertainty in sensitive machine learning\napplications, the setup of distributionally robust optimization (DRO) seeks\ngood performance uniformly across a variety of tasks. The recent\nmulti-distribution learning (MDL) framework tackles this objective in a dynamic\ninteraction with the environment, where the learner has sampling access to each\ntarget distribution. Drawing inspiration from the field of pure-exploration\nmulti-armed bandits, we provide distribution-dependent guarantees in the MDL\nregime, that scale with suboptimality gaps and result in superior dependence on\nthe sample size when compared to the existing distribution-independent\nanalyses. We investigate two non-adaptive strategies, uniform and non-uniform\nexploration, and present non-asymptotic regret bounds using novel tools from\nempirical process theory. Furthermore, we devise an adaptive optimistic\nalgorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring\nthe contrast between uniform and optimistic allocation in the multi-armed\nbandit literature.\n","authors":["Rafael Hanashiro","Patrick Jaillet"],"pdf_url":"https://arxiv.org/pdf/2312.13130v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13119v1","updated":"2023-12-20T15:38:59Z","published":"2023-12-20T15:38:59Z","title":"Prometheus: Infrastructure Security Posture Analysis with AI-generated\n Attack Graphs","summary":" The rampant occurrence of cybersecurity breaches imposes substantial\nlimitations on the progress of network infrastructures, leading to compromised\ndata, financial losses, potential harm to individuals, and disruptions in\nessential services. The current security landscape demands the urgent\ndevelopment of a holistic security assessment solution that encompasses\nvulnerability analysis and investigates the potential exploitation of these\nvulnerabilities as attack paths. In this paper, we propose Prometheus, an\nadvanced system designed to provide a detailed analysis of the security posture\nof computing infrastructures. Using user-provided information, such as device\ndetails and software versions, Prometheus performs a comprehensive security\nassessment. This assessment includes identifying associated vulnerabilities and\nconstructing potential attack graphs that adversaries can exploit. Furthermore,\nPrometheus evaluates the exploitability of these attack paths and quantifies\nthe overall security posture through a scoring mechanism. The system takes a\nholistic approach by analyzing security layers encompassing hardware, system,\nnetwork, and cryptography. Furthermore, Prometheus delves into the\ninterconnections between these layers, exploring how vulnerabilities in one\nlayer can be leveraged to exploit vulnerabilities in others. In this paper, we\npresent the end-to-end pipeline implemented in Prometheus, showcasing the\nsystematic approach adopted for conducting this thorough security analysis.\n","authors":["Xin Jin","Charalampos Katsis","Fan Sang","Jiahao Sun","Elisa Bertino","Ramana Rao Kompella","Ashish Kundu"],"pdf_url":"https://arxiv.org/pdf/2312.13119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13118v1","updated":"2023-12-20T15:37:50Z","published":"2023-12-20T15:37:50Z","title":"LRS: Enhancing Adversarial Transferability through Lipschitz Regularized\n Surrogate","summary":" The transferability of adversarial examples is of central importance to\ntransfer-based black-box adversarial attacks. Previous works for generating\ntransferable adversarial examples focus on attacking \\emph{given} pretrained\nsurrogate models while the connections between surrogate models and adversarial\ntrasferability have been overlooked. In this paper, we propose {\\em Lipschitz\nRegularized Surrogate} (LRS) for transfer-based black-box attacks, a novel\napproach that transforms surrogate models towards favorable adversarial\ntransferability. Using such transformed surrogate models, any existing\ntransfer-based black-box attack can run without any change, yet achieving much\nbetter performance. Specifically, we impose Lipschitz regularization on the\nloss landscape of surrogate models to enable a smoother and more controlled\noptimization process for generating more transferable adversarial examples. In\naddition, this paper also sheds light on the connection between the inner\nproperties of surrogate models and adversarial transferability, where three\nfactors are identified: smaller local Lipschitz constant, smoother loss\nlandscape, and stronger adversarial robustness. We evaluate our proposed LRS\napproach by attacking state-of-the-art standard deep neural networks and\ndefense models. The results demonstrate significant improvement on the attack\nsuccess rates and transferability. Our code is available at\nhttps://github.com/TrustAIoT/LRS.\n","authors":["Tao Wu","Tie Luo","Donald C. Wunsch"],"pdf_url":"https://arxiv.org/pdf/2312.13118v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13110v1","updated":"2023-12-20T15:30:15Z","published":"2023-12-20T15:30:15Z","title":"Pre-training of Molecular GNNs as Conditional Boltzmann Generator","summary":" Learning representations of molecular structures using deep learning is a\nfundamental problem in molecular property prediction tasks. Molecules\ninherently exist in the real world as three-dimensional structures;\nfurthermore, they are not static but in continuous motion in the 3D Euclidean\nspace, forming a potential energy surface. Therefore, it is desirable to\ngenerate multiple conformations in advance and extract molecular\nrepresentations using a 4D-QSAR model that incorporates multiple conformations.\nHowever, this approach is impractical for drug and material discovery tasks\nbecause of the computational cost of obtaining multiple conformations. To\naddress this issue, we propose a pre-training method for molecular GNNs using\nan existing dataset of molecular conformations to generate a latent vector\nuniversal to multiple conformations from a 2D molecular graph. Our method,\ncalled Boltzmann GNN, is formulated by maximizing the conditional marginal\nlikelihood of a conditional generative model for conformations generation. We\nshow that our model has a better prediction performance for molecular\nproperties than existing pre-training methods using molecular graphs and\nthree-dimensional molecular structures.\n","authors":["Daiki Koge","Naoaki Ono","Shigehiko Kanaya"],"pdf_url":"https://arxiv.org/pdf/2312.13110v1.pdf","comment":"4 pages. Short paper submitted to AAAI workshop (AI2ASE) 2023"},{"id":"http://arxiv.org/abs/2312.03807v2","updated":"2023-12-20T15:21:56Z","published":"2023-12-06T16:34:58Z","title":"Achieving ${O}(ε^{-1.5})$ Complexity in Hessian/Jacobian-free\n Stochastic Bilevel Optimization","summary":" In this paper, we revisit the bilevel optimization problem, in which the\nupper-level objective function is generally nonconvex and the lower-level\nobjective function is strongly convex. Although this type of problem has been\nstudied extensively, it still remains an open question how to achieve an\n${O}(\\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic\nbilevel optimization without any second-order derivative computation. To fill\nthis gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named\nFdeHBO, which features a simple fully single-loop structure, a projection-aided\nfinite-difference Hessian/Jacobian-vector approximation, and momentum-based\nupdates. Theoretically, we show that FdeHBO requires ${O}(\\epsilon^{-1.5})$\niterations (each using ${O}(1)$ samples and only first-order gradient\ninformation) to find an $\\epsilon$-accurate stationary point. As far as we\nknow, this is the first Hessian/Jacobian-free method with an\n${O}(\\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex\nstochastic bilevel optimization.\n","authors":["Yifan Yang","Peiyao Xiao","Kaiyi Ji"],"pdf_url":"https://arxiv.org/pdf/2312.03807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12145v2","updated":"2023-12-20T15:16:32Z","published":"2023-12-19T13:28:34Z","title":"OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in\n Noisy Environments","summary":" In reinforcement learning, the optimism in the face of uncertainty (OFU) is a\nmainstream principle for directing exploration towards less explored areas,\ncharacterized by higher uncertainty. However, in the presence of environmental\nstochasticity (noise), purely optimistic exploration may lead to excessive\nprobing of high-noise areas, consequently impeding exploration efficiency.\nHence, in exploring noisy environments, while optimism-driven exploration\nserves as a foundation, prudent attention to alleviating unnecessary\nover-exploration in high-noise areas becomes beneficial. In this work, we\npropose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a\nnoise-aware optimistic exploration for continuous control. OVD-Explorer\nproposes a new measurement of the policy's exploration ability considering\nnoise in optimistic perspectives, and leverages gradient ascent to drive\nexploration. Practically, OVD-Explorer can be easily integrated with continuous\ncontrol RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks\ndemonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic\nexploration.\n","authors":["Jinyi Liu","Zhi Wang","Yan Zheng","Jianye Hao","Chenjia Bai","Junjie Ye","Zhen Wang","Haiyin Piao","Yang Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12145v2.pdf","comment":"Accepted by AAAI 2024, with appendix"},{"id":"http://arxiv.org/abs/2312.13091v1","updated":"2023-12-20T15:12:53Z","published":"2023-12-20T15:12:53Z","title":"MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using\n Differentiable Shading","summary":" Reconstructing an avatar from a portrait image has many applications in\nmultimedia, but remains a challenging research problem. Extracting reflectance\nmaps and geometry from one image is ill-posed: recovering geometry is a\none-to-many mapping problem and reflectance and light are difficult to\ndisentangle. Accurate geometry and reflectance can be captured under the\ncontrolled conditions of a light stage, but it is costly to acquire large\ndatasets in this fashion. Moreover, training solely with this type of data\nleads to poor generalization with in-the-wild images. This motivates the\nintroduction of MoSAR, a method for 3D avatar generation from monocular images.\nWe propose a semi-supervised training scheme that improves generalization by\nlearning from both light stage and in-the-wild datasets. This is achieved using\na novel differentiable shading formulation. We show that our approach\neffectively disentangles the intrinsic face parameters, producing relightable\navatars. As a result, MoSAR estimates a richer set of skin reflectance maps,\nand generates more realistic avatars than existing state-of-the-art methods. We\nalso introduce a new dataset, named FFHQ-UV-Intrinsics, the first public\ndataset providing intrisic face attributes at scale (diffuse, specular, ambient\nocclusion and translucency maps) for a total of 10k subjects. The project\nwebsite and the dataset are available on the following link:\nhttps://ubisoftlaforge.github.io/character/mosar\n","authors":["Abdallah Dib","Luiz Gustavo Hafemann","Emeline Got","Trevor Anderson","Amin Fadaeinejad","Rafael M. O. Cruz","Marc-Andre Carbonneau"],"pdf_url":"https://arxiv.org/pdf/2312.13091v1.pdf","comment":"https://ubisoft-laforge.github.io/character/mosar/"},{"id":"http://arxiv.org/abs/2312.00626v2","updated":"2023-12-20T15:05:07Z","published":"2023-12-01T14:42:37Z","title":"Forecasting Trends in Food Security: a Reservoir Computing Approach","summary":" Early warning systems are an essential tool for effective humanitarian\naction. Advance warnings on impending disasters facilitate timely and targeted\nresponse which help save lives, livelihoods, and scarce financial resources. In\nthis work we present a new quantitative methodology to forecast levels of food\nconsumption for 60 consecutive days, at the sub-national level, in four\ncountries: Mali, Nigeria, Syria, and Yemen. The methodology is built on\npublicly available data from the World Food Programme's integrated global\nhunger monitoring system which collects, processes, and displays daily updates\non key food security metrics, conflict, weather events, and other drivers of\nfood insecurity across 90 countries (https://hungermap.wfp.org/). In this\nstudy, we assessed the performance of various models including ARIMA, XGBoost,\nLSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared\nError (RMSE) metrics. This comprehensive analysis spanned classical\nstatistical, machine learning, and deep learning approaches. Our findings\nhighlight Reservoir Computing as a particularly well-suited model in the field\nof food security given both its notable resistance to over-fitting on limited\ndata samples and its efficient training capabilities. The methodology we\nintroduce establishes the groundwork for a global, data-driven early warning\nsystem designed to anticipate and detect food insecurity.\n","authors":["Joschka Herteux","Christoph Räth","Amine Baha","Giulia Martini","Duccio Piovani"],"pdf_url":"https://arxiv.org/pdf/2312.00626v2.pdf","comment":"22 pages, 11 figures, typo in acknowledgements corrected"},{"id":"http://arxiv.org/abs/2312.13084v1","updated":"2023-12-20T15:04:52Z","published":"2023-12-20T15:04:52Z","title":"Pyreal: A Framework for Interpretable ML Explanations","summary":" Users in many domains use machine learning (ML) predictions to help them make\ndecisions. Effective ML-based decision-making often requires explanations of ML\nmodels and their predictions. While there are many algorithms that explain\nmodels, generating explanations in a format that is comprehensible and useful\nto decision-makers is a nontrivial task that can require extensive development\noverhead. We developed Pyreal, a highly extensible system with a corresponding\nPython implementation for generating a variety of interpretable ML\nexplanations. Pyreal converts data and explanations between the feature spaces\nexpected by the model, relevant explanation algorithms, and human users,\nallowing users to generate interpretable explanations in a low-code manner. Our\nstudies demonstrate that Pyreal generates more useful explanations than\nexisting systems while remaining both easy-to-use and efficient.\n","authors":["Alexandra Zytek","Wei-En Wang","Dongyu Liu","Laure Berti-Equille","Kalyan Veeramachaneni"],"pdf_url":"https://arxiv.org/pdf/2312.13084v1.pdf","comment":"12 pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2306.02630v2","updated":"2023-12-20T15:01:25Z","published":"2023-06-05T06:57:09Z","title":"Covariance Adaptive Best Arm Identification","summary":" We consider the problem of best arm identification in the multi-armed bandit\nmodel, under fixed confidence. Given a confidence input $\\delta$, the goal is\nto identify the arm with the highest mean reward with a probability of at least\n1 -- $\\delta$, while minimizing the number of arm pulls. While the literature\nprovides solutions to this problem under the assumption of independent arms\ndistributions, we propose a more flexible scenario where arms can be dependent\nand rewards can be sampled simultaneously. This framework allows the learner to\nestimate the covariance among the arms distributions, enabling a more efficient\nidentification of the best arm. The relaxed setting we propose is relevant in\nvarious applications, such as clinical trials, where similarities between\npatients or drugs suggest underlying correlations in the outcomes. We introduce\nnew algorithms that adapt to the unknown covariance of the arms and demonstrate\nthrough theoretical guarantees that substantial improvement can be achieved\nover the standard setting. Additionally, we provide new lower bounds for the\nrelaxed setting and present numerical simulations that support their\ntheoretical findings.\n","authors":["El Mehdi Saad","Gilles Blanchard","Nicolas Verzelen"],"pdf_url":"https://arxiv.org/pdf/2306.02630v2.pdf","comment":"New version with some minor corrections"},{"id":"http://arxiv.org/abs/2212.01039v2","updated":"2023-12-20T15:00:43Z","published":"2022-12-02T09:11:32Z","title":"SoftCorrect: Error Correction with Soft Detection for Automatic Speech\n Recognition","summary":" Error correction in automatic speech recognition (ASR) aims to correct those\nincorrect words in sentences generated by ASR models. Since recent ASR models\nusually have low word error rate (WER), to avoid affecting originally correct\ntokens, error correction models should only modify incorrect words, and\ntherefore detecting incorrect words is important for error correction. Previous\nworks on error correction either implicitly detect error words through\ntarget-source attention or CTC (connectionist temporal classification) loss, or\nexplicitly locate specific deletion/substitution/insertion errors. However,\nimplicit error detection does not provide clear signal about which tokens are\nincorrect and explicit error detection suffers from low detection accuracy. In\nthis paper, we propose SoftCorrect with a soft error detection mechanism to\navoid the limitations of both explicit and implicit error detection.\nSpecifically, we first detect whether a token is correct or not through a\nprobability produced by a dedicatedly designed language model, and then design\na constrained CTC loss that only duplicates the detected incorrect tokens to\nlet the decoder focus on the correction of error tokens. Compared with implicit\nerror detection with CTC loss, SoftCorrect provides explicit signal about which\nwords are incorrect and thus does not need to duplicate every token but only\nincorrect tokens; compared with explicit error detection, SoftCorrect does not\ndetect specific deletion/substitution/insertion errors but just leaves it to\nCTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that\nSoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming\nprevious works by a large margin, while still enjoying fast speed of parallel\ngeneration.\n","authors":["Yichong Leng","Xu Tan","Wenjie Liu","Kaitao Song","Rui Wang","Xiang-Yang Li","Tao Qin","Edward Lin","Tie-Yan Liu"],"pdf_url":"https://arxiv.org/pdf/2212.01039v2.pdf","comment":"AAAI 2023"},{"id":"http://arxiv.org/abs/2202.02249v2","updated":"2023-12-20T14:56:21Z","published":"2022-02-04T17:32:28Z","title":"Functional Mixtures-of-Experts","summary":" We consider the statistical analysis of heterogeneous data for prediction in\nsituations where the observations include functions, typically time series. We\nextend the modeling with Mixtures-of-Experts (ME), as a framework of choice in\nmodeling heterogeneity in data for prediction with vectorial observations, to\nthis functional data analysis context. We first present a new family of ME\nmodels, named functional ME (FME) in which the predictors are potentially noisy\nobservations, from entire functions. Furthermore, the data generating process\nof the predictor and the real response, is governed by a hidden discrete\nvariable representing an unknown partition. Second, by imposing sparsity on\nderivatives of the underlying functional parameters via Lasso-like\nregularizations, we provide sparse and interpretable functional representations\nof the FME models called iFME. We develop dedicated expectation--maximization\nalgorithms for Lasso-like (EM-Lasso) regularized maximum-likelihood parameter\nestimation strategies to fit the models. The proposed models and algorithms are\nstudied in simulated scenarios and in applications to two real data sets, and\nthe obtained results demonstrate their performance in accurately capturing\ncomplex nonlinear relationships and in clustering the heterogeneous regression\ndata.\n","authors":["Faïcel Chamroukhi","Nhat Thien Pham","Van Hà Hoang","Geoffrey J. McLachlan"],"pdf_url":"https://arxiv.org/pdf/2202.02249v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17330v3","updated":"2023-12-20T14:54:15Z","published":"2023-05-27T02:14:09Z","title":"MADiff: Offline Multi-agent Learning with Diffusion Models","summary":" Diffusion model (DM), as a powerful generative model, recently achieved huge\nsuccess in various scenarios including offline reinforcement learning, where\nthe policy learns to conduct planning by generating trajectory in the online\nevaluation. However, despite the effectiveness shown for single-agent learning,\nit remains unclear how DMs can operate in multi-agent problems, where agents\ncan hardly complete teamwork without good coordination by independently\nmodeling each agent's trajectories. In this paper, we propose MADiff, a novel\ngenerative multi-agent learning framework to tackle this problem. MADiff is\nrealized with an attention-based diffusion model to model the complex\ncoordination among behaviors of multiple diffusion agents. To the best of our\nknowledge, MADiff is the first diffusion-based multi-agent offline RL\nframework, which behaves as both a decentralized policy and a centralized\ncontroller. During decentralized executions, MADiff simultaneously performs\nteammate modeling, and the centralized controller can also be applied in\nmulti-agent trajectory predictions. Our experiments show the superior\nperformance of MADiff compared to baseline algorithms in a wide range of\nmulti-agent learning tasks, which emphasizes the effectiveness of MADiff in\nmodeling complex multi-agent interactions. Our code is available at\nhttps://github.com/zbzhu99/madiff.\n","authors":["Zhengbang Zhu","Minghuan Liu","Liyuan Mao","Bingyi Kang","Minkai Xu","Yong Yu","Stefano Ermon","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.17330v3.pdf","comment":"20 pages, 10 figures, 6 tables. The first two authors contributed\n equally to the work"},{"id":"http://arxiv.org/abs/2212.06370v3","updated":"2023-12-20T14:50:00Z","published":"2022-12-13T05:03:16Z","title":"Dual Accuracy-Quality-Driven Neural Network for Prediction Interval\n Generation","summary":" Accurate uncertainty quantification is necessary to enhance the reliability\nof deep learning models in real-world applications. In the case of regression\ntasks, prediction intervals (PIs) should be provided along with the\ndeterministic predictions of deep learning models. Such PIs are useful or\n\"high-quality\" as long as they are sufficiently narrow and capture most of the\nprobability density. In this paper, we present a method to learn prediction\nintervals for regression-based neural networks automatically in addition to the\nconventional target predictions. In particular, we train two companion neural\nnetworks: one that uses one output, the target estimate, and another that uses\ntwo outputs, the upper and lower bounds of the corresponding PI. Our main\ncontribution is the design of a novel loss function for the PI-generation\nnetwork that takes into account the output of the target-estimation network and\nhas two optimization objectives: minimizing the mean prediction interval width\nand ensuring the PI integrity using constraints that maximize the prediction\ninterval probability coverage implicitly. Furthermore, we introduce a\nself-adaptive coefficient that balances both objectives within the loss\nfunction, which alleviates the task of fine-tuning. Experiments using a\nsynthetic dataset, eight benchmark datasets, and a real-world crop yield\nprediction dataset showed that our method was able to maintain a nominal\nprobability coverage and produce significantly narrower PIs without detriment\nto its target estimation accuracy when compared to those PIs generated by three\nstate-of-the-art neural-network-based methods. In other words, our method was\nshown to produce higher-quality PIs.\n","authors":["Giorgio Morales","John W. Sheppard"],"pdf_url":"https://arxiv.org/pdf/2212.06370v3.pdf","comment":"Accepted at the IEEE Transactions on Neural Networks and Learning\n Systems"},{"id":"http://arxiv.org/abs/2312.13068v1","updated":"2023-12-20T14:46:54Z","published":"2023-12-20T14:46:54Z","title":"Continuous-time Graph Representation with Sequential Survival Process","summary":" Over the past two decades, there has been a tremendous increase in the growth\nof representation learning methods for graphs, with numerous applications\nacross various fields, including bioinformatics, chemistry, and the social\nsciences. However, current dynamic network approaches focus on discrete-time\nnetworks or treat links in continuous-time networks as instantaneous events.\nTherefore, these approaches have limitations in capturing the persistence or\nabsence of links that continuously emerge and disappear over time for\nparticular durations. To address this, we propose a novel stochastic process\nrelying on survival functions to model the durations of links and their\nabsences over time. This forms a generic new likelihood specification\nexplicitly accounting for intermittent edge-persistent networks, namely GraSSP:\nGraph Representation with Sequential Survival Process. We apply the developed\nframework to a recent continuous time dynamic latent distance model\ncharacterizing network dynamics in terms of a sequence of piecewise linear\nmovements of nodes in latent space. We quantitatively assess the developed\nframework in various downstream tasks, such as link prediction and network\ncompletion, demonstrating that the developed modeling framework accounting for\nlink persistence and absence well tracks the intrinsic trajectories of nodes in\na latent space and captures the underlying characteristics of evolving network\nstructure.\n","authors":["Abdulkadir Celikkanat","Nikolaos Nakis","Morten Mørup"],"pdf_url":"https://arxiv.org/pdf/2312.13068v1.pdf","comment":"Accepted to the 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI24), Vancouver, British Columbia, 2024"},{"id":"http://arxiv.org/abs/2310.02152v2","updated":"2023-12-20T14:30:36Z","published":"2023-10-03T15:40:03Z","title":"Graph Neural Network-based EEG Classification: A Survey","summary":" Graph neural networks (GNN) are increasingly used to classify EEG for tasks\nsuch as emotion recognition, motor imagery and neurological diseases and\ndisorders. A wide range of methods have been proposed to design GNN-based\nclassifiers. Therefore, there is a need for a systematic review and\ncategorisation of these approaches. We exhaustively search the published\nliterature on this topic and derive several categories for comparison. These\ncategories highlight the similarities and differences among the methods. The\nresults suggest a prevalence of spectral graph convolutional layers over\nspatial. Additionally, we identify standard forms of node features, with the\nmost popular being the raw EEG signal and differential entropy. Our results\nsummarise the emerging trends in GNN-based approaches for EEG classification.\nFinally, we discuss several promising research directions, such as exploring\nthe potential of transfer learning methods and appropriate modelling of\ncross-frequency interactions.\n","authors":["Dominik Klepl","Min Wu","Fei He"],"pdf_url":"https://arxiv.org/pdf/2310.02152v2.pdf","comment":"14 pages, 3 figures"},{"id":"http://arxiv.org/abs/2301.03713v3","updated":"2023-12-20T14:24:31Z","published":"2023-01-09T23:19:40Z","title":"Non-contact Respiratory Anomaly Detection using Infrared Light-wave\n Sensing","summary":" Human respiratory rate and its pattern convey essential information about the\nphysical and psychological states of the subject. Abnormal breathing can\nindicate fatal health issues leading to further diagnosis and treatment.\nWireless light-wave sensing (LWS) using incoherent infrared light shows promise\nin safe, discreet, efficient, and non-invasive human breathing monitoring\nwithout raising privacy concerns. The respiration monitoring system needs to be\ntrained on different types of breathing patterns to identify breathing\nanomalies.The system must also validate the collected data as a breathing\nwaveform, discarding any faulty data caused by external interruption, user\nmovement, or system malfunction. To address these needs, this study simulated\nnormal and different types of abnormal respiration using a robot that mimics\nhuman breathing patterns. Then, time-series respiration data were collected\nusing infrared light-wave sensing technology. Three machine learning\nalgorithms, decision tree, random forest and XGBoost, were applied to detect\nbreathing anomalies and faulty data. Model performances were evaluated through\ncross-validation, assessing classification accuracy, precision and recall\nscores. The random forest model achieved the highest classification accuracy of\n96.75% with data collected at a 0.5m distance. In general, ensemble models like\nrandom forest and XGBoost performed better than a single model in classifying\nthe data collected at multiple distances from the light-wave sensing setup.\n","authors":["Md Zobaer Islam","Brenden Martin","Carly Gotcher","Tyler Martinez","John F. O'Hara","Sabit Ekin"],"pdf_url":"https://arxiv.org/pdf/2301.03713v3.pdf","comment":"12 pages, 15 figures excluding photos of authors, submitted to IEEE\n Transactions on Human-machine Systems"},{"id":"http://arxiv.org/abs/2310.01685v2","updated":"2023-12-20T14:19:17Z","published":"2023-10-02T22:46:49Z","title":"A Framework for Interpretability in Machine Learning for Medical Imaging","summary":" Interpretability for machine learning models in medical imaging (MLMI) is an\nimportant direction of research. However, there is a general sense of murkiness\nin what interpretability means. Why does the need for interpretability in MLMI\narise? What goals does one actually seek to address when interpretability is\nneeded? To answer these questions, we identify a need to formalize the goals\nand elements of interpretability in MLMI. By reasoning about real-world tasks\nand goals common in both medical image analysis and its intersection with\nmachine learning, we identify five core elements of interpretability:\nlocalization, visual recognizability, physical attribution, model transparency,\nand actionability. From this, we arrive at a framework for interpretability in\nMLMI, which serves as a step-by-step guide to approaching interpretability in\nthis context. Overall, this paper formalizes interpretability needs in the\ncontext of medical imaging, and our applied perspective clarifies concrete\nMLMI-specific goals and considerations in order to guide method design and\nimprove real-world usage. Our goal is to provide practical and didactic\ninformation for model designers and practitioners, inspire developers of models\nin the medical imaging field to reason more deeply about what interpretability\nis achieving, and suggest future directions of interpretability research.\n","authors":["Alan Q. Wang","Batuhan K. Karaman","Heejong Kim","Jacob Rosenthal","Rachit Saluja","Sean I. Young","Mert R. Sabuncu"],"pdf_url":"https://arxiv.org/pdf/2310.01685v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.00283v3","updated":"2023-12-20T14:15:49Z","published":"2022-07-01T09:20:05Z","title":"Learning Lattice Quantum Field Theories with Equivariant Continuous\n Flows","summary":" We propose a novel machine learning method for sampling from the\nhigh-dimensional probability distributions of Lattice Field Theories, which is\nbased on a single neural ODE layer and incorporates the full symmetries of the\nproblem. We test our model on the $\\phi^4$ theory, showing that it\nsystematically outperforms previously proposed flow-based methods in sampling\nefficiency, and the improvement is especially pronounced for larger lattices.\nFurthermore, we demonstrate that our model can learn a continuous family of\ntheories at once, and the results of learning can be transferred to larger\nlattices. Such generalizations further accentuate the advantages of machine\nlearning methods.\n","authors":["Mathis Gerdes","Pim de Haan","Corrado Rainone","Roberto Bondesan","Miranda C. N. Cheng"],"pdf_url":"https://arxiv.org/pdf/2207.00283v3.pdf","comment":"17 pages, 9 figures, 1 table; slightly expanded published version,\n added 2 figures and 2 sections to appendix"},{"id":"http://arxiv.org/abs/2312.13038v1","updated":"2023-12-20T14:04:57Z","published":"2023-12-20T14:04:57Z","title":"AutoXPCR: Automated Multi-Objective Model Selection for Time Series\n Forecasting","summary":" Automated machine learning (AutoML) streamlines the creation of ML models.\nWhile most methods select the \"best\" model based on predictive quality, it's\ncrucial to acknowledge other aspects, such as interpretability and resource\nconsumption. This holds particular importance in the context of deep neural\nnetworks (DNNs), as these models are often perceived as computationally\nintensive black boxes. In the challenging domain of time series forecasting,\nDNNs achieve stunning results, but specialized approaches for automatically\nselecting models are scarce. In this paper, we propose AutoXPCR - a novel\nmethod for automated and explainable multi-objective model selection. Our\napproach leverages meta-learning to estimate any model's performance along PCR\ncriteria, which encompass (P)redictive error, (C)omplexity, and (R)esource\ndemand. Explainability is addressed on multiple levels, as our interactive\nframework can prioritize less complex models and provide by-product\nexplanations of recommendations. We demonstrate practical feasibility by\ndeploying AutoXPCR on over 1000 configurations across 114 data sets from\nvarious domains. Our method clearly outperforms other model selection\napproaches - on average, it only requires 20% of computation costs for\nrecommending models with 90% of the best-possible quality.\n","authors":["Raphael Fischer","Amal Saadallah"],"pdf_url":"https://arxiv.org/pdf/2312.13038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13035v1","updated":"2023-12-20T13:59:43Z","published":"2023-12-20T13:59:43Z","title":"1D-CNN Optimization for Non-contact Respiration Pattern Classification","summary":" In this study, we present a deep learning-based approach for time-series\nrespiration data classification. The dataset contains regular breathing\npatterns as well as various forms of abnormal breathing, obtained through\nnon-contact incoherent light-wave sensing (LWS) technology. Given the\none-dimensional (1D) nature of the data, we employed a 1D convolutional neural\nnetwork (1D-CNN) for classification purposes. Genetic algorithm was employed to\noptimize the 1D-CNN architecture to maximize classification accuracy.\nAddressing the computational complexity associated with training the 1D-CNN\nacross multiple generations, we implemented transfer learning from a\npre-trained model. This approach significantly reduced the computational time\nrequired for training, thereby enhancing the efficiency of the optimization\nprocess. This study contributes valuable insights into the potential\napplications of deep learning methodologies for enhancing respiratory anomaly\ndetection through precise and efficient respiration classification.\n","authors":["Md Zobaer Islam","Gary Yen"],"pdf_url":"https://arxiv.org/pdf/2312.13035v1.pdf","comment":"7 pages, 8 figures, to be submitted to IEEE conference"},{"id":"http://arxiv.org/abs/2312.13033v1","updated":"2023-12-20T13:56:31Z","published":"2023-12-20T13:56:31Z","title":"Explainable artificial intelligence approaches for brain-computer\n interfaces: a review and design space","summary":" This review paper provides an integrated perspective of Explainable\nArtificial Intelligence techniques applied to Brain-Computer Interfaces. BCIs\nuse predictive models to interpret brain signals for various high-stake\napplications. However, achieving explainability in these complex models is\nchallenging as it compromises accuracy. The field of XAI has emerged to address\nthe need for explainability across various stakeholders, but there is a lack of\nan integrated perspective in XAI for BCI (XAI4BCI) literature. It is necessary\nto differentiate key concepts like explainability, interpretability, and\nunderstanding in this context and formulate a comprehensive framework. To\nunderstand the need of XAI for BCI, we pose six key research questions for a\nsystematic review and meta-analysis, encompassing its purposes, applications,\nusability, and technical feasibility. We employ the PRISMA methodology --\npreferred reporting items for systematic reviews and meta-analyses to review\n(n=1246) and analyze (n=84) studies published in 2015 and onwards for key\ninsights. The results highlight that current research primarily focuses on\ninterpretability for developers and researchers, aiming to justify outcomes and\nenhance model performance. We discuss the unique approaches, advantages, and\nlimitations of XAI4BCI from the literature. We draw insights from philosophy,\npsychology, and social sciences. We propose a design space for XAI4BCI,\nconsidering the evolving need to visualize and investigate predictive model\noutcomes customised for various stakeholders in the BCI development and\ndeployment lifecycle. This paper is the first to focus solely on reviewing\nXAI4BCI research articles. This systematic review and meta-analysis findings\nwith the proposed design space prompt important discussions on establishing\nstandards for BCI explanations, highlighting current limitations, and guiding\nthe future of XAI in BCI.\n","authors":["Param Rajpura","Hubert Cecotti","Yogesh Kumar Meena"],"pdf_url":"https://arxiv.org/pdf/2312.13033v1.pdf","comment":"draft submission"},{"id":"http://arxiv.org/abs/2312.13032v1","updated":"2023-12-20T13:56:27Z","published":"2023-12-20T13:56:27Z","title":"NodeMixup: Tackling Under-Reaching for Graph Neural Networks","summary":" Graph Neural Networks (GNNs) have become mainstream methods for solving the\nsemi-supervised node classification problem. However, due to the uneven\nlocation distribution of labeled nodes in the graph, labeled nodes are only\naccessible to a small portion of unlabeled nodes, leading to the\n\\emph{under-reaching} issue. In this study, we firstly reveal under-reaching by\nconducting an empirical investigation on various well-known graphs. Then, we\ndemonstrate that under-reaching results in unsatisfactory distribution\nalignment between labeled and unlabeled nodes through systematic experimental\nanalysis, significantly degrading GNNs' performance. To tackle under-reaching\nfor GNNs, we propose an architecture-agnostic method dubbed NodeMixup. The\nfundamental idea is to (1) increase the reachability of labeled nodes by\nlabeled-unlabeled pairs mixup, (2) leverage graph structures via fusing the\nneighbor connections of intra-class node pairs to improve performance gains of\nmixup, and (3) use neighbor label distribution similarity incorporating node\ndegrees to determine sampling weights for node mixup. Extensive experiments\ndemonstrate the efficacy of NodeMixup in assisting GNNs in handling\nunder-reaching. The source code is available at\n\\url{https://github.com/WeigangLu/NodeMixup}.\n","authors":["Weigang Lu","Ziyu Guan","Wei Zhao","Long Jin"],"pdf_url":"https://arxiv.org/pdf/2312.13032v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13031v1","updated":"2023-12-20T13:55:56Z","published":"2023-12-20T13:55:56Z","title":"A self-attention-based differentially private tabular GAN with high data\n utility","summary":" Generative Adversarial Networks (GANs) have become a ubiquitous technology\nfor data generation, with their prowess in image generation being\nwell-established. However, their application in generating tabular data has\nbeen less than ideal. Furthermore, attempting to incorporate differential\nprivacy technology into these frameworks has often resulted in a degradation of\ndata utility. To tackle these challenges, this paper introduces DP-SACTGAN, a\nnovel Conditional Generative Adversarial Network (CGAN) framework for\ndifferentially private tabular data generation, aiming to surmount these\nobstacles. Experimental findings demonstrate that DP-SACTGAN not only\naccurately models the distribution of the original data but also effectively\nsatisfies the requirements of differential privacy.\n","authors":["Zijian Li","Zhihui Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13031v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13027v1","updated":"2023-12-20T13:50:26Z","published":"2023-12-20T13:50:26Z","title":"Doubly Perturbed Task-Free Continual Learning","summary":" Task-free online continual learning (TF-CL) is a challenging problem where\nthe model incrementally learns tasks without explicit task information.\nAlthough training with entire data from the past, present as well as future is\nconsidered as the gold standard, naive approaches in TF-CL with the current\nsamples may be conflicted with learning with samples in the future, leading to\ncatastrophic forgetting and poor plasticity. Thus, a proactive consideration of\nan unseen future sample in TF-CL becomes imperative. Motivated by this\nintuition, we propose a novel TF-CL framework considering future samples and\nshow that injecting adversarial perturbations on both input data and\ndecision-making is effective. Then, we propose a novel method named Doubly\nPerturbed Continual Learning (DPCL) to efficiently implement these input and\ndecision-making perturbations. Specifically, for input perturbation, we propose\nan approximate perturbation method that injects noise into the input data as\nwell as the feature vector and then interpolates the two perturbed samples. For\ndecision-making process perturbation, we devise multiple stochastic\nclassifiers. We also investigate a memory management scheme and learning rate\nscheduling reflecting our proposed double perturbations. We demonstrate that\nour proposed method outperforms the state-of-the-art baseline methods by large\nmargins on various TF-CL benchmarks.\n","authors":["Byung Hyun Lee","Min-hwan Oh","Se Young Chun"],"pdf_url":"https://arxiv.org/pdf/2312.13027v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2308.13380v2","updated":"2023-12-20T13:42:58Z","published":"2023-08-25T13:50:17Z","title":"From system models to class models: An in-context learning paradigm","summary":" Is it possible to understand the intricacies of a dynamical system not solely\nfrom its input/output pattern, but also by observing the behavior of other\nsystems within the same class? This central question drives the study presented\nin this paper.\n In response to this query, we introduce a novel paradigm for system\nidentification, addressing two primary tasks: one-step-ahead prediction and\nmulti-step simulation. Unlike conventional methods, we do not directly estimate\na model for the specific system. Instead, we learn a meta model that represents\na class of dynamical systems. This meta model is trained on a potentially\ninfinite stream of synthetic data, generated by simulators whose settings are\nrandomly extracted from a probability distribution. When provided with a\ncontext from a new system-specifically, an input/output sequence-the meta model\nimplicitly discerns its dynamics, enabling predictions of its behavior.\n The proposed approach harnesses the power of Transformers, renowned for their\n\\emph{in-context learning} capabilities. For one-step prediction, a GPT-like\ndecoder-only architecture is utilized, whereas the simulation problem employs\nan encoder-decoder structure. Initial experimental results affirmatively answer\nour foundational question, opening doors to fresh research avenues in system\nidentification.\n","authors":["Marco Forgione","Filippo Pura","Dario Piga"],"pdf_url":"https://arxiv.org/pdf/2308.13380v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03351v2","updated":"2023-12-20T13:34:42Z","published":"2023-11-06T18:58:59Z","title":"Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with\n Multi-Step On-Policy Optimization","summary":" Combining offline and online reinforcement learning (RL) is crucial for\nefficient and safe learning. However, previous approaches treat offline and\nonline learning as separate procedures, resulting in redundant designs and\nlimited performance. We ask: Can we achieve straightforward yet effective\noffline and online learning without introducing extra conservatism or\nregularization? In this study, we propose Uni-o4, which utilizes an on-policy\nobjective for both offline and online learning. Owning to the alignment of\nobjectives in two phases, the RL agent can transfer between offline and online\nlearning seamlessly. This property enhances the flexibility of the learning\nparadigm, allowing for arbitrary combinations of pretraining, fine-tuning,\noffline, and online learning. In the offline phase, specifically, Uni-o4\nleverages diverse ensemble policies to address the mismatch issues between the\nestimated behavior policy and the offline dataset. Through a simple offline\npolicy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy\nimprovement safely. We demonstrate that by employing the method above, the\nfusion of these two paradigms can yield superior offline initialization as well\nas stable and rapid online fine-tuning capabilities. Through real-world robot\ntasks, we highlight the benefits of this paradigm for rapid deployment in\nchallenging, previously unseen real-world environments. Additionally, through\ncomprehensive evaluations using numerous simulated benchmarks, we substantiate\nthat our method achieves state-of-the-art performance in both offline and\noffline-to-online fine-tuning learning. Our website:\nhttps://lei-kun.github.io/uni-o4/ .\n","authors":["Kun Lei","Zhengmao He","Chenhao Lu","Kaizhe Hu","Yang Gao","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2311.03351v2.pdf","comment":"Our website: https://lei-kun.github.io/uni-o4/"},{"id":"http://arxiv.org/abs/2312.12183v2","updated":"2023-12-20T13:29:23Z","published":"2023-12-19T14:15:20Z","title":"Poincaré Differential Privacy for Hierarchy-Aware Graph Embedding","summary":" Hierarchy is an important and commonly observed topological property in\nreal-world graphs that indicate the relationships between supervisors and\nsubordinates or the organizational behavior of human groups. As hierarchy is\nintroduced as a new inductive bias into the Graph Neural Networks (GNNs) in\nvarious tasks, it implies latent topological relations for attackers to improve\ntheir inference attack performance, leading to serious privacy leakage issues.\nIn addition, existing privacy-preserving frameworks suffer from reduced\nprotection ability in hierarchical propagation due to the deficiency of\nadaptive upper-bound estimation of the hierarchical perturbation boundary. It\nis of great urgency to effectively leverage the hierarchical property of data\nwhile satisfying privacy guarantees. To solve the problem, we propose the\nPoincar\\'e Differential Privacy framework, named PoinDP, to protect the\nhierarchy-aware graph embedding based on hyperbolic geometry. Specifically,\nPoinDP first learns the hierarchy weights for each entity based on the\nPoincar\\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware\nSensitivity is designed to measure the sensitivity of the hierarchical\nstructure and adaptively allocate the privacy protection strength. Besides, the\nHyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian\nmechanism in Euclidean space to hyperbolic space to realize random\nperturbations that satisfy differential privacy under the hyperbolic space\nmetric. Extensive experiment results on five real-world datasets demonstrate\nthe proposed PoinDP's advantages of effective privacy protection while\nmaintaining good performance on the node classification task.\n","authors":["Yuecen Wei","Haonan Yuan","Xingcheng Fu","Qingyun Sun","Hao Peng","Xianxian Li","Chunming Hu"],"pdf_url":"https://arxiv.org/pdf/2312.12183v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13008v1","updated":"2023-12-20T13:20:31Z","published":"2023-12-20T13:20:31Z","title":"No More Shortcuts: Realizing the Potential of Temporal Self-Supervision","summary":" Self-supervised approaches for video have shown impressive results in video\nunderstanding tasks. However, unlike early works that leverage temporal\nself-supervision, current state-of-the-art methods primarily rely on tasks from\nthe image domain (e.g., contrastive learning) that do not explicitly promote\nthe learning of temporal features. We identify two factors that limit existing\ntemporal self-supervision: 1) tasks are too simple, resulting in saturated\ntraining performance, and 2) we uncover shortcuts based on local appearance\nstatistics that hinder the learning of high-level features. To address these\nissues, we propose 1) a more challenging reformulation of temporal\nself-supervision as frame-level (rather than clip-level) recognition tasks and\n2) an effective augmentation strategy to mitigate shortcuts. Our model extends\na representation of single video frames, pre-trained through contrastive\nlearning, with a transformer that we train through temporal self-supervision.\nWe demonstrate experimentally that our more challenging frame-level task\nformulations and the removal of shortcuts drastically improve the quality of\nfeatures learned through temporal self-supervision. The generalization\ncapability of our self-supervised video method is evidenced by its\nstate-of-the-art performance in a wide range of high-level semantic tasks,\nincluding video retrieval, action classification, and video attribute\nrecognition (such as object and scene identification), as well as low-level\ntemporal correspondence tasks like video object segmentation and pose tracking.\nAdditionally, we show that the video representations learned through our method\nexhibit increased robustness to the input perturbations.\n","authors":["Ishan Rajendrakumar Dave","Simon Jenni","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2312.13008v1.pdf","comment":"AAAI 2024 (Main Technical Track)"},{"id":"http://arxiv.org/abs/2312.12989v1","updated":"2023-12-20T12:46:44Z","published":"2023-12-20T12:46:44Z","title":"Benchmarking and Analyzing In-context Learning, Fine-tuning and\n Supervised Learning for Biomedical Knowledge Curation: a focused study on\n chemical entities of biological interest","summary":" Automated knowledge curation for biomedical ontologies is key to ensure that\nthey remain comprehensive, high-quality and up-to-date. In the era of\nfoundational language models, this study compares and analyzes three NLP\nparadigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and\nsupervised learning (ML). Using the Chemical Entities of Biological Interest\n(ChEBI) database as a model ontology, three curation tasks were devised. For\nICL, three prompting strategies were employed with GPT-4, GPT-3.5, BioGPT.\nPubmedBERT was chosen for the FT paradigm. For ML, six embedding models were\nutilized for training Random Forest and Long-Short Term Memory models. Five\nsetups were designed to assess ML and FT model performance across different\ndata availability scenarios.Datasets for curation tasks included: task 1\n(620,386), task 2 (611,430), and task 3 (617,381), maintaining a 50:50 positive\nversus negative ratio. For ICL models, GPT-4 achieved best accuracy scores of\n0.916, 0.766 and 0.874 for tasks 1-3 respectively. In a direct comparison, ML\n(trained on ~260,000 triples) outperformed ICL in accuracy across all tasks.\n(accuracy differences: +.11, +.22 and +.17). Fine-tuned PubmedBERT performed\nsimilarly to leading ML models in tasks 1 & 2 (F1 differences: -.014 and\n+.002), but worse in task 3 (-.048). Simulations revealed performance declines\nin both ML and FT models with smaller and higher imbalanced training data.\nwhere ICL (particularly GPT-4) excelled in tasks 1 & 3. GPT-4 excelled in tasks\n1 and 3 with less than 6,000 triples, surpassing ML/FT. ICL underperformed\nML/FT in task 2.ICL-augmented foundation models can be good assistants for\nknowledge curation with correct prompting, however, not making ML and FT\nparadigms obsolete. The latter two require task-specific data to beat ICL. In\nsuch cases, ML relies on small pretrained embeddings, minimizing computational\ndemands.\n","authors":["Emily Groves","Minhong Wang","Yusuf Abdulle","Holger Kunz","Jason Hoelscher-Obermaier","Ronin Wu","Honghan Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12989v1.pdf","comment":"26 pages, 5 figures, 14 tables"},{"id":"http://arxiv.org/abs/2312.12977v1","updated":"2023-12-20T12:34:54Z","published":"2023-12-20T12:34:54Z","title":"Collaborative Optimization of the Age of Information under Partial\n Observability","summary":" The significance of the freshness of sensor and control data at the receiver\nside, often referred to as Age of Information (AoI), is fundamentally\nconstrained by contention for limited network resources. Evidently, network\ncongestion is detrimental for AoI, where this congestion is partly self-induced\nby the sensor transmission process in addition to the contention from other\ntransmitting sensors. In this work, we devise a decentralized AoI-minimizing\ntransmission policy for a number of sensor agents sharing capacity-limited,\nnon-FIFO duplex channels that introduce random delays in communication with a\ncommon receiver. By implementing the same policy, however with no explicit\ninter-agent communication, the agents minimize the expected AoI in this\npartially observable system. We cater to the partial observability due to\nrandom channel delays by designing a bootstrap particle filter that\nindependently maintains a belief over the AoI of each agent. We also leverage\nmean-field control approximations and reinforcement learning to derive scalable\nand optimal solutions for minimizing the expected AoI collaboratively.\n","authors":["Anam Tahir","Kai Cui","Bastian Alt","Amr Rizk","Heinz Koeppl"],"pdf_url":"https://arxiv.org/pdf/2312.12977v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12973v1","updated":"2023-12-20T12:31:28Z","published":"2023-12-20T12:31:28Z","title":"Sparse Mean Field Load Balancing in Large Localized Queueing Systems","summary":" Scalable load balancing algorithms are of great interest in cloud networks\nand data centers, necessitating the use of tractable techniques to compute\noptimal load balancing policies for good performance. However, most existing\nscalable techniques, especially asymptotically scaling methods based on mean\nfield theory, have not been able to model large queueing networks with strong\nlocality. Meanwhile, general multi-agent reinforcement learning techniques can\nbe hard to scale and usually lack a theoretical foundation. In this work, we\naddress this challenge by leveraging recent advances in sparse mean field\ntheory to learn a near-optimal load balancing policy in sparsely connected\nqueueing networks in a tractable manner, which may be preferable to global\napproaches in terms of communication overhead. Importantly, we obtain a general\nload balancing framework for a large class of sparse bounded-degree topologies.\nBy formulating a novel mean field control problem in the context of graphs with\nbounded degree, we reduce the otherwise difficult multi-agent problem to a\nsingle-agent problem. Theoretically, the approach is justified by approximation\nguarantees. Empirically, the proposed methodology performs well on several\nrealistic and scalable network topologies. Moreover, we compare it with a\nnumber of well-known load balancing heuristics and with existing scalable\nmulti-agent reinforcement learning methods. Overall, we obtain a tractable\napproach for load balancing in highly localized networks.\n","authors":["Anam Tahir","Kai Cui","Heinz Koeppl"],"pdf_url":"https://arxiv.org/pdf/2312.12973v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12972v1","updated":"2023-12-20T12:23:30Z","published":"2023-12-20T12:23:30Z","title":"From Past to Future: Rethinking Eligibility Traces","summary":" In this paper, we introduce a fresh perspective on the challenges of credit\nassignment and policy evaluation. First, we delve into the nuances of\neligibility traces and explore instances where their updates may result in\nunexpected credit assignment to preceding states. From this investigation\nemerges the concept of a novel value function, which we refer to as the\n\\emph{bidirectional value function}. Unlike traditional state value functions,\nbidirectional value functions account for both future expected returns (rewards\nanticipated from the current state onward) and past expected returns\n(cumulative rewards from the episode's start to the present). We derive\nprincipled update equations to learn this value function and, through\nexperimentation, demonstrate its efficacy in enhancing the process of policy\nevaluation. In particular, our results indicate that the proposed learning\napproach can, in certain challenging contexts, perform policy evaluation more\nrapidly than TD($\\lambda$) -- a method that learns forward value functions,\n$v^\\pi$, \\emph{directly}. Overall, our findings present a new perspective on\neligibility traces and potential advantages associated with the novel value\nfunction it inspires, especially for policy evaluation.\n","authors":["Dhawal Gupta","Scott M. Jordan","Shreyas Chaudhari","Bo Liu","Philip S. Thomas","Bruno Castro da Silva"],"pdf_url":"https://arxiv.org/pdf/2312.12972v1.pdf","comment":"Accepted in The 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2310.04469v3","updated":"2023-12-20T12:14:57Z","published":"2023-10-05T21:04:16Z","title":"Taming Binarized Neural Networks and Mixed-Integer Programs","summary":" There has been a great deal of recent interest in binarized neural networks,\nespecially because of their explainability. At the same time, automatic\ndifferentiation algorithms such as backpropagation fail for binarized neural\nnetworks, which limits their applicability. By reformulating the problem of\ntraining binarized neural networks as a subadditive dual of a mixed-integer\nprogram, we show that binarized neural networks admit a tame representation.\nThis, in turn, makes it possible to use the framework of Bolte et al. for\nimplicit differentiation, which offers the possibility for practical\nimplementation of backpropagation in the context of binarized neural networks.\n This approach could also be used for a broader class of mixed-integer\nprograms, beyond the training of binarized neural networks, as encountered in\nsymbolic approaches to AI and beyond.\n","authors":["Johannes Aspman","Georgios Korpas","Jakub Marecek"],"pdf_url":"https://arxiv.org/pdf/2310.04469v3.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.10080v2","updated":"2023-12-20T12:01:45Z","published":"2023-12-10T18:33:45Z","title":"No prejudice! Fair Federated Graph Neural Networks for Personalized\n Recommendation","summary":" Ensuring fairness in Recommendation Systems (RSs) across demographic groups\nis critical due to the increased integration of RSs in applications such as\npersonalized healthcare, finance, and e-commerce. Graph-based RSs play a\ncrucial role in capturing intricate higher-order interactions among entities.\nHowever, integrating these graph models into the Federated Learning (FL)\nparadigm with fairness constraints poses formidable challenges as this requires\naccess to the entire interaction graph and sensitive user information (such as\ngender, age, etc.) at the central server. This paper addresses the pervasive\nissue of inherent bias within RSs for different demographic groups without\ncompromising the privacy of sensitive user attributes in FL environment with\nthe graph-based model. To address the group bias, we propose F2PGNN (Fair\nFederated Personalized Graph Neural Network), a novel framework that leverages\nthe power of Personalized Graph Neural Network (GNN) coupled with fairness\nconsiderations. Additionally, we use differential privacy techniques to fortify\nprivacy protection. Experimental evaluation on three publicly available\ndatasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47%\n- 99% compared to the state-of-the-art while preserving privacy and maintaining\nthe utility. The results validate the significance of our framework in\nachieving equitable and personalized recommendations using GNN within the FL\nlandscape.\n","authors":["Nimesh Agrawal","Anuj Kumar Sirohi"," Jayadeva","Sandeep Kumar"],"pdf_url":"https://arxiv.org/pdf/2312.10080v2.pdf","comment":"To appear as a full paper in AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12946v1","updated":"2023-12-20T11:43:33Z","published":"2023-12-20T11:43:33Z","title":"Class Conditional Time Series Generation with Structured Noise Space GAN","summary":" This paper introduces Structured Noise Space GAN (SNS-GAN), a novel approach\nin the field of generative modeling specifically tailored for class-conditional\ngeneration in both image and time series data. It addresses the challenge of\neffectively integrating class labels into generative models without requiring\nstructural modifications to the network. The SNS-GAN method embeds class\nconditions within the generator's noise space, simplifying the training process\nand enhancing model versatility. The model's efficacy is demonstrated through\nqualitative validations in the image domain and superior performance in time\nseries generation compared to baseline models. This research opens new avenues\nfor the application of GANs in various domains, including but not limited to\ntime series and image data generation.\n","authors":["Hamidreza Gholamrezaei","Alireza Koochali","Andreas Dengel","Sheraz Ahmed"],"pdf_url":"https://arxiv.org/pdf/2312.12946v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12945v1","updated":"2023-12-20T11:42:49Z","published":"2023-12-20T11:42:49Z","title":"Misclassification excess risk bounds for 1-bit matrix completion","summary":" This study investigates the misclassification excess risk bound in the\ncontext of 1-bit matrix completion, a significant problem in machine learning\ninvolving the recovery of an unknown matrix from a limited subset of its\nentries. Matrix completion has garnered considerable attention in the last two\ndecades due to its diverse applications across various fields. Unlike\nconventional approaches that deal with real-valued samples, 1-bit matrix\ncompletion is concerned with binary observations. While prior research has\npredominantly focused on the estimation error of proposed estimators, our study\nshifts attention to the prediction error. This paper offers theoretical\nanalysis regarding the prediction errors of two previous works utilizing the\nlogistic regression model: one employing a max-norm constrained minimization\nand the other employing nuclear-norm penalization. Significantly, our findings\ndemonstrate that the latter achieves the minimax-optimal rate without the need\nfor an additional logarithmic term. These novel results contribute to a deeper\nunderstanding of 1-bit matrix completion by shedding light on the predictive\nperformance of specific methodologies.\n","authors":["The Tien Mai"],"pdf_url":"https://arxiv.org/pdf/2312.12945v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.02916v2","updated":"2023-12-20T11:42:46Z","published":"2023-12-05T17:46:52Z","title":"MIND: Multi-Task Incremental Network Distillation","summary":" The recent surge of pervasive devices that generate dynamic data streams has\nunderscored the necessity for learning systems to adapt continually to data\ndistributional shifts. To tackle this challenge, the research community has put\nforth a spectrum of methodologies, including the demanding pursuit of\nclass-incremental learning without replay data. In this study, we present MIND,\na parameter isolation method that aims to significantly enhance the performance\nof replay-free solutions and achieve state-of-the-art results on several widely\nstudied datasets. Our approach introduces two main contributions: two\nalternative distillation procedures that significantly improve the efficiency\nof MIND increasing the accumulated knowledge of each sub-network, and the\noptimization of the BachNorm layers across tasks inside the sub-networks.\nOverall, MIND outperforms all the state-of-the-art methods for rehearsal-free\nClass-Incremental learning (with an increment in classification accuracy of\napprox. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx.\n+40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each\ncontribution to demonstrate its impact on performance improvement. Our results\nshowcase the superior performance of MIND indicating its potential for\naddressing the challenges posed by Class-incremental and Domain-Incremental\nlearning in resource-constrained environments.\n","authors":["Jacopo Bonato","Francesco Pelosin","Luigi Sabetta","Alessandro Nicolosi"],"pdf_url":"https://arxiv.org/pdf/2312.02916v2.pdf","comment":"Accepted at the 38th AAAI Conference on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2312.12937v1","updated":"2023-12-20T11:27:46Z","published":"2023-12-20T11:27:46Z","title":"Robust Loss Functions for Training Decision Trees with Noisy Labels","summary":" We consider training decision trees using noisily labeled data, focusing on\nloss functions that can lead to robust learning algorithms. Our contributions\nare threefold. First, we offer novel theoretical insights on the robustness of\nmany existing loss functions in the context of decision tree learning. We show\nthat some of the losses belong to a class of what we call conservative losses,\nand the conservative losses lead to an early stopping behavior during training\nand noise-tolerant predictions during testing. Second, we introduce a framework\nfor constructing robust loss functions, called distribution losses. These\nlosses apply percentile-based penalties based on an assumed margin\ndistribution, and they naturally allow adapting to different noise rates via a\nrobustness parameter. In particular, we introduce a new loss called the\nnegative exponential loss, which leads to an efficient greedy\nimpurity-reduction learning algorithm. Lastly, our experiments on multiple\ndatasets and noise settings validate our theoretical insight and the\neffectiveness of our adaptive negative exponential loss.\n","authors":["Jonathan Wilton","Nan Ye"],"pdf_url":"https://arxiv.org/pdf/2312.12937v1.pdf","comment":"Accepted at AAAI Conference on Artificial Intelligence 2024"},{"id":"http://arxiv.org/abs/2306.04886v2","updated":"2023-12-20T11:27:01Z","published":"2023-06-08T02:29:49Z","title":"Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity\n Prediction","summary":" Protein-ligand binding affinity (PLBA) prediction is the fundamental task in\ndrug discovery. Recently, various deep learning-based models predict binding\naffinity by incorporating the three-dimensional structure of protein-ligand\ncomplexes as input and achieving astounding progress. However, due to the\nscarcity of high-quality training data, the generalization ability of current\nmodels is still limited. In addition, different bioassays use varying affinity\nmeasurement labels (i.e., IC50, Ki, Kd), and different experimental conditions\ninevitably introduce systematic noise, which poses a significant challenge to\nconstructing high-precision affinity prediction models. To address these\nissues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training\nframework for structure-based PLBA prediction; (2) construct a pre-training\ndataset called ChEMBL-Dock with more than 300k experimentally measured affinity\nlabels and about 2.8M docked three-dimensional structures. By introducing\nmulti-task pre-training to treat the prediction of different affinity labels as\ndifferent tasks and classifying relative rankings between samples from the same\nbioassay, MBP learns robust and transferrable structural knowledge from our new\nChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the\ncapability of MBP as a general framework that can improve and be tailored to\nmainstream structure-based PLBA prediction tasks. To the best of our knowledge,\nMBP is the first affinity pre-training model and shows great potential for\nfuture development.\n","authors":["Jiaxian Yan","Zhaofeng Ye","Ziyi Yang","Chengqiang Lu","Shengyu Zhang","Qi Liu","Jiezhong Qiu"],"pdf_url":"https://arxiv.org/pdf/2306.04886v2.pdf","comment":"21 pages, 7 figures"},{"id":"http://arxiv.org/abs/2306.03625v2","updated":"2023-12-20T11:24:12Z","published":"2023-06-06T12:22:20Z","title":"Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy\n Learning","summary":" We propose a simple and general framework for nonparametric estimation of\nheterogeneous treatment effects under fairness constraints. Under standard\nregularity conditions, we show that the resulting estimators possess the double\nrobustness property. We use this framework to characterize the trade-off\nbetween fairness and the maximum welfare achievable by the optimal policy. We\nevaluate the methods in a simulation study and illustrate them in a real-world\ncase study.\n","authors":["Kwangho Kim","José R. Zubizarreta"],"pdf_url":"https://arxiv.org/pdf/2306.03625v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12934v1","updated":"2023-12-20T11:20:35Z","published":"2023-12-20T11:20:35Z","title":"Stability of Graph Convolutional Neural Networks through the lens of\n small perturbation analysis","summary":" In this work, we study the problem of stability of Graph Convolutional Neural\nNetworks (GCNs) under random small perturbations in the underlying graph\ntopology, i.e. under a limited number of insertions or deletions of edges. We\nderive a novel bound on the expected difference between the outputs of\nunperturbed and perturbed GCNs. The proposed bound explicitly depends on the\nmagnitude of the perturbation of the eigenpairs of the Laplacian matrix, and\nthe perturbation explicitly depends on which edges are inserted or deleted.\nThen, we provide a quantitative characterization of the effect of perturbing\nspecific edges on the stability of the network. We leverage tools from small\nperturbation analysis to express the bounds in closed, albeit approximate,\nform, in order to enhance interpretability of the results, without the need to\ncompute any perturbed shift operator. Finally, we numerically evaluate the\neffectiveness of the proposed bound.\n","authors":["Lucia Testa","Claudio Battiloro","Stefania Sardellitti","Sergio Barbarossa"],"pdf_url":"https://arxiv.org/pdf/2312.12934v1.pdf","comment":"Accepted for publication in Proc. of 2024 IEEE International\n Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2308.10542v2","updated":"2023-12-20T11:17:24Z","published":"2023-08-21T07:52:39Z","title":"Learning Weakly Convex Regularizers for Convergent Image-Reconstruction\n Algorithms","summary":" We propose to learn non-convex regularizers with a prescribed upper bound on\ntheir weak-convexity modulus. Such regularizers give rise to variational\ndenoisers that minimize a convex energy. They rely on few parameters (less than\n15,000) and offer a signal-processing interpretation as they mimic handcrafted\nsparsity-promoting regularizers. Through numerical experiments, we show that\nsuch denoisers outperform convex-regularization methods as well as the popular\nBM3D denoiser. Additionally, the learned regularizer can be deployed to solve\ninverse problems with iterative schemes that provably converge. For both CT and\nMRI reconstruction, the regularizer generalizes well and offers an excellent\ntradeoff between performance, number of parameters, guarantees, and\ninterpretability when compared to other data-driven approaches.\n","authors":["Alexis Goujon","Sebastian Neumayer","Michael Unser"],"pdf_url":"https://arxiv.org/pdf/2308.10542v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05209v2","updated":"2023-12-20T10:51:06Z","published":"2023-07-11T12:28:05Z","title":"Contextual Pre-Planning on Reward Machine Abstractions for Enhanced\n Transfer in Deep Reinforcement Learning","summary":" Recent studies show that deep reinforcement learning (DRL) agents tend to\noverfit to the task on which they were trained and fail to adapt to minor\nenvironment changes. To expedite learning when transferring to unseen tasks, we\npropose a novel approach to representing the current task using reward machines\n(RMs), state machine abstractions that induce subtasks based on the current\ntask's rewards and dynamics. Our method provides agents with symbolic\nrepresentations of optimal transitions from their current abstract state and\nrewards them for achieving these transitions. These representations are shared\nacross tasks, allowing agents to exploit knowledge of previously encountered\nsymbols and transitions, thus enhancing transfer. Empirical results show that\nour representations improve sample efficiency and few-shot transfer in a\nvariety of domains.\n","authors":["Guy Azran","Mohamad H. Danesh","Stefano V. Albrecht","Sarah Keren"],"pdf_url":"https://arxiv.org/pdf/2307.05209v2.pdf","comment":"Proceedings of the 38th AAAI Conference on Artificial Intelligence\n (AAAI), 2024"},{"id":"http://arxiv.org/abs/2306.14932v3","updated":"2023-12-20T10:47:23Z","published":"2023-06-26T09:42:59Z","title":"GloptiNets: Scalable Non-Convex Optimization with Certificates","summary":" We present a novel approach to non-convex optimization with certificates,\nwhich handles smooth functions on the hypercube or on the torus. Unlike\ntraditional methods that rely on algebraic properties, our algorithm exploits\nthe regularity of the target function intrinsic in the decay of its Fourier\nspectrum. By defining a tractable family of models, we allow at the same time\nto obtain precise certificates and to leverage the advanced and powerful\ncomputational techniques developed to optimize neural networks. In this way the\nscalability of our approach is naturally enhanced by parallel computing with\nGPUs. Our approach, when applied to the case of polynomials of moderate\ndimensions but with thousands of coefficients, outperforms the state-of-the-art\noptimization methods with certificates, as the ones based on Lasserre's\nhierarchy, addressing problems intractable for the competitors.\n","authors":["Gaspard Beugnot","Julien Mairal","Alessandro Rudi"],"pdf_url":"https://arxiv.org/pdf/2306.14932v3.pdf","comment":"Edit affiliations and acknowledgments"},{"id":"http://arxiv.org/abs/2312.08288v2","updated":"2023-12-20T10:46:33Z","published":"2023-12-13T17:04:16Z","title":"Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data\n Setting","summary":" Deep learning models are known to suffer from the problem of bias, and\nresearchers have been exploring methods to address this issue. However, most of\nthese methods require prior knowledge of the bias and are not always practical.\nIn this paper, we focus on a more practical setting with no prior information\nabout the bias. Generally, in this setting, there are a large number of\nbias-aligned samples that cause the model to produce biased predictions and a\nfew bias-conflicting samples that do not conform to the bias. If the training\ndata is limited, the influence of the bias-aligned samples may become even\nstronger on the model predictions, and we experimentally demonstrate that\nexisting debiasing techniques suffer severely in such cases. In this paper, we\nexamine the effects of unknown bias in small dataset regimes and present a\nnovel approach to mitigate this issue. The proposed approach directly addresses\nthe issue of the extremely low occurrence of bias-conflicting samples in\nlimited data settings through the synthesis of hybrid samples that can be used\nto reduce the effect of bias. We perform extensive experiments on several\nbenchmark datasets and experimentally demonstrate the effectiveness of our\nproposed approach in addressing any unknown bias in the presence of limited\ndata. Specifically, our approach outperforms the vanilla, LfF, LDD, and DebiAN\ndebiasing methods by absolute margins of 10.39%, 9.08%, 8.07%, and 9.67% when\nonly 10% of the Corrupted CIFAR-10 Type 1 dataset is available with a\nbias-conflicting sample ratio of 0.05.\n","authors":["Piyush Arora","Pratik Mazumder"],"pdf_url":"https://arxiv.org/pdf/2312.08288v2.pdf","comment":"Accepted in WACV 2024"},{"id":"http://arxiv.org/abs/2312.12909v1","updated":"2023-12-20T10:45:24Z","published":"2023-12-20T10:45:24Z","title":"Energy-efficient Spiking Neural Network Equalization for IM/DD Systems\n with Optimized Neural Encoding","summary":" We propose an energy-efficient equalizer for IM/DD systems based on spiking\nneural networks. We optimize a neural spike encoding that boosts the\nequalizer's performance while decreasing energy consumption.\n","authors":["Alexander von Bank","Eike-Manuel Edelmann","Laurent Schmalen"],"pdf_url":"https://arxiv.org/pdf/2312.12909v1.pdf","comment":"Accepted for publication at OFC 2024"},{"id":"http://arxiv.org/abs/2312.12904v1","updated":"2023-12-20T10:40:41Z","published":"2023-12-20T10:40:41Z","title":"PGN: A perturbation generation network against deep reinforcement\n learning","summary":" Deep reinforcement learning has advanced greatly and applied in many areas.\nIn this paper, we explore the vulnerability of deep reinforcement learning by\nproposing a novel generative model for creating effective adversarial examples\nto attack the agent. Our proposed model can achieve both targeted attacks and\nuntargeted attacks. Considering the specificity of deep reinforcement learning,\nwe propose the action consistency ratio as a measure of stealthiness, and a new\nmeasurement index of effectiveness and stealthiness. Experiment results show\nthat our method can ensure the effectiveness and stealthiness of attack\ncompared with other algorithms. Moreover, our methods are considerably faster\nand thus can achieve rapid and efficient verification of the vulnerability of\ndeep reinforcement learning.\n","authors":["Xiangjuan Li","Feifan Li","Yang Li","Quan Pan"],"pdf_url":"https://arxiv.org/pdf/2312.12904v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12903v1","updated":"2023-12-20T10:36:55Z","published":"2023-12-20T10:36:55Z","title":"A Minimal Control Family of Dynamical Syetem for Universal Approximation","summary":" The universal approximation property (UAP) of neural networks is a\nfundamental characteristic of deep learning. It is widely recognized that a\ncomposition of linear functions and non-linear functions, such as the rectified\nlinear unit (ReLU) activation function, can approximate continuous functions on\ncompact domains. In this paper, we extend this efficacy to the scenario of\ndynamical systems with controls. We prove that the control family\n$\\mathcal{F}_1 = \\mathcal{F}_0 \\cup \\{ \\text{ReLU}(\\cdot)\\} $ is enough to\ngenerate flow maps that can uniformly approximate diffeomorphisms of\n$\\mathbb{R}^d$ on any compact domain, where $\\mathcal{F}_0 = \\{x \\mapsto Ax+b:\nA\\in \\mathbb{R}^{d\\times d}, b \\in \\mathbb{R}^d\\}$ is the set of linear maps\nand the dimension $d\\ge2$. Since $\\mathcal{F}_1$ contains only one nonlinear\nfunction and $\\mathcal{F}_0$ does not hold the UAP, we call $\\mathcal{F}_1$ a\nminimal control family for UAP. Based on this, some sufficient conditions, such\nas the affine invariance, on the control family are established and discussed.\nOur result reveals an underlying connection between the approximation power of\nneural networks and control systems.\n","authors":["Yifei Duan","Yongqiang Cai"],"pdf_url":"https://arxiv.org/pdf/2312.12903v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2205.15834v3","updated":"2023-12-20T09:52:24Z","published":"2022-05-31T14:37:39Z","title":"Attribution-based Explanations that Provide Recourse Cannot be Robust","summary":" Different users of machine learning methods require different explanations,\ndepending on their goals. To make machine learning accountable to society, one\nimportant goal is to get actionable options for recourse, which allow an\naffected user to change the decision $f(x)$ of a machine learning system by\nmaking limited changes to its input $x$. We formalize this by providing a\ngeneral definition of recourse sensitivity, which needs to be instantiated with\na utility function that describes which changes to the decisions are relevant\nto the user. This definition applies to local attribution methods, which\nattribute an importance weight to each input feature. It is often argued that\nsuch local attributions should be robust, in the sense that a small change in\nthe input $x$ that is being explained, should not cause a large change in the\nfeature weights. However, we prove formally that it is in general impossible\nfor any single attribution method to be both recourse sensitive and robust at\nthe same time. It follows that there must always exist counterexamples to at\nleast one of these properties. We provide such counterexamples for several\npopular attribution methods, including LIME, SHAP, Integrated Gradients and\nSmoothGrad. Our results also cover counterfactual explanations, which may be\nviewed as attributions that describe a perturbation of $x$. We further discuss\npossible ways to work around our impossibility result, for instance by allowing\nthe output to consist of sets with multiple attributions, and we provide\nsufficient conditions for specific classes of continuous functions to be\nrecourse sensitive. Finally, we strengthen our impossibility result for the\nrestricted case where users are only able to change a single attribute of $x$,\nby providing an exact characterization of the functions $f$ to which\nimpossibility applies.\n","authors":["Hidde Fokkema","Rianne de Heide","Tim van Erven"],"pdf_url":"https://arxiv.org/pdf/2205.15834v3.pdf","comment":"32 pages, 6 figures"},{"id":"http://arxiv.org/abs/2312.12882v1","updated":"2023-12-20T09:46:42Z","published":"2023-12-20T09:46:42Z","title":"BSL: Understanding and Improving Softmax Loss for Recommendation","summary":" Loss functions steer the optimization direction of recommendation models and\nare critical to model performance, but have received relatively little\nattention in recent recommendation research. Among various losses, we find\nSoftmax loss (SL) stands out for not only achieving remarkable accuracy but\nalso better robustness and fairness. Nevertheless, the current literature lacks\na comprehensive explanation for the efficacy of SL. Toward addressing this\nresearch gap, we conduct theoretical analyses on SL and uncover three insights:\n1) Optimizing SL is equivalent to performing Distributionally Robust\nOptimization (DRO) on the negative data, thereby learning against perturbations\non the negative distribution and yielding robustness to noisy negatives. 2)\nComparing with other loss functions, SL implicitly penalizes the prediction\nvariance, resulting in a smaller gap between predicted values and and thus\nproducing fairer results. Building on these insights, we further propose a\nnovel loss function Bilateral SoftMax Loss (BSL) that extends the advantage of\nSL to both positive and negative sides. BSL augments SL by applying the same\nLog-Expectation-Exp structure to positive examples as is used for negatives,\nmaking the model robust to the noisy positives as well. Remarkably, BSL is\nsimple and easy-to-implement -- requiring just one additional line of code\ncompared to SL. Experiments on four real-world datasets and three\nrepresentative backbones demonstrate the effectiveness of our proposal. The\ncode is available at https://github.com/junkangwu/BSL\n","authors":["Junkang Wu","Jiawei Chen","Jiancan Wu","Wentao Shi","Jizhi Zhang","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12880v1","updated":"2023-12-20T09:45:21Z","published":"2023-12-20T09:45:21Z","title":"Testing the Segment Anything Model on radiology data","summary":" Deep learning models trained with large amounts of data have become a recent\nand effective approach to predictive problem solving -- these have become known\nas \"foundation models\" as they can be used as fundamental tools for other\napplications. While the paramount examples of image classification (earlier)\nand large language models (more recently) led the way, the Segment Anything\nModel (SAM) was recently proposed and stands as the first foundation model for\nimage segmentation, trained on over 10 million images and with recourse to over\n1 billion masks. However, the question remains -- what are the limits of this\nfoundation? Given that magnetic resonance imaging (MRI) stands as an important\nmethod of diagnosis, we sought to understand whether SAM could be used for a\nfew tasks of zero-shot segmentation using MRI data. Particularly, we wanted to\nknow if selecting masks from the pool of SAM predictions could lead to good\nsegmentations.\n Here, we provide a critical assessment of the performance of SAM on magnetic\nresonance imaging data. We show that, while acceptable in a very limited set of\ncases, the overall trend implies that these models are insufficient for MRI\nsegmentation across the whole volume, but can provide good segmentations in a\nfew, specific slices. More importantly, we note that while foundation models\ntrained on natural images are set to become key aspects of predictive\nmodelling, they may prove ineffective when used on other imaging modalities.\n","authors":["José Guilherme de Almeida","Nuno M. Rodrigues","Sara Silva","Nickolas Papanikolaou"],"pdf_url":"https://arxiv.org/pdf/2312.12880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12878v1","updated":"2023-12-20T09:40:07Z","published":"2023-12-20T09:40:07Z","title":"Rule-Extraction Methods From Feedforward Neural Networks: A Systematic\n Literature Review","summary":" Motivated by the interpretability question in ML models as a crucial element\nfor the successful deployment of AI systems, this paper focuses on rule\nextraction as a means for neural networks interpretability. Through a\nsystematic literature review, different approaches for extracting rules from\nfeedforward neural networks, an important block in deep learning models, are\nidentified and explored. The findings reveal a range of methods developed for\nover two decades, mostly suitable for shallow neural networks, with recent\ndevelopments to meet deep learning models' challenges. Rules offer a\ntransparent and intuitive means of explaining neural networks, making this\nstudy a comprehensive introduction for researchers interested in the field.\nWhile the study specifically addresses feedforward networks with supervised\nlearning and crisp rules, future work can extend to other network types,\nmachine learning methods, and fuzzy rule extraction.\n","authors":["Sara El Mekkaoui","Loubna Benabbou","Abdelaziz Berrado"],"pdf_url":"https://arxiv.org/pdf/2312.12878v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12871v1","updated":"2023-12-20T09:34:28Z","published":"2023-12-20T09:34:28Z","title":"Effect Size Estimation for Duration Recommendation in Online\n Experiments: Leveraging Hierarchical Models and Objective Utility Approaches","summary":" The selection of the assumed effect size (AES) critically determines the\nduration of an experiment, and hence its accuracy and efficiency.\nTraditionally, experimenters determine AES based on domain knowledge. However,\nthis method becomes impractical for online experimentation services managing\nnumerous experiments, and a more automated approach is hence of great demand.\nWe initiate the study of data-driven AES selection in for online\nexperimentation services by introducing two solutions. The first employs a\nthree-layer Gaussian Mixture Model considering the heteroskedasticity across\nexperiments, and it seeks to estimate the true expected effect size among\npositive experiments. The second method, grounded in utility theory, aims to\ndetermine the optimal effect size by striking a balance between the\nexperiment's cost and the precision of decision-making. Through comparisons\nwith baseline methods using both simulated and real data, we showcase the\nsuperior performance of the proposed approaches.\n","authors":["Yu Liu","Runzhe Wan","James McQueen","Doug Hains","Jinxiang Gu","Rui Song"],"pdf_url":"https://arxiv.org/pdf/2312.12871v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12869v1","updated":"2023-12-20T09:33:16Z","published":"2023-12-20T09:33:16Z","title":"Parameterized Projected Bellman Operator","summary":" Approximate value iteration~(AVI) is a family of algorithms for reinforcement\nlearning~(RL) that aims to obtain an approximation of the optimal value\nfunction. Generally, AVI algorithms implement an iterated procedure where each\nstep consists of (i) an application of the Bellman operator and (ii) a\nprojection step into a considered function space. Notoriously, the Bellman\noperator leverages transition samples, which strongly determine its behavior,\nas uninformative samples can result in negligible updates or long detours,\nwhose detrimental effects are further exacerbated by the computationally\nintensive projection step. To address these issues, we propose a novel\nalternative approach based on learning an approximate version of the Bellman\noperator rather than estimating it through samples as in AVI approaches. This\nway, we are able to (i) generalize across transition samples and (ii) avoid the\ncomputationally intensive projection step. For this reason, we call our novel\noperator projected Bellman operator (PBO). We formulate an optimization problem\nto learn PBO for generic sequential decision-making problems, and we\ntheoretically analyze its properties in two representative classes of RL\nproblems. Furthermore, we theoretically study our approach under the lens of\nAVI and devise algorithmic implementations to learn PBO in offline and online\nsettings by leveraging neural network parameterizations. Finally, we\nempirically showcase the benefits of PBO w.r.t. the regular Bellman operator on\nseveral RL problems.\n","authors":["Théo Vincent","Alberto Maria Metelli","Boris Belousov","Jan Peters","Marcello Restelli","Carlo D'Eramo"],"pdf_url":"https://arxiv.org/pdf/2312.12869v1.pdf","comment":"Proceedings of the National Conference on Artificial Intelligence\n (AAAI-24)"},{"id":"http://arxiv.org/abs/2312.12863v1","updated":"2023-12-20T09:27:09Z","published":"2023-12-20T09:27:09Z","title":"Federated Learning While Providing Model as a Service: Joint Training\n and Inference Optimization","summary":" While providing machine learning model as a service to process users'\ninference requests, online applications can periodically upgrade the model\nutilizing newly collected data. Federated learning (FL) is beneficial for\nenabling the training of models across distributed clients while keeping the\ndata locally. However, existing work has overlooked the coexistence of model\ntraining and inference under clients' limited resources. This paper focuses on\nthe joint optimization of model training and inference to maximize inference\nperformance at clients. Such an optimization faces several challenges. The\nfirst challenge is to characterize the clients' inference performance when\nclients may partially participate in FL. To resolve this challenge, we\nintroduce a new notion of age of model (AoM) to quantify client-side model\nfreshness, based on which we use FL's global model convergence error as an\napproximate measure of inference performance. The second challenge is the tight\ncoupling among clients' decisions, including participation probability in FL,\nmodel download probability, and service rates. Toward the challenges, we\npropose an online problem approximation to reduce the problem complexity and\noptimize the resources to balance the needs of model training and inference.\nExperimental results demonstrate that the proposed algorithm improves the\naverage inference accuracy by up to 12%.\n","authors":["Pengchao Han","Shiqiang Wang","Yang Jiao","Jianwei Huang"],"pdf_url":"https://arxiv.org/pdf/2312.12863v1.pdf","comment":"Accepted by IEEE International Conference on Computer Communications\n (INFOCOM) 2024"},{"id":"http://arxiv.org/abs/2212.05908v2","updated":"2023-12-20T09:26:38Z","published":"2022-12-12T14:16:26Z","title":"Instance-Conditional Timescales of Decay for Non-Stationary Learning","summary":" Slow concept drift is a ubiquitous, yet under-studied problem in practical\nmachine learning systems. In such settings, although recent data is more\nindicative of future data, naively prioritizing recent instances runs the risk\nof losing valuable information from the past. We propose an optimization-driven\napproach towards balancing instance importance over large training windows.\nFirst, we model instance relevance using a mixture of multiple timescales of\ndecay, allowing us to capture rich temporal trends. Second, we learn an\nauxiliary scorer model that recovers the appropriate mixture of timescales as a\nfunction of the instance itself. Finally, we propose a nested optimization\nobjective for learning the scorer, by which it maximizes forward transfer for\nthe learned model. Experiments on a large real-world dataset of 39M photos over\na 9 year period show upto 15% relative gains in accuracy compared to other\nrobust learning baselines. We replicate our gains on two collections of\nreal-world datasets for non-stationary learning, and extend our work to\ncontinual learning settings where, too, we beat SOTA methods by large margins.\n","authors":["Nishant Jain","Pradeep Shenoy"],"pdf_url":"https://arxiv.org/pdf/2212.05908v2.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.09787v2","updated":"2023-12-20T09:21:25Z","published":"2023-12-15T13:41:20Z","title":"Physics-informed Neural Network Estimation of Material Properties in\n Soft Tissue Nonlinear Biomechanical Models","summary":" The development of biophysical models for clinical applications is rapidly\nadvancing in the research community, thanks to their predictive nature and\ntheir ability to assist the interpretation of clinical data. However,\nhigh-resolution and accurate multi-physics computational models are\ncomputationally expensive and their personalisation involves fine calibration\nof a large number of parameters, which may be space-dependent, challenging\ntheir clinical translation. In this work, we propose a new approach which\nrelies on the combination of physics-informed neural networks (PINNs) with\nthree-dimensional soft tissue nonlinear biomechanical models, capable of\nreconstructing displacement fields and estimating heterogeneous\npatient-specific biophysical properties. The proposed learning algorithm\nencodes information from a limited amount of displacement and, in some cases,\nstrain data, that can be routinely acquired in the clinical setting, and\ncombines it with the physics of the problem, represented by a mathematical\nmodel based on partial differential equations, to regularise the problem and\nimprove its convergence properties. Several benchmarks are presented to show\nthe accuracy and robustness of the proposed method and its great potential to\nenable the robust and effective identification of patient-specific,\nheterogeneous physical properties, s.a. tissue stiffness properties. In\nparticular, we demonstrate the capability of the PINN to detect the presence,\nlocation and severity of scar tissue, which is beneficial to develop\npersonalised simulation models for disease diagnosis, especially for cardiac\napplications.\n","authors":["Federica Caforio","Francesco Regazzoni","Stefano Pagani","Elias Karabelas","Christoph Augustin","Gundolf Haase","Gernot Plank","Alfio Quarteroni"],"pdf_url":"https://arxiv.org/pdf/2312.09787v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12856v1","updated":"2023-12-20T09:19:48Z","published":"2023-12-20T09:19:48Z","title":"SkyScript: A Large and Semantically Diverse Vision-Language Dataset for\n Remote Sensing","summary":" Remote sensing imagery, despite its broad applications in helping achieve\nSustainable Development Goals and tackle climate change, has not yet benefited\nfrom the recent advancements of versatile, task-agnostic vision language models\n(VLMs). A key reason is that the large-scale, semantically diverse image-text\ndataset required for developing VLMs is still absent for remote sensing images.\nUnlike natural images, remote sensing images and their associated text\ndescriptions cannot be efficiently collected from the public Internet at scale.\nIn this work, we bridge this gap by using geo-coordinates to automatically\nconnect open, unlabeled remote sensing images with rich semantics covered in\nOpenStreetMap, and thus construct SkyScript, a comprehensive vision-language\ndataset for remote sensing images, comprising 2.6 million image-text pairs\ncovering 29K distinct semantic tags. With continual pre-training on this\ndataset, we obtain a VLM that surpasses baseline models with a 6.2% average\naccuracy gain in zero-shot scene classification across seven benchmark\ndatasets. It also demonstrates the ability of zero-shot transfer for\nfine-grained object attribute classification and cross-modal retrieval. We hope\nthis dataset can support the advancement of VLMs for various multi-modal tasks\nin remote sensing, such as open-vocabulary classification, retrieval,\ncaptioning, and text-to-image synthesis.\n","authors":["Zhecheng Wang","Rajanie Prabha","Tianyuan Huang","Jiajun Wu","Ram Rajagopal"],"pdf_url":"https://arxiv.org/pdf/2312.12856v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2103.07066v2","updated":"2023-12-20T09:19:41Z","published":"2021-03-12T03:36:03Z","title":"Finding Subgroups with Significant Treatment Effects","summary":" Researchers often run resource-intensive randomized controlled trials (RCTs)\nto estimate the causal effects of interventions on outcomes of interest. Yet\nthese outcomes are often noisy, and estimated overall effects can be small or\nimprecise. Nevertheless, we may still be able to produce reliable evidence of\nthe efficacy of an intervention by finding subgroups with significant effects.\nIn this paper, we propose a machine-learning method that is specifically\noptimized for finding such subgroups in noisy data. Unlike available methods\nfor personalized treatment assignment, our tool is fundamentally designed to\ntake significance testing into account: it produces a subgroup that is chosen\nto maximize the probability of obtaining a statistically significant positive\ntreatment effect. We provide a computationally efficient implementation using\ndecision trees and demonstrate its gain over selecting subgroups based on\npositive (estimated) treatment effects. Compared to standard tree-based\nregression and classification tools, this approach tends to yield higher power\nin detecting subgroups affected by the treatment.\n","authors":["Jann Spiess","Vasilis Syrgkanis","Victor Yaneng Wang"],"pdf_url":"https://arxiv.org/pdf/2103.07066v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12849v1","updated":"2023-12-20T08:59:05Z","published":"2023-12-20T08:59:05Z","title":"Divergences induced by dual subtractive and divisive normalizations of\n exponential families and their convex deformations","summary":" Exponential families are statistical models which are the workhorses in\nstatistics, information theory, and machine learning. An exponential family can\neither be normalized subtractively by its cumulant function or equivalently\nnormalized divisively by its partition function. Both subtractive and divisive\nnormalizers are strictly convex and smooth functions inducing pairs of Bregman\nand Jensen divergences. It is well-known that skewed Bhattacharryya distances\nbetween probability densities of an exponential family amounts to skewed Jensen\ndivergences induced by the cumulant function between their corresponding\nnatural parameters, and in limit cases that the sided Kullback-Leibler\ndivergences amount to reverse-sided Bregman divergences. In this note, we first\nshow that the $\\alpha$-divergences between unnormalized densities of an\nexponential family amounts scaled $\\alpha$-skewed Jensen divergences induced by\nthe partition function. We then show how comparative convexity with respect to\na pair of quasi-arithmetic means allows to deform convex functions and define\ndually flat spaces with corresponding divergences when ordinary convexity is\npreserved.\n","authors":["Frank Nielsen"],"pdf_url":"https://arxiv.org/pdf/2312.12849v1.pdf","comment":"16 pages, 2 figures"},{"id":"http://arxiv.org/abs/2303.00196v3","updated":"2023-12-20T08:57:18Z","published":"2023-03-01T03:05:40Z","title":"Transformed Low-Rank Parameterization Can Help Robust Generalization for\n Tensor Neural Networks","summary":" Achieving efficient and robust multi-channel data learning is a challenging\ntask in data science. By exploiting low-rankness in the transformed domain,\ni.e., transformed low-rankness, tensor Singular Value Decomposition (t-SVD) has\nachieved extensive success in multi-channel data representation and has\nrecently been extended to function representation such as Neural Networks with\nt-product layers (t-NNs). However, it still remains unclear how t-SVD\ntheoretically affects the learning behavior of t-NNs. This paper is the first\nto answer this question by deriving the upper bounds of the generalization\nerror of both standard and adversarially trained t-NNs. It reveals that the\nt-NNs compressed by exact transformed low-rank parameterization can achieve a\nsharper adversarial generalization bound. In practice, although t-NNs rarely\nhave exactly transformed low-rank weights, our analysis further shows that by\nadversarial training with gradient flow (GF), the over-parameterized t-NNs with\nReLU activations are trained with implicit regularization towards transformed\nlow-rank parameterization under certain conditions. We also establish\nadversarial generalization bounds for t-NNs with approximately transformed\nlow-rank weights. Our analysis indicates that the transformed low-rank\nparameterization can promisingly enhance robust generalization for t-NNs.\n","authors":["Andong Wang","Chao Li","Mingyuan Bai","Zhong Jin","Guoxu Zhou","Qibin Zhao"],"pdf_url":"https://arxiv.org/pdf/2303.00196v3.pdf","comment":"51 pages, presented on NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.12844v1","updated":"2023-12-20T08:51:58Z","published":"2023-12-20T08:51:58Z","title":"Causal Discovery under Identifiable Heteroscedastic Noise Model","summary":" Capturing the underlying structural causal relations represented by Directed\nAcyclic Graphs (DAGs) has been a fundamental task in various AI disciplines.\nCausal DAG learning via the continuous optimization framework has recently\nachieved promising performance in terms of both accuracy and efficiency.\nHowever, most methods make strong assumptions of homoscedastic noise, i.e.,\nexogenous noises have equal variances across variables, observations, or even\nboth. The noises in real data usually violate both assumptions due to the\nbiases introduced by different data collection processes. To address the issue\nof heteroscedastic noise, we introduce relaxed and implementable sufficient\nconditions, proving the identifiability of a general class of SEM subject to\nthese conditions. Based on the identifiable general SEM, we propose a novel\nformulation for DAG learning that accounts for the variation in noise variance\nacross variables and observations. We then propose an effective two-phase\niterative DAG learning algorithm to address the increasing optimization\ndifficulties and to learn a causal DAG from data with heteroscedastic variable\nnoise under varying variance. We show significant empirical gains of the\nproposed approaches over state-of-the-art methods on both synthetic data and\nreal data.\n","authors":["Naiyu Yin","Tian Gao","Yue Yu","Qiang Ji"],"pdf_url":"https://arxiv.org/pdf/2312.12844v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12839v1","updated":"2023-12-20T08:47:21Z","published":"2023-12-20T08:47:21Z","title":"Comparing Machine Learning Algorithms by Union-Free Generic Depth","summary":" We propose a framework for descriptively analyzing sets of partial orders\nbased on the concept of depth functions. Despite intensive studies in linear\nand metric spaces, there is very little discussion on depth functions for\nnon-standard data types such as partial orders. We introduce an adaptation of\nthe well-known simplicial depth to the set of all partial orders, the\nunion-free generic (ufg) depth. Moreover, we utilize our ufg depth for a\ncomparison of machine learning algorithms based on multidimensional performance\nmeasures. Concretely, we provide two examples of classifier comparisons on\nsamples of standard benchmark data sets. Our results demonstrate promisingly\nthe wide variety of different analysis approaches based on ufg methods.\nFurthermore, the examples outline that our approach differs substantially from\nexisting benchmarking approaches, and thus adds a new perspective to the vivid\ndebate on classifier comparison.\n","authors":["Hannah Blocher","Georg Schollmeyer","Malte Nalenz","Christoph Jansen"],"pdf_url":"https://arxiv.org/pdf/2312.12839v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2304.09872"},{"id":"http://arxiv.org/abs/2312.12838v1","updated":"2023-12-20T08:42:57Z","published":"2023-12-20T08:42:57Z","title":"FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image\n Segmentation Against Heterogeneous Annotation Noise","summary":" Federated learning (FL) has emerged as a promising paradigm for training\nsegmentation models on decentralized medical data, owing to its\nprivacy-preserving property. However, existing research overlooks the prevalent\nannotation noise encountered in real-world medical datasets, which limits the\nperformance ceilings of FL. In this paper, we, for the first time, identify and\ntackle this problem. For problem formulation, we propose a contour evolution\nfor modeling non-independent and identically distributed (Non-IID) noise across\npixels within each client and then extend it to the case of multi-source data\nto form a heterogeneous noise model (\\textit{i.e.}, Non-IID annotation noise\nacross clients). For robust learning from annotations with such two-level\nNon-IID noise, we emphasize the importance of data quality in model\naggregation, allowing high-quality clients to have a greater impact on FL. To\nachieve this, we propose \\textbf{Fed}erated learning with \\textbf{A}nnotation\nqu\\textbf{A}lity-aware \\textbf{A}ggregat\\textbf{I}on, named \\textbf{FedA$^3$I},\nby introducing a quality factor based on client-wise noise estimation.\nSpecifically, noise estimation at each client is accomplished through the\nGaussian mixture model and then incorporated into model aggregation in a\nlayer-wise manner to up-weight high-quality clients. Extensive experiments on\ntwo real-world medical image segmentation datasets demonstrate the superior\nperformance of FedA$^3$I against the state-of-the-art approaches in dealing\nwith cross-client annotation noise. The code is available at\n\\color{blue}{https://github.com/wnn2000/FedAAAI}.\n","authors":["Nannan Wu","Zhaobin Sun","Zengqiang Yan","Li Yu"],"pdf_url":"https://arxiv.org/pdf/2312.12838v1.pdf","comment":"Accepted at AAAI'24"},{"id":"http://arxiv.org/abs/2312.11831v2","updated":"2023-12-20T08:41:57Z","published":"2023-12-19T03:45:27Z","title":"Locally-Minimal Probabilistic Explanations","summary":" Formal abductive explanations offer crucial guarantees of rigor and so are of\ninterest in high-stakes uses of machine learning (ML). One drawback of\nabductive explanations is explanation size, justified by the cognitive limits\nof human decision-makers. Probabilistic abductive explanations (PAXps) address\nthis limitation, but their theoretical and practical complexity makes their\nexact computation most often unrealistic. This paper proposes novel efficient\nalgorithms for the computation of locally-minimal PXAps, which offer\nhigh-quality approximations of PXAps in practice. The experimental results\ndemonstrate the practical efficiency of the proposed algorithms.\n","authors":["Yacine Izza","Kuldeep S. Meel","Joao Marques-Silva"],"pdf_url":"https://arxiv.org/pdf/2312.11831v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12835v1","updated":"2023-12-20T08:36:55Z","published":"2023-12-20T08:36:55Z","title":"Near-Optimal Resilient Aggregation Rules for Distributed Learning Using\n 1-Center and 1-Mean Clustering with Outliers","summary":" Byzantine machine learning has garnered considerable attention in light of\nthe unpredictable faults that can occur in large-scale distributed learning\nsystems. The key to secure resilience against Byzantine machines in distributed\nlearning is resilient aggregation mechanisms. Although abundant resilient\naggregation rules have been proposed, they are designed in ad-hoc manners,\nimposing extra barriers on comparing, analyzing, and improving the rules across\nperformance criteria. This paper studies near-optimal aggregation rules using\nclustering in the presence of outliers. Our outlier-robust clustering approach\nutilizes geometric properties of the update vectors provided by workers. Our\nanalysis show that constant approximations to the 1-center and 1-mean\nclustering problems with outliers provide near-optimal resilient aggregators\nfor metric-based criteria, which have been proven to be crucial in the\nhomogeneous and heterogeneous cases respectively. In addition, we discuss two\ncontradicting types of attacks under which no single aggregation rule is\nguaranteed to improve upon the naive average. Based on the discussion, we\npropose a two-phase resilient aggregation framework. We run experiments for\nimage classification using a non-convex loss function. The proposed algorithms\noutperform previously known aggregation rules by a large margin with both\nhomogeneous and heterogeneous data distributions among non-faulty workers. Code\nand appendix are available at https://github.com/jerry907/AAAI24-RASHB.\n","authors":["Yuhao Yi","Ronghui You","Hong Liu","Changxin Liu","Yuan Wang","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2312.12835v1.pdf","comment":"17 pages, 4 figures. Accepted by the 38th Annual AAAI Conference on\n Artificial Intelligence (AAAI'24)"},{"id":"http://arxiv.org/abs/2309.02033v3","updated":"2023-12-20T08:27:40Z","published":"2023-09-05T08:22:07Z","title":"Data-Juicer: A One-Stop Data Processing System for Large Language Models","summary":" The immense evolution in Large Language Models (LLMs) has underscored the\nimportance of massive, heterogeneous, and high-quality data. A data recipe is a\nmixture of data from different sources for training LLMs, which plays a vital\nrole in LLMs' performance. Existing open-source tools for LLM data processing\nare mostly tailored for specific data recipes. To continuously uncover the\npotential of LLMs, incorporate data from new sources, and improve LLMs'\nperformance, we build a new system named Data-Juicer, with which we can\nefficiently generate diverse data recipes, explore different possibilities in\nforming data mixtures, and evaluate their effects on model performance.\nDifferent from traditional data-analytics pipelines, Data-Juicer faces some\nunique challenges. Firstly, the possible data sources for forming data recipes\nare truly heterogeneous and massive with various qualities. Secondly, it is\nextremely expensive to precisely evaluate data recipes' impact on LLMs'\nperformance. Thirdly, the end users of Data-Juicer, model developers, need\nsufficient flexibility to configure and evaluate different data recipes.\n Data-Juicer features a fine-grained abstraction of pipelines for constructing\ndata recipes, with over 50 built-in operators for easy composition and\nextension. By incorporating visualization and auto-evaluation capabilities,\nData-Juicer enables a timely feedback loop for both LLM pre-training and\nfine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems\nfor LLM training, evaluation, and distributed computing. The data recipes\nderived with Data-Juicer gain notable improvements on state-of-the-art LLMs, by\nup to 7.45% increase in averaged score across 16 LLM benchmarks and 17.5%\nhigher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and\ntutorials are released, calling for broader data-centric research on training\nand understanding LLMs.\n","authors":["Daoyuan Chen","Yilun Huang","Zhijian Ma","Hesen Chen","Xuchen Pan","Ce Ge","Dawei Gao","Yuexiang Xie","Zhaoyang Liu","Jinyang Gao","Yaliang Li","Bolin Ding","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2309.02033v3.pdf","comment":"20 Pages, 10 figures, 9 tables. The system, data recipes, and demos\n are continuously maintained at https://github.com/alibaba/data-juicer"},{"id":"http://arxiv.org/abs/2212.01071v5","updated":"2023-12-20T08:18:14Z","published":"2022-12-02T10:22:18Z","title":"Fake detection in imbalance dataset by Semi-supervised learning with GAN","summary":" As social media continues to grow rapidly, the prevalence of harassment on\nthese platforms has also increased. This has piqued the interest of researchers\nin the field of fake detection. Social media data, often forms complex graphs\nwith numerous nodes, posing several challenges. These challenges and\nlimitations include dealing with a significant amount of irrelevant features in\nmatrices and addressing issues such as high data dispersion and an imbalanced\nclass distribution within the dataset. To overcome these challenges and\nlimitations, researchers have employed auto-encoders and a combination of\nsemi-supervised learning with a GAN algorithm, referred to as SGAN. Our\nproposed method utilizes auto-encoders for feature extraction and incorporates\nSGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN\ncompensates for the limited availability of labeled data, making efficient use\nof the limited number of labeled instances. Multiple evaluation metrics were\nemployed, including the Confusion Matrix and the ROC curve. The dataset was\ndivided into training and testing sets, with 100 labeled samples for training\nand 1,000 samples for testing. The novelty of our research lies in applying\nSGAN to address the issue of imbalanced datasets in fake account detection. By\noptimizing the use of a smaller number of labeled instances and reducing the\nneed for extensive computational power, our method offers a more efficient\nsolution. Additionally, our study contributes to the field by achieving an 81%\naccuracy in detecting fake accounts using only 100 labeled samples. This\ndemonstrates the potential of SGAN as a powerful tool for handling minority\nclasses and addressing big data challenges in fake account detection.\n","authors":["Jinus Bordbar","Saman Ardalan","Mohammadreza Mohammadrezaie","Zahra Ghasemi"],"pdf_url":"https://arxiv.org/pdf/2212.01071v5.pdf","comment":"needed more investigation o final results"},{"id":"http://arxiv.org/abs/2304.03483v3","updated":"2023-12-20T08:18:10Z","published":"2023-04-07T05:29:59Z","title":"RED-PSM: Regularization by Denoising of Partially Separable Models for\n Dynamic Imaging","summary":" Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at\neach time instant using its undersampled measurements. In particular, in the\ncase of dynamic tomography, only a single projection at a single view angle may\nbe available at a time, making the problem severely ill-posed. In this work, we\npropose an approach, RED-PSM, which combines for the first time two powerful\ntechniques to address this challenging imaging problem. The first, are\npartially separable models, which have been used to efficiently introduce a\nlow-rank prior for the spatio-temporal object. The second is the recent\n\\textit{Regularization by Denoising (RED)}, which provides a flexible framework\nto exploit the impressive performance of state-of-the-art image denoising\nalgorithms, for various inverse problems. We propose a partially separable\nobjective with RED and a computationally efficient and scalable optimization\nscheme with variable splitting and ADMM. Theoretical analysis proves the\nconvergence of our objective to a value corresponding to a stationary point\nsatisfying the first-order optimality conditions. Convergence is accelerated by\na particular projection-domain-based initialization. We demonstrate the\nperformance and computational improvements of our proposed RED-PSM with a\nlearned image denoiser by comparing it to a recent deep-prior-based method\nknown as TD-DIP. Although the main focus is on dynamic tomography, we also show\nperformance advantages of RED-PSM in a cardiac dynamic MRI setting.\n","authors":["Berk Iskender","Marc L. Klasky","Yoram Bresler"],"pdf_url":"https://arxiv.org/pdf/2304.03483v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.15657v4","updated":"2023-12-20T08:17:39Z","published":"2022-10-25T10:20:27Z","title":"Detecting fake accounts through Generative Adversarial Network in online\n social media","summary":" Online social media is integral to human life, facilitating messaging,\ninformation sharing, and confidential communication while preserving privacy.\nPlatforms like Twitter, Instagram, and Facebook exemplify this phenomenon.\nHowever, users face challenges due to network anomalies, often stemming from\nmalicious activities such as identity theft for financial gain or harm. This\npaper proposes a novel method using user similarity measures and the Generative\nAdversarial Network (GAN) algorithm to identify fake user accounts in the\nTwitter dataset. Despite the problem's complexity, the method achieves an AUC\nrate of 80\\% in classifying and detecting fake accounts. Notably, the study\nbuilds on previous research, highlighting advancements and insights into the\nevolving landscape of anomaly detection in online social networks.\n","authors":["Jinus Bordbar","Mohammadreza Mohammadrezaie","Saman Ardalan","Mohammad Ebrahim Shiri"],"pdf_url":"https://arxiv.org/pdf/2210.15657v4.pdf","comment":"needed more investigation on final results"},{"id":"http://arxiv.org/abs/2304.04353v2","updated":"2023-12-20T08:16:36Z","published":"2023-04-10T02:22:36Z","title":"Exponentially Improved Efficient and Accurate Machine Learning for\n Quantum Many-body States with Provable Guarantees","summary":" Solving the ground state and the ground-state properties of quantum many-body\nsystems is generically a hard task for classical algorithms. For a family of\nHamiltonians defined on an $m$-dimensional space of physical parameters, the\nground state and its properties at an arbitrary parameter configuration can be\npredicted via a machine learning protocol up to a prescribed prediction error\n$\\varepsilon$, provided that a sample set (of size $N$) of the states can be\nefficiently prepared and measured. In a recent work [Huang et al., Science 377,\neabk3333 (2022)], a rigorous guarantee for such a generalization was proved.\nUnfortunately, an exponential scaling for the provable sample complexity,\n$N=m^{{\\cal{O}}\\left(\\frac{1}{\\varepsilon}\\right)}$, was found to be universal\nfor generic gapped Hamiltonians. This result applies to the situation where the\ndimension of the parameter space is large while the scaling with the accuracy\nis not an urgent factor. In this work, we consider an alternative scenario\nwhere $m$ is a finite, not necessarily large constant while the scaling with\nthe prediction error becomes the central concern. By jointly preserving the\nfundamental properties of density matrices in the learning protocol and\nutilizing the continuity of quantum states in the parameter range of interest,\nwe rigorously obtain a polynomial sample complexity for predicting quantum\nmany-body states and their properties, with respect to the uniform prediction\nerror $\\varepsilon$ and the number of qubits $n$. Moreover, if restricted to\nlearning local quantum-state properties, the number of samples with respect to\n$n$ can be further reduced exponentially. Our results provide theoretical\nguarantees for efficient and accurate learning of quantum many-body states and\ntheir properties, with model-independent applications not restricted to ground\nstates of gapped Hamiltonians.\n","authors":["Yanming Che","Clemens Gneiting","Franco Nori"],"pdf_url":"https://arxiv.org/pdf/2304.04353v2.pdf","comment":"8 + 13 pages, 2 + 1 figures; With supplemental material (SM).\n Improved presentation to highlight our new findings; Added numerical\n demonstration with a quantum XY model; Added Sec. II in the SM"},{"id":"http://arxiv.org/abs/2312.08410v2","updated":"2023-12-20T08:16:10Z","published":"2023-12-13T11:27:15Z","title":"Universal Approximation Property of Random Neural Networks","summary":" In this paper, we study random neural networks which are single-hidden-layer\nfeedforward neural networks whose weights and biases are randomly initialized.\nAfter this random initialization, only the linear readout needs to be trained,\nwhich can be performed efficiently, e.g., by the least squares method. By\nviewing random neural networks as Banach space-valued random variables, we\nprove a universal approximation theorem within a large class of Bochner spaces.\nHereby, the corresponding Banach space can be significantly more general than\nthe space of continuous functions over a compact subset of a Euclidean space,\nnamely, e.g., an $L^p$-space or a Sobolev space, where the latter includes the\napproximation of the derivatives. Moreover, we derive approximation rates and\nan explicit algorithm to learn a deterministic function by a random neural\nnetwork. In addition, we provide a full error analysis and study when random\nneural networks overcome the curse of dimensionality in the sense that the\ntraining costs scale at most polynomially in the input and output dimension.\nFurthermore, we show in two numerical examples the empirical advantages of\nrandom neural networks compared to fully trained deterministic neural networks.\n","authors":["Ariel Neufeld","Philipp Schmocker"],"pdf_url":"https://arxiv.org/pdf/2312.08410v2.pdf","comment":"64 pages, 3 figures"},{"id":"http://arxiv.org/abs/2307.16092v2","updated":"2023-12-20T07:43:53Z","published":"2023-07-29T23:31:18Z","title":"Feature Transportation Improves Graph Neural Networks","summary":" Graph neural networks (GNNs) have shown remarkable success in learning\nrepresentations for graph-structured data. However, GNNs still face challenges\nin modeling complex phenomena that involve feature transportation. In this\npaper, we propose a novel GNN architecture inspired by\nAdvection-Diffusion-Reaction systems, called ADR-GNN. Advection models feature\ntransportation, while diffusion captures the local smoothing of features, and\nreaction represents the non-linear transformation between feature channels. We\nprovide an analysis of the qualitative behavior of ADR-GNN, that shows the\nbenefit of combining advection, diffusion, and reaction. To demonstrate its\nefficacy, we evaluate ADR-GNN on real-world node classification and\nspatio-temporal datasets, and show that it improves or offers competitive\nperformance compared to state-of-the-art networks.\n","authors":["Moshe Eliasof","Eldad Haber","Eran Treister"],"pdf_url":"https://arxiv.org/pdf/2307.16092v2.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.11562v2","updated":"2023-12-20T07:25:58Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v2.pdf","comment":"20 Figures, 159 Pages, 740 References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2306.06041v2","updated":"2023-12-20T06:58:35Z","published":"2023-06-09T17:07:04Z","title":"A Graph Dynamics Prior for Relational Inference","summary":" Relational inference aims to identify interactions between parts of a\ndynamical system from the observed dynamics. Current state-of-the-art methods\nfit the dynamics with a graph neural network (GNN) on a learnable graph. They\nuse one-step message-passing GNNs -- intuitively the right choice since\nnon-locality of multi-step or spectral GNNs may confuse direct and indirect\ninteractions. But the \\textit{effective} interaction graph depends on the\nsampling rate and it is rarely localized to direct neighbors, leading to poor\nlocal optima for the one-step model. In this work, we propose a \\textit{graph\ndynamics prior} (GDP) for relational inference. GDP constructively uses error\namplification in non-local polynomial filters to steer the solution to the\nground-truth graph. To deal with non-uniqueness, GDP simultaneously fits a\n``shallow'' one-step model and a polynomial multi-step model with shared graph\ntopology. Experiments show that GDP reconstructs graphs far more accurately\nthan earlier methods, with remarkable robustness to under-sampling. Since\nappropriate sampling rates for unknown dynamical systems are not known a\npriori, this robustness makes GDP suitable for real applications in scientific\nmachine learning. Reproducible code is available at\nhttps://github.com/DaDaCheng/GDP.\n","authors":["Liming Pan","Cheng Shi","Ivan Dokmanić"],"pdf_url":"https://arxiv.org/pdf/2306.06041v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13646v3","updated":"2023-12-20T06:52:08Z","published":"2023-04-26T16:08:49Z","title":"Data-driven Piecewise Affine Decision Rules for Stochastic Programming\n with Covariate Information","summary":" Focusing on stochastic programming (SP) with covariate information, this\npaper proposes an empirical risk minimization (ERM) method embedded within a\nnonconvex piecewise affine decision rule (PADR), which aims to learn the direct\nmapping from features to optimal decisions. We establish the nonasymptotic\nconsistency result of our PADR-based ERM model for unconstrained problems and\nasymptotic consistency result for constrained ones. To solve the nonconvex and\nnondifferentiable ERM problem, we develop an enhanced stochastic\nmajorization-minimization algorithm and establish the asymptotic convergence to\n(composite strong) directional stationarity along with complexity analysis. We\nshow that the proposed PADR-based ERM method applies to a broad class of\nnonconvex SP problems with theoretical consistency guarantees and computational\ntractability. Our numerical study demonstrates the superior performance of\nPADR-based ERM methods compared to state-of-the-art approaches under various\nsettings, with significantly lower costs, less computation time, and robustness\nto feature dimensions and nonlinearity of the underlying dependency.\n","authors":["Yiyang Zhang","Junyi Liu","Xiaobo Zhao"],"pdf_url":"https://arxiv.org/pdf/2304.13646v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.10525v3","updated":"2023-12-20T06:46:19Z","published":"2022-11-18T22:48:09Z","title":"Differentiable Uncalibrated Imaging","summary":" We propose a differentiable imaging framework to address uncertainty in\nmeasurement coordinates such as sensor locations and projection angles. We\nformulate the problem as measurement interpolation at unknown nodes supervised\nthrough the forward operator. To solve it we apply implicit neural networks,\nalso known as neural fields, which are naturally differentiable with respect to\nthe input coordinates. We also develop differentiable spline interpolators\nwhich perform as well as neural networks, require less time to optimize and\nhave well-understood properties. Differentiability is key as it allows us to\njointly fit a measurement representation, optimize over the uncertain\nmeasurement coordinates, and perform image reconstruction which in turn ensures\nconsistent calibration. We apply our approach to 2D and 3D computed tomography,\nand show that it produces improved reconstructions compared to baselines that\ndo not account for the lack of calibration. The flexibility of the proposed\nframework makes it easy to extend to almost arbitrary imaging problems.\n","authors":["Sidharth Gupta","Konik Kothari","Valentin Debarnot","Ivan Dokmanić"],"pdf_url":"https://arxiv.org/pdf/2211.10525v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12794v1","updated":"2023-12-20T06:34:15Z","published":"2023-12-20T06:34:15Z","title":"Bandit Sequential Posted Pricing via Half-Concavity","summary":" Sequential posted pricing auctions are popular because of their simplicity in\npractice and their tractability in theory. A usual assumption in their study is\nthat the Bayesian prior distributions of the buyers are known to the seller,\nwhile in reality these priors can only be accessed from historical data. To\novercome this assumption, we study sequential posted pricing in the bandit\nlearning model, where the seller interacts with $n$ buyers over $T$ rounds: In\neach round the seller posts $n$ prices for the $n$ buyers and the first buyer\nwith a valuation higher than the price takes the item. The only feedback that\nthe seller receives in each round is the revenue.\n Our main results obtain nearly-optimal regret bounds for single-item\nsequential posted pricing in the bandit learning model. In particular, we\nachieve an $\\tilde{O}(\\mathsf{poly}(n)\\sqrt{T})$ regret for buyers with\n(Myerson's) regular distributions and an\n$\\tilde{O}(\\mathsf{poly}(n)T^{{2}/{3}})$ regret for buyers with general\ndistributions, both of which are tight in the number of rounds $T$. Our result\nfor regular distributions was previously not known even for the single-buyer\nsetting and relies on a new half-concavity property of the revenue function in\nthe value space. For $n$ sequential buyers, our technique is to run a\ngeneralized single-buyer algorithm for all the buyers and to carefully bound\nthe regret from the sub-optimal pricing of the suffix buyers.\n","authors":["Sahil Singla","Yifan Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.14606v3","updated":"2023-12-20T06:26:36Z","published":"2023-08-28T14:20:53Z","title":"On the Tradeoff between Privacy Preservation and Byzantine-Robustness in\n Decentralized Learning","summary":" This paper jointly considers privacy preservation and Byzantine-robustness in\ndecentralized learning. In a decentralized network, honest-but-curious agents\nfaithfully follow the prescribed algorithm, but expect to infer their\nneighbors' private data from messages received during the learning process,\nwhile dishonest-and-Byzantine agents disobey the prescribed algorithm, and\ndeliberately disseminate wrong messages to their neighbors so as to bias the\nlearning process. For this novel setting, we investigate a generic\nprivacy-preserving and Byzantine-robust decentralized stochastic gradient\ndescent (SGD) framework, in which Gaussian noise is injected to preserve\nprivacy and robust aggregation rules are adopted to counteract Byzantine\nattacks. We analyze its learning error and privacy guarantee, discovering an\nessential tradeoff between privacy preservation and Byzantine-robustness in\ndecentralized learning -- the learning error caused by defending against\nByzantine attacks is exacerbated by the Gaussian noise added to preserve\nprivacy. For a class of state-of-the-art robust aggregation rules, we give\nunified analysis of the \"mixing abilities\". Building upon this analysis, we\nreveal how the \"mixing abilities\" affect the tradeoff between privacy\npreservation and Byzantine-robustness. The theoretical results provide\nguidelines for achieving a favorable tradeoff with proper design of robust\naggregation rules. Numerical experiments are conducted and corroborate our\ntheoretical findings.\n","authors":["Haoxiang Ye","Heng Zhu","Qing Ling"],"pdf_url":"https://arxiv.org/pdf/2308.14606v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12791v1","updated":"2023-12-20T06:25:02Z","published":"2023-12-20T06:25:02Z","title":"Model-Based Control with Sparse Neural Dynamics","summary":" Learning predictive models from observations using deep neural networks\n(DNNs) is a promising new approach to many real-world planning and control\nproblems. However, common DNNs are too unstructured for effective planning, and\ncurrent control methods typically rely on extensive sampling or local gradient\ndescent. In this paper, we propose a new framework for integrated model\nlearning and predictive control that is amenable to efficient optimization\nalgorithms. Specifically, we start with a ReLU neural model of the system\ndynamics and, with minimal losses in prediction accuracy, we gradually sparsify\nit by removing redundant neurons. This discrete sparsification process is\napproximated as a continuous problem, enabling an end-to-end optimization of\nboth the model architecture and the weight parameters. The sparsified model is\nsubsequently used by a mixed-integer predictive controller, which represents\nthe neuron activations as binary variables and employs efficient\nbranch-and-bound algorithms. Our framework is applicable to a wide variety of\nDNNs, from simple multilayer perceptrons to complex graph neural dynamics. It\ncan efficiently handle tasks involving complicated contact dynamics, such as\nobject pushing, compositional object sorting, and manipulation of deformable\nobjects. Numerical and hardware experiments show that, despite the aggressive\nsparsification, our framework can deliver better closed-loop performance than\nexisting state-of-the-art methods.\n","authors":["Ziang Liu","Genggeng Zhou","Jeff He","Tobia Marcucci","Li Fei-Fei","Jiajun Wu","Yunzhu Li"],"pdf_url":"https://arxiv.org/pdf/2312.12791v1.pdf","comment":"Accepted at NeurIPS 2023. For tutorial code and additional\n visualizations, see https://robopil.github.io/Sparse-Dynamics/"},{"id":"http://arxiv.org/abs/2312.12789v1","updated":"2023-12-20T06:22:21Z","published":"2023-12-20T06:22:21Z","title":"SLP-Net:An efficient lightweight network for segmentation of skin\n lesions","summary":" Prompt treatment for melanoma is crucial. To assist physicians in identifying\nlesion areas precisely in a quick manner, we propose a novel skin lesion\nsegmentation technique namely SLP-Net, an ultra-lightweight segmentation\nnetwork based on the spiking neural P(SNP) systems type mechanism. Most\nexisting convolutional neural networks achieve high segmentation accuracy while\nneglecting the high hardware cost. SLP-Net, on the contrary, has a very small\nnumber of parameters and a high computation speed. We design a lightweight\nmulti-scale feature extractor without the usual encoder-decoder structure.\nRather than a decoder, a feature adaptation module is designed to replace it\nand implement multi-scale information decoding. Experiments at the ISIC2018\nchallenge demonstrate that the proposed model has the highest Acc and DSC among\nthe state-of-the-art methods, while experiments on the PH2 dataset also\ndemonstrate a favorable generalization ability. Finally, we compare the\ncomputational complexity as well as the computational speed of the models in\nexperiments, where SLP-Net has the highest overall superiority\n","authors":["Bo Yang","Hong Peng","Chenggang Guo","Xiaohui Luo","Jun Wang","Xianzhong Long"],"pdf_url":"https://arxiv.org/pdf/2312.12789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.04273v2","updated":"2023-12-20T06:15:11Z","published":"2023-12-07T12:53:05Z","title":"Invariant Random Forest: Tree-Based Model Solution for OOD\n Generalization","summary":" Out-Of-Distribution (OOD) generalization is an essential topic in machine\nlearning. However, recent research is only focusing on the corresponding\nmethods for neural networks. This paper introduces a novel and effective\nsolution for OOD generalization of decision tree models, named Invariant\nDecision Tree (IDT). IDT enforces a penalty term with regard to the\nunstable/varying behavior of a split across different environments during the\ngrowth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is\nconstructed. Our proposed method is motivated by a theoretical result under\nmild conditions, and validated by numerical tests with both synthetic and real\ndatasets. The superior performance compared to non-OOD tree models implies that\nconsidering OOD generalization for tree models is absolutely necessary and\nshould be given more attention.\n","authors":["Yufan Liao","Qi Wu","Xing Yan"],"pdf_url":"https://arxiv.org/pdf/2312.04273v2.pdf","comment":"AAAI Conference on Artificial Intelligence, 2024"},{"id":"http://arxiv.org/abs/2312.12784v1","updated":"2023-12-20T06:10:27Z","published":"2023-12-20T06:10:27Z","title":"Fast Cell Library Characterization for Design Technology Co-Optimization\n Based on Graph Neural Networks","summary":" Design technology co-optimization (DTCO) plays a critical role in achieving\noptimal power, performance, and area (PPA) for advanced semiconductor process\ndevelopment. Cell library characterization is essential in DTCO flow, but\ntraditional methods are time-consuming and costly. To overcome these\nchallenges, we propose a graph neural network (GNN)-based machine learning\nmodel for rapid and accurate cell library characterization. Our model\nincorporates cell structures and demonstrates high prediction accuracy across\nvarious process-voltage-temperature (PVT) corners and technology parameters.\nValidation with 512 unseen technology corners and over one million test data\npoints shows accurate predictions of delay, power, and input pin capacitance\nfor 33 types of cells, with a mean absolute percentage error (MAPE) $\\le$ 0.95%\nand a speed-up of 100X compared with SPICE simulations. Additionally, we\ninvestigate system-level metrics such as worst negative slack (WNS), leakage\npower, and dynamic power using predictions obtained from the GNN-based model on\nunseen corners. Our model achieves precise predictions, with absolute error\n$\\le$3.0 ps for WNS, percentage errors $\\le$0.60% for leakage power, and\n$\\le$0.99% for dynamic power, when compared to golden reference. With the\ndeveloped model, we further proposed a fine-grained drive strength\ninterpolation methodology to enhance PPA for small-to-medium-scale designs,\nresulting in an approximate 1-3% improvement.\n","authors":["Tianliang Ma","Zhihui Deng","Xuguang Sun","Leilai Shao"],"pdf_url":"https://arxiv.org/pdf/2312.12784v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05614v2","updated":"2023-12-20T05:59:10Z","published":"2023-12-09T17:01:18Z","title":"Transformer as Linear Expansion of Learngene","summary":" We propose expanding the shared Transformer module to produce and initialize\nTransformers of varying depths, enabling adaptation to diverse resource\nconstraints. Drawing an analogy to genetic expansibility, we term such module\nas learngene. To identify the expansion mechanism, we delve into the\nrelationship between the layer's position and its corresponding weight value,\nand find that linear function appropriately approximates this relationship.\nBuilding on this insight, we present Transformer as Linear Expansion of\nlearnGene (TLEG), a novel approach for flexibly producing and initializing\nTransformers of diverse depths. Specifically, to learn learngene, we firstly\nconstruct an auxiliary Transformer linearly expanded from learngene, after\nwhich we train it through employing soft distillation. Subsequently, we can\nproduce and initialize Transformers of varying depths via linearly expanding\nthe well-trained learngene, thereby supporting diverse downstream scenarios.\nExtensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable\nor better performance in contrast to many individual models trained from\nscratch, while reducing around 2x training cost. When transferring to several\ndownstream classification datasets, TLEG surpasses existing initialization\nmethods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100).\nUnder the situation where we need to produce models of varying depths adapting\nfor different resource constraints, TLEG achieves comparable results while\nreducing around 19x parameters stored to initialize these models and around 5x\npre-training costs, in contrast to the pre-training and fine-tuning approach.\nWhen transferring a fixed set of parameters to initialize different models,\nTLEG presents better flexibility and competitive performance while reducing\naround 2.9x parameters stored to initialize, compared to the pre-training\napproach.\n","authors":["Shiyu Xia","Miaosen Zhang","Xu Yang","Ruiming Chen","Haokun Chen","Xin Geng"],"pdf_url":"https://arxiv.org/pdf/2312.05614v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12781v1","updated":"2023-12-20T05:55:05Z","published":"2023-12-20T05:55:05Z","title":"DynaLay: An Introspective Approach to Dynamic Layer Selection for Deep\n Networks","summary":" Deep learning models have become increasingly computationally intensive,\nrequiring extensive computational resources and time for both training and\ninference. A significant contributing factor to this challenge is the uniform\ncomputational effort expended on each input example, regardless of its\ncomplexity. We introduce \\textbf{DynaLay}, an alternative architecture that\nfeatures a decision-making agent to adaptively select the most suitable layers\nfor processing each input, thereby endowing the model with a remarkable level\nof introspection. DynaLay reevaluates more complex inputs during inference,\nadjusting the computational effort to optimize both performance and efficiency.\nThe core of the system is a main model equipped with Fixed-Point Iterative\n(FPI) layers, capable of accurately approximating complex functions, paired\nwith an agent that chooses these layers or a direct action based on the\nintrospection of the models inner state. The model invests more time in\nprocessing harder examples, while minimal computation is required for easier\nones. This introspective approach is a step toward developing deep learning\nmodels that \"think\" and \"ponder\", rather than \"ballistically'' produce answers.\nOur experiments demonstrate that DynaLay achieves accuracy comparable to\nconventional deep models while significantly reducing computational demands.\n","authors":["Mrinal Mathur","Sergey Plis"],"pdf_url":"https://arxiv.org/pdf/2312.12781v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12773v1","updated":"2023-12-20T05:17:06Z","published":"2023-12-20T05:17:06Z","title":"Segmenting Messy Text: Detecting Boundaries in Text Derived from\n Historical Newspaper Images","summary":" Text segmentation, the task of dividing a document into sections, is often a\nprerequisite for performing additional natural language processing tasks.\nExisting text segmentation methods have typically been developed and tested\nusing clean, narrative-style text with segments containing distinct topics.\nHere we consider a challenging text segmentation task: dividing newspaper\nmarriage announcement lists into units of one announcement each. In many cases\nthe information is not structured into sentences, and adjacent segments are not\ntopically distinct from each other. In addition, the text of the announcements,\nwhich is derived from images of historical newspapers via optical character\nrecognition, contains many typographical errors. As a result, these\nannouncements are not amenable to segmentation with existing techniques. We\npresent a novel deep learning-based model for segmenting such text and show\nthat it significantly outperforms an existing state-of-the-art method on our\ntask.\n","authors":["Carol Anderson","Phil Crone"],"pdf_url":"https://arxiv.org/pdf/2312.12773v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2309.15312v3","updated":"2023-12-20T04:51:44Z","published":"2023-09-26T23:43:37Z","title":"MAPTree: Beating \"Optimal\" Decision Trees with Bayesian Decision Trees","summary":" Decision trees remain one of the most popular machine learning models today,\nlargely due to their out-of-the-box performance and interpretability. In this\nwork, we present a Bayesian approach to decision tree induction via maximum a\nposteriori inference of a posterior distribution over trees. We first\ndemonstrate a connection between maximum a posteriori inference of decision\ntrees and AND/OR search. Using this connection, we propose an AND/OR search\nalgorithm, dubbed MAPTree, which is able to recover the maximum a posteriori\ntree. Lastly, we demonstrate the empirical performance of the maximum a\nposteriori tree both on synthetic data and in real world settings. On 16 real\nworld datasets, MAPTree either outperforms baselines or demonstrates comparable\nperformance but with much smaller trees. On a synthetic dataset, MAPTree also\ndemonstrates greater robustness to noise and better generalization than\nexisting approaches. Finally, MAPTree recovers the maxiumum a posteriori tree\nfaster than existing sampling approaches and, in contrast with those\nalgorithms, is able to provide a certificate of optimality. The code for our\nexperiments is available at https://github.com/ThrunGroup/maptree.\n","authors":["Colin Sullivan","Mo Tiwari","Sebastian Thrun"],"pdf_url":"https://arxiv.org/pdf/2309.15312v3.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2306.12045v6","updated":"2023-12-20T04:22:24Z","published":"2023-06-21T06:30:18Z","title":"Temporal Conditioning Spiking Latent Variable Models of the Neural\n Response to Natural Visual Scenes","summary":" Developing computational models of neural response is crucial for\nunderstanding sensory processing and neural computations. Current\nstate-of-the-art neural network methods use temporal filters to handle temporal\ndependencies, resulting in an unrealistic and inflexible processing paradigm.\nMeanwhile, these methods target trial-averaged firing rates and fail to capture\nimportant features in spike trains. This work presents the temporal\nconditioning spiking latent variable models (TeCoS-LVM) to simulate the neural\nresponse to natural visual stimuli. We use spiking neurons to produce spike\noutputs that directly match the recorded trains. This approach helps to avoid\nlosing information embedded in the original spike trains. We exclude the\ntemporal dimension from the model parameter space and introduce a temporal\nconditioning operation to allow the model to adaptively explore and exploit\ntemporal dependencies in stimuli sequences in a {\\it natural paradigm}. We show\nthat TeCoS-LVM models can produce more realistic spike activities and\naccurately fit spike statistics than powerful alternatives. Additionally,\nlearned TeCoS-LVM models can generalize well to longer time scales. Overall,\nwhile remaining computationally tractable, our model effectively captures key\nfeatures of neural coding systems. It thus provides a useful tool for building\naccurate predictive computational accounts for various sensory perception\ncircuits.\n","authors":["Gehua Ma","Runhao Jiang","Rui Yan","Huajin Tang"],"pdf_url":"https://arxiv.org/pdf/2306.12045v6.pdf","comment":"Accepted at NeurIPS 2023\n (https://openreview.net/forum?id=V4YeOvsQfu). 22 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2110.02473v4","updated":"2023-12-20T03:59:21Z","published":"2021-10-06T03:10:28Z","title":"The Power of Contrast for Feature Learning: A Theoretical Analysis","summary":" Contrastive learning has achieved state-of-the-art performance in various\nself-supervised learning tasks and even outperforms its supervised counterpart.\nDespite its empirical success, theoretical understanding of the superiority of\ncontrastive learning is still limited. In this paper, under linear\nrepresentation settings, (i) we provably show that contrastive learning\noutperforms the standard autoencoders and generative adversarial networks, two\nclassical generative unsupervised learning methods, for both feature recovery\nand in-domain downstream tasks; (ii) we also illustrate the impact of labeled\ndata in supervised contrastive learning. This provides theoretical support for\nrecent findings that contrastive learning with labels improves the performance\nof learned representations in the in-domain downstream task, but it can harm\nthe performance in transfer learning. We verify our theory with numerical\nexperiments.\n","authors":["Wenlong Ji","Zhun Deng","Ryumei Nakada","James Zou","Linjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2110.02473v4.pdf","comment":"78 pages, accepted by JMLR"},{"id":"http://arxiv.org/abs/2312.12747v1","updated":"2023-12-20T03:44:18Z","published":"2023-12-20T03:44:18Z","title":"ALMANACS: A Simulatability Benchmark for Language Model Explainability","summary":" How do we measure the efficacy of language model explainability methods?\nWhile many explainability methods have been developed, they are typically\nevaluated on bespoke tasks, preventing an apples-to-apples comparison. To help\nfill this gap, we present ALMANACS, a language model explainability benchmark.\nALMANACS scores explainability methods on simulatability, i.e., how well the\nexplanations improve behavior prediction on new inputs. The ALMANACS scenarios\nspan twelve safety-relevant topics such as ethical reasoning and advanced AI\nbehaviors; they have idiosyncratic premises to invoke model-specific behavior;\nand they have a train-test distributional shift to encourage faithful\nexplanations. By using another language model to predict behavior based on the\nexplanations, ALMANACS is a fully automated benchmark. We use ALMANACS to\nevaluate counterfactuals, rationalizations, attention, and Integrated Gradients\nexplanations. Our results are sobering: when averaged across all topics, no\nexplanation method outperforms the explanation-free control. We conclude that\ndespite modest successes in prior work, developing an explanation method that\naids simulatability in ALMANACS remains an open challenge.\n","authors":["Edmund Mills","Shiye Su","Stuart Russell","Scott Emmons"],"pdf_url":"https://arxiv.org/pdf/2312.12747v1.pdf","comment":"Code is available at\n https://github.com/edmundmills/ALMANACS}{https://github.com/edmundmills/ALMANACS"},{"id":"http://arxiv.org/abs/2312.12744v1","updated":"2023-12-20T03:38:24Z","published":"2023-12-20T03:38:24Z","title":"3D-CLMI: A Motor Imagery EEG Classification Model via Fusion of 3D-CNN\n and LSTM with Attention","summary":" Due to the limitations in the accuracy and robustness of current\nelectroencephalogram (EEG) classification algorithms, applying motor imagery\n(MI) for practical Brain-Computer Interface (BCI) applications remains\nchallenging. This paper proposed a model that combined a three-dimensional\nconvolutional neural network (CNN) with a long short-term memory (LSTM) network\nwith attention to classify MI-EEG signals. This model combined MI-EEG signals\nfrom different channels into three-dimensional features and extracted spatial\nfeatures through convolution operations with multiple three-dimensional\nconvolutional kernels of different scales. At the same time, to ensure the\nintegrity of the extracted MI-EEG signal temporal features, the LSTM network\nwas directly trained on the preprocessed raw signal. Finally, the features\nobtained from these two networks were combined and used for classification.\nExperimental results showed that this model achieved a classification accuracy\nof 92.7% and an F1-score of 0.91 on the public dataset BCI Competition IV\ndataset 2a, which were both higher than the state-of-the-art models in the\nfield of MI tasks. Additionally, 12 participants were invited to complete a\nfour-class MI task in our lab, and experiments on the collected dataset showed\nthat the 3D-CLMI model also maintained the highest classification accuracy and\nF1-score. The model greatly improved the classification accuracy of users'\nmotor imagery intentions, giving brain-computer interfaces better application\nprospects in emerging fields such as autonomous vehicles and medical\nrehabilitation.\n","authors":["Shiwei Cheng","Yuejiang Hao"],"pdf_url":"https://arxiv.org/pdf/2312.12744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12430v2","updated":"2023-12-20T03:33:54Z","published":"2023-12-19T18:56:52Z","title":"Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP","summary":" We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel\ntitle reranking technique to achieve efficient title reranking 20x-40x faster\nthan vanilla passage reranker. However, one of the challenges with the training\nof Efficient Title Reranker is the instability. Analyzing the issue, we found\nsome very difficult ground truths might act as noisy labels causing accuracy to\ndrop as well as some extreme values in model probability output causing nan. To\naddress these issues, we introduce the Sigmoid Trick, a novel technique that\nreduces the gradient update of both cases resulting in better retrieval\nefficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we\nachieved four state-of-the-art positions on the kilt knowledge benchmark.\n","authors":["Ziyi Chen","Heyi Tao","Daqian Zuo","Jize Jiang","Jun Yang","Yuxiang Wei"],"pdf_url":"https://arxiv.org/pdf/2312.12430v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12741v1","updated":"2023-12-20T03:28:49Z","published":"2023-12-20T03:28:49Z","title":"Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed\n Gaussian Bandits with Unknown Variances","summary":" We address the problem of best arm identification (BAI) with a fixed budget\nfor two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the\nbest arm, an arm with the highest expected reward, through an adaptive\nexperiment. Kaufmann et al. (2016) develops a lower bound for the probability\nof misidentifying the best arm. They also propose a strategy, assuming that the\nvariances of rewards are known, and show that it is asymptotically optimal in\nthe sense that its probability of misidentification matches the lower bound as\nthe budget approaches infinity. However, an asymptotically optimal strategy is\nunknown when the variances are unknown. For this open issue, we propose a\nstrategy that estimates variances during an adaptive experiment and draws arms\nwith a ratio of the estimated standard deviations. We refer to this strategy as\nthe Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)\nstrategy. We then demonstrate that this strategy is asymptotically optimal by\nshowing that its probability of misidentification matches the lower bound when\nthe budget approaches infinity, and the gap between the expected rewards of two\narms approaches zero (small-gap regime). Our results suggest that under the\nworst-case scenario characterized by the small-gap regime, our strategy, which\nemploys estimated variance, is asymptotically optimal even when the variances\nare unknown.\n","authors":["Masahiro Kato"],"pdf_url":"https://arxiv.org/pdf/2312.12741v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12737v1","updated":"2023-12-20T03:18:56Z","published":"2023-12-20T03:18:56Z","title":"FSscore: A Machine Learning-based Synthetic Feasibility Score Leveraging\n Human Expertise","summary":" Determining whether a molecule can be synthesized is crucial for many aspects\nof chemistry and drug discovery, allowing prioritization of experimental work\nand ranking molecules in de novo design tasks. Existing scoring approaches to\nassess synthetic feasibility struggle to extrapolate to out-of-distribution\nchemical spaces or fail to discriminate based on minor differences such as\nchirality that might be obvious to trained chemists. This work aims to address\nthese limitations by introducing the Focused Synthesizability score (FSscore),\nwhich learns to rank structures based on binary preferences using a graph\nattention network. First, a baseline trained on an extensive set of\nreactant-product pairs is established that subsequently is fine-tuned with\nexpert human feedback on a chemical space of interest. Fine-tuning on focused\ndatasets improves performance on these chemical scopes over the pre-trained\nmodel exhibiting moderate performance and generalizability. This enables\ndistinguishing hard- from easy-to-synthesize molecules and improving the\nsynthetic accessibility of generative model outputs. On very complex scopes\nwith limited labels achieving satisfactory gains remains challenging. The\nFSscore showcases how human expert feedback can be utilized to optimize the\nassessment of synthetic feasibility for a variety of applications.\n","authors":["Rebecca M. Neeser","Bruno Correia","Philippe Schwaller"],"pdf_url":"https://arxiv.org/pdf/2312.12737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12736v1","updated":"2023-12-20T03:18:50Z","published":"2023-12-20T03:18:50Z","title":"Learning and Forgetting Unsafe Examples in Large Language Models","summary":" As the number of large language models (LLMs) released to the public grows,\nthere is a pressing need to understand the safety implications associated with\nthese models learning from third-party custom finetuning data. We explore the\nbehavior of LLMs finetuned on noisy custom data containing unsafe content,\nrepresented by datasets that contain biases, toxicity, and harmfulness, finding\nthat while aligned LLMs can readily learn this unsafe content, they also tend\nto forget it more significantly than other examples when subsequently finetuned\non safer content. Drawing inspiration from the discrepancies in forgetting, we\nintroduce the \"ForgetFilter\" algorithm, which filters unsafe data based on how\nstrong the model's forgetting signal is for that data. We demonstrate that the\nForgetFilter algorithm ensures safety in customized finetuning without\ncompromising downstream task performance, unlike sequential safety finetuning.\nForgetFilter outperforms alternative strategies like replay and moral\nself-correction in curbing LLMs' ability to assimilate unsafe content during\ncustom finetuning, e.g. 75% lower than not applying any safety measures and 62%\nlower than using self-correction in toxicity score.\n","authors":["Jiachen Zhao","Zhun Deng","David Madras","James Zou","Mengye Ren"],"pdf_url":"https://arxiv.org/pdf/2312.12736v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08742v4","updated":"2023-12-20T03:16:09Z","published":"2023-08-17T02:33:43Z","title":"PMET: Precise Model Editing in a Transformer","summary":" Model editing techniques modify a minor proportion of knowledge in Large\nLanguage Models (LLMs) at a relatively low cost, which have demonstrated\nnotable success. Existing methods assume Transformer Layer (TL) hidden states\nare values of key-value memories of the Feed-Forward Network (FFN). They\nusually optimize the TL hidden states to memorize target knowledge and use it\nto update the weights of the FFN in LLMs. However, the information flow of TL\nhidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN,\nand residual connections. Existing methods neglect the fact that the TL hidden\nstates contains information not specifically required for FFN. Consequently,\nthe performance of model editing decreases. To achieve more precise model\nediting, we analyze hidden states of MHSA and FFN, finding that MHSA encodes\ncertain general knowledge extraction patterns. This implies that MHSA weights\ndo not require updating when new knowledge is introduced. Based on above\nfindings, we introduce PMET, which simultaneously optimizes Transformer\nComponent (TC, namely MHSA and FFN) hidden states, while only using the\noptimized TC hidden states of FFN to precisely update FFN weights. Our\nexperiments demonstrate that PMET exhibits state-of-the-art performance on both\nthe COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the\neffectiveness of our enhancements, further reinforcing the finding that the\nMHSA encodes certain general knowledge extraction patterns and indicating its\nstorage of a small amount of factual knowledge. Our code is available at\nhttps://github.com/xpq-tech/PMET.\n","authors":["Xiaopeng Li","Shasha Li","Shezheng Song","Jing Yang","Jun Ma","Jie Yu"],"pdf_url":"https://arxiv.org/pdf/2308.08742v4.pdf","comment":"Accepted in AAAI24"},{"id":"http://arxiv.org/abs/2306.10982v2","updated":"2023-12-20T03:03:25Z","published":"2023-06-19T14:44:34Z","title":"Differentially Private Over-the-Air Federated Learning Over MIMO Fading\n Channels","summary":" Federated learning (FL) enables edge devices to collaboratively train machine\nlearning models, with model communication replacing direct data uploading.\nWhile over-the-air model aggregation improves communication efficiency,\nuploading models to an edge server over wireless networks can pose privacy\nrisks. Differential privacy (DP) is a widely used quantitative technique to\nmeasure statistical data privacy in FL. Previous research has focused on\nover-the-air FL with a single-antenna server, leveraging communication noise to\nenhance user-level DP. This approach achieves the so-called \"free DP\" by\ncontrolling transmit power rather than introducing additional DP-preserving\nmechanisms at devices, such as adding artificial noise. In this paper, we study\ndifferentially private over-the-air FL over a multiple-input multiple-output\n(MIMO) fading channel. We show that FL model communication with a\nmultiple-antenna server amplifies privacy leakage as the multiple-antenna\nserver employs separate receive combining for model aggregation and information\ninference. Consequently, relying solely on communication noise, as done in the\nmultiple-input single-output system, cannot meet high privacy requirements, and\na device-side privacy-preserving mechanism is necessary for optimal DP design.\nWe analyze the learning convergence and privacy loss of the studied FL system\nand propose a transceiver design algorithm based on alternating optimization.\nNumerical results demonstrate that the proposed method achieves a better\nprivacy-learning trade-off compared to prior work.\n","authors":["Hang Liu","Jia Yan","Ying-Jun Angela Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.10982v2.pdf","comment":"This work has been accepted by the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2312.12731v1","updated":"2023-12-20T03:03:06Z","published":"2023-12-20T03:03:06Z","title":"Robustly Improving Bandit Algorithms with Confounded and Selection\n Biased Offline Data: A Causal Approach","summary":" This paper studies bandit problems where an agent has access to offline data\nthat might be utilized to potentially improve the estimation of each arm's\nreward distribution. A major obstacle in this setting is the existence of\ncompound biases from the observational data. Ignoring these biases and blindly\nfitting a model with the biased data could even negatively affect the online\nlearning phase. In this work, we formulate this problem from a causal\nperspective. First, we categorize the biases into confounding bias and\nselection bias based on the causal structure they imply. Next, we extract the\ncausal bound for each arm that is robust towards compound biases from biased\nobservational data. The derived bounds contain the ground truth mean reward and\ncan effectively guide the bandit agent to learn a nearly-optimal decision\npolicy. We also conduct regret analysis in both contextual and non-contextual\nbandit settings and show that prior causal bounds could help consistently\nreduce the asymptotic regret.\n","authors":["Wen Huang","Xintao Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12731v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12728v1","updated":"2023-12-20T02:55:15Z","published":"2023-12-20T02:55:15Z","title":"Lookahead: An Inference Acceleration Framework for Large Language Model\n with Lossless Generation Accuracy","summary":" As Large Language Models (LLMs) have made significant advancements across\nvarious tasks, such as question answering, translation, text summarization, and\ndialogue systems, the need for accuracy in information becomes crucial,\nespecially for serious financial products serving billions of users like\nAlipay. To address this, Alipay has developed a Retrieval-Augmented Generation\n(RAG) system that grounds LLMs on the most accurate and up-to-date information.\nHowever, for a real-world product serving millions of users, the inference\nspeed of LLMs becomes a critical factor compared to a mere experimental model.\n Hence, this paper presents a generic framework for accelerating the inference\nprocess, resulting in a substantial increase in speed and cost reduction for\nour RAG system, with lossless generation accuracy. In the traditional inference\nprocess, each token is generated sequentially by the LLM, leading to a time\nconsumption proportional to the number of generated tokens. To enhance this\nprocess, our framework, named \\textit{lookahead}, introduces a\n\\textit{multi-branch} strategy. Instead of generating a single token at a time,\nwe propose a \\textit{Trie-based Retrieval} (TR) process that enables the\ngeneration of multiple branches simultaneously, each of which is a sequence of\ntokens. Subsequently, for each branch, a \\textit{Verification and Accept} (VA)\nprocess is performed to identify the longest correct sub-sequence as the final\noutput. Our strategy offers two distinct advantages: (1) it guarantees absolute\ncorrectness of the output, avoiding any approximation algorithms, and (2) the\nworst-case performance of our approach is equivalent to the conventional\nprocess. We conduct extensive experiments to demonstrate the significant\nimprovements achieved by applying our inference acceleration framework.\n","authors":["Yao Zhao","Zhitian Xie","Chenyi Zhuang","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2312.12728v1.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2311.16135v2","updated":"2023-12-20T02:47:02Z","published":"2023-11-03T00:12:24Z","title":"Use of Deep Neural Networks for Uncertain Stress Functions with\n Extensions to Impact Mechanics","summary":" Stress-strain curves, or more generally, stress functions, are an extremely\nimportant characterization of a material's mechanical properties. However,\nstress functions are often difficult to derive and are narrowly tailored to a\nspecific material. Further, large deformations, high strain-rates, temperature\nsensitivity, and effect of material parameters compound modeling challenges. We\npropose a generalized deep neural network approach to model stress as a state\nfunction with quantile regression to capture uncertainty. We extend these\nmodels to uniaxial impact mechanics using stochastic differential equations to\ndemonstrate a use case and provide a framework for implementing this\nuncertainty-aware stress function. We provide experiments benchmarking our\napproach against leading constitutive, machine learning, and transfer learning\napproaches to stress and impact mechanics modeling on publicly available and\nnewly presented data sets. We also provide a framework to optimize material\nparameters given multiple competing impact scenarios.\n","authors":["Garrett Blum","Ryan Doris","Diego Klabjan","Horacio Espinosa","Ron Szalkowski"],"pdf_url":"https://arxiv.org/pdf/2311.16135v2.pdf","comment":"Index Terms: Stress, Uncertainty, Impact Mechanics, Deep Learning,\n Neural Network. 10 pages, 9 figures, 6 tables"},{"id":"http://arxiv.org/abs/2312.12724v1","updated":"2023-12-20T02:40:28Z","published":"2023-12-20T02:40:28Z","title":"Progressive Poisoned Data Isolation for Training-time Backdoor Defense","summary":" Deep Neural Networks (DNN) are susceptible to backdoor attacks where\nmalicious attackers manipulate the model's predictions via data poisoning. It\nis hence imperative to develop a strategy for training a clean model using a\npotentially poisoned dataset. Previous training-time defense mechanisms\ntypically employ an one-time isolation process, often leading to suboptimal\nisolation outcomes. In this study, we present a novel and efficacious defense\nmethod, termed Progressive Isolation of Poisoned Data (PIPD), that\nprogressively isolates poisoned data to enhance the isolation accuracy and\nmitigate the risk of benign samples being misclassified as poisoned ones. Once\nthe poisoned portion of the dataset has been identified, we introduce a\nselective training process to train a clean model. Through the implementation\nof these techniques, we ensure that the trained model manifests a significantly\ndiminished attack success rate against the poisoned data. Extensive experiments\non multiple benchmark datasets and DNN models, assessed against nine\nstate-of-the-art backdoor attacks, demonstrate the superior performance of our\nPIPD method for backdoor defense. For instance, our PIPD achieves an average\nTrue Positive Rate (TPR) of 99.95% and an average False Positive Rate (FPR) of\n0.06% for diverse attacks over CIFAR-10 dataset, markedly surpassing the\nperformance of state-of-the-art methods.\n","authors":["Yiming Chen","Haiwei Wu","Jiantao Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.12724v1.pdf","comment":"Accepted to AAAI2024"},{"id":"http://arxiv.org/abs/2308.08198v2","updated":"2023-12-20T02:31:25Z","published":"2023-08-16T07:58:02Z","title":"DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting","summary":" We introduce DeSCo, a scalable neural deep subgraph counting pipeline,\ndesigned to accurately predict both the count and occurrence position of\nqueries on target graphs post single training. Firstly, DeSCo uses a novel\ncanonical partition and divides the large target graph into small neighborhood\ngraphs, greatly reducing the count variation while guaranteeing no missing or\ndouble-counting. Secondly, neighborhood counting uses an expressive\nsubgraph-based heterogeneous graph neural network to accurately count in each\nneighborhood. Finally, gossip propagation propagates neighborhood counts with\nlearnable gates to harness the inductive biases of motif counts. DeSCo is\nevaluated on eight real-world datasets from various domains. It outperforms\nstate-of-the-art neural methods with 137x improvement in the mean squared error\nof count prediction, while maintaining the polynomial runtime complexity. Our\nopen source project is at https://github.com/fuvty/DeSCo.\n","authors":["Tianyu Fu","Chiyue Wei","Yu Wang","Rex Ying"],"pdf_url":"https://arxiv.org/pdf/2308.08198v2.pdf","comment":"8 pages main text, 2 pages references, 11 pages appendix; open source\n at https://github.com/fuvty/DeSCo"},{"id":"http://arxiv.org/abs/2312.12717v1","updated":"2023-12-20T02:22:54Z","published":"2023-12-20T02:22:54Z","title":"DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS\n Channel and DNA Storage","summary":" Recently, DNA storage has emerged as a promising data storage solution,\noffering significant advantages in storage density, maintenance cost\nefficiency, and parallel replication capability. Mathematically, the DNA\nstorage pipeline can be viewed as an insertion, deletion, and substitution\n(IDS) channel. Because of the mathematical terra incognita of the Levenshtein\ndistance, designing an IDS-correcting code is still a challenge. In this paper,\nwe propose an innovative approach that utilizes deep Levenshtein distance\nembedding to bypass these mathematical challenges. By representing the\nLevenshtein distance between two sequences as a conventional distance between\ntheir corresponding embedding vectors, the inherent structural property of\nLevenshtein distance is revealed in the friendly embedding space. Leveraging\nthis embedding space, we introduce the DoDo-Code, an IDS-correcting code that\nincorporates deep embedding of Levenshtein distance, deep embedding-based\ncodeword search, and deep embedding-based segment correcting. To address the\nrequirements of DNA storage, we also present a preliminary algorithm for long\nsequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting\ncode designed using plausible deep learning methodologies, potentially paving\nthe way for a new direction in error-correcting code research. It is also the\nfirst IDS code that exhibits characteristics of being `optimal' in terms of\nredundancy, significantly outperforming the mainstream IDS-correcting codes of\nthe Varshamov-Tenengolts code family in code rate.\n","authors":["Alan J. X. Guo","Sihan Sun","Xiang Wei","Mengyi Wei","Xin Chen"],"pdf_url":"https://arxiv.org/pdf/2312.12717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12716v1","updated":"2023-12-20T02:22:49Z","published":"2023-12-20T02:22:49Z","title":"BloomVQA: Assessing Hierarchical Multi-modal Comprehension","summary":" We propose a novel VQA dataset, based on picture stories designed for\neducating young children, that aims to facilitate comprehensive evaluation and\ncharacterization of vision-language models on comprehension tasks. Unlike\ncurrent VQA datasets that often focus on fact-based memorization and simple\nreasoning tasks without principled scientific grounding, we collect data\ncontaining tasks reflecting different levels of comprehension and underlying\ncognitive processes, as laid out in Bloom's Taxonomy, a classic framework\nwidely adopted in education research. The proposed BloomVQA dataset can be\nmapped to a hierarchical graph-based representation of visual stories, enabling\nautomatic data augmentation and novel measures characterizing model consistency\nacross the underlying taxonomy. We demonstrate graded evaluation and\nreliability analysis based on our proposed consistency metrics on\nstate-of-the-art vision-language models. Our results suggest that, while\ncurrent models achieve the most gain on low-level comprehension tasks, they\ngenerally fall short on high-level tasks requiring more advanced comprehension\nand cognitive skills, as 38.0% drop in VQA accuracy is observed comparing\nlowest and highest level tasks. Furthermore, current models show consistency\npatterns misaligned with human comprehension in various scenarios, suggesting\nemergent structures of model behaviors.\n","authors":["Yunye Gong","Robik Shrestha","Jared Claypoole","Michael Cogswell","Arijit Ray","Christopher Kanan","Ajay Divakaran"],"pdf_url":"https://arxiv.org/pdf/2312.12716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12715v1","updated":"2023-12-20T02:21:26Z","published":"2023-12-20T02:21:26Z","title":"Learning Performance Maximizing Ensembles with Explainability Guarantees","summary":" In this paper we propose a method for the optimal allocation of observations\nbetween an intrinsically explainable glass box model and a black box model. An\noptimal allocation being defined as one which, for any given explainability\nlevel (i.e. the proportion of observations for which the explainable model is\nthe prediction function), maximizes the performance of the ensemble on the\nunderlying task, and maximizes performance of the explainable model on the\nobservations allocated to it, subject to the maximal ensemble performance\ncondition. The proposed method is shown to produce such explainability optimal\nallocations on a benchmark suite of tabular datasets across a variety of\nexplainable and black box model types. These learned allocations are found to\nconsistently maintain ensemble performance at very high explainability levels\n(explaining $74\\%$ of observations on average), and in some cases even\noutperforming both the component explainable and black box models while\nimproving explainability.\n","authors":["Vincent Pisztora","Jia Li"],"pdf_url":"https://arxiv.org/pdf/2312.12715v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08511v3","updated":"2023-12-20T02:21:20Z","published":"2023-08-16T17:07:40Z","title":"Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse\n Problems","summary":" Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial\ntechnologies in the field of medical imaging. Score-based models have proven to\nbe effective in addressing different inverse problems encountered in CT and\nMRI, such as sparse-view CT and fast MRI reconstruction. However, these models\nface challenges in achieving accurate three dimensional (3D) volumetric\nreconstruction. The existing score-based models primarily focus on\nreconstructing two dimensional (2D) data distribution, leading to\ninconsistencies between adjacent slices in the reconstructed 3D volumetric\nimages. To overcome this limitation, we propose a novel two-and-a-half order\nscore-based model (TOSM). During the training phase, our TOSM learns data\ndistributions in 2D space, which reduces the complexity of training compared to\ndirectly working on 3D volumes. However, in the reconstruction phase, the TOSM\nupdates the data distribution in 3D space, utilizing complementary scores along\nthree directions (sagittal, coronal, and transaxial) to achieve a more precise\nreconstruction. The development of TOSM is built on robust theoretical\nprinciples, ensuring its reliability and efficacy. Through extensive\nexperimentation on large-scale sparse-view CT and fast MRI datasets, our method\ndemonstrates remarkable advancements and attains state-of-the-art results in\nsolving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively\naddresses the inter-slice inconsistency issue, resulting in high-quality 3D\nvolumetric reconstruction.\n","authors":["Zirong Li","Yanyang Wang","Jianjia Zhang","Weiwen Wu","Hengyong Yu"],"pdf_url":"https://arxiv.org/pdf/2308.08511v3.pdf","comment":"10 pages, 13 figures"},{"id":"http://arxiv.org/abs/2312.11489v2","updated":"2023-12-20T02:15:23Z","published":"2023-12-01T06:18:45Z","title":"Agglomerative Federated Learning: Empowering Larger Model Training via\n End-Edge-Cloud Collaboration","summary":" Federated Learning (FL) enables training Artificial Intelligence (AI) models\nover end devices without compromising their privacy. As computing tasks are\nincreasingly performed by a combination of cloud, edge, and end devices, FL can\nbenefit from this End-Edge-Cloud Collaboration (EECC) paradigm to achieve\ncollaborative device-scale expansion with real-time access. Although\nHierarchical Federated Learning (HFL) supports multi-tier model aggregation\nsuitable for EECC, prior works assume the same model structure on all computing\nnodes, constraining the model scale by the weakest end devices. To address this\nissue, we propose Agglomerative Federated Learning (FedAgg), which is a novel\nEECC-empowered FL framework that allows the trained models from end, edge, to\ncloud to grow larger in size and stronger in generalization ability. FedAgg\nrecursively organizes computing nodes among all tiers based on Bridge Sample\nBased Online Distillation Protocol (BSBODP), which enables every pair of\nparent-child computing nodes to mutually transfer and distill knowledge\nextracted from generated bridge samples. This design enhances the performance\nby exploiting the potential of larger models, with privacy constraints of FL\nand flexibility requirements of EECC both satisfied. Experiments under various\nsettings demonstrate that FedAgg outperforms state-of-the-art methods by an\naverage of 4.53\\% accuracy gains and remarkable improvements in convergence\nrate.\n","authors":["Zhiyuan Wu","Sheng Sun","Yuwei Wang","Min Liu","Bo Gao","Quyang Pan","Tianliu He","Xuefeng Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.11489v2.pdf","comment":"Accepted by IEEE International Conference on Computer Communications\n (INFOCOM), 2024"},{"id":"http://arxiv.org/abs/2312.12703v1","updated":"2023-12-20T01:59:48Z","published":"2023-12-20T01:59:48Z","title":"Federated Learning with Extremely Noisy Clients via Negative\n Distillation","summary":" Federated learning (FL) has shown remarkable success in cooperatively\ntraining deep models, while typically struggling with noisy labels. Advanced\nworks propose to tackle label noise by a re-weighting strategy with a strong\nassumption, i.e., mild label noise. However, it may be violated in many\nreal-world FL scenarios because of highly contaminated clients, resulting in\nextreme noise ratios, e.g., $>$90%. To tackle extremely noisy clients, we study\nthe robustness of the re-weighting strategy, showing a pessimistic conclusion:\nminimizing the weight of clients trained over noisy data outperforms\nre-weighting strategies. To leverage models trained on noisy clients, we\npropose a novel approach, called negative distillation (FedNed). FedNed first\nidentifies noisy clients and employs rather than discards the noisy clients in\na knowledge distillation manner. In particular, clients identified as noisy\nones are required to train models using noisy labels and pseudo-labels obtained\nby global models. The model trained on noisy labels serves as a `bad teacher'\nin knowledge distillation, aiming to decrease the risk of providing incorrect\ninformation. Meanwhile, the model trained on pseudo-labels is involved in model\naggregation if not identified as a noisy client. Consequently, through\npseudo-labeling, FedNed gradually increases the trustworthiness of models\ntrained on noisy clients, while leveraging all clients for model aggregation\nthrough negative distillation. To verify the efficacy of FedNed, we conduct\nextensive experiments under various settings, demonstrating that FedNed can\nconsistently outperform baselines and achieve state-of-the-art performance. Our\ncode is available at https://github.com/linChen99/FedNed.\n","authors":["Yang Lu","Lin Chen","Yonggang Zhang","Yiliang Zhang","Bo Han","Yiu-ming Cheung","Hanzi Wang"],"pdf_url":"https://arxiv.org/pdf/2312.12703v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.12697v1","updated":"2023-12-20T01:43:55Z","published":"2023-12-20T01:43:55Z","title":"DGCLUSTER: A Neural Framework for Attributed Graph Clustering via\n Modularity Maximization","summary":" Graph clustering is a fundamental and challenging task in the field of graph\nmining where the objective is to group the nodes into clusters taking into\nconsideration the topology of the graph. It has several applications in diverse\ndomains spanning social network analysis, recommender systems, computer vision,\nand bioinformatics. In this work, we propose a novel method, DGCluster, which\nprimarily optimizes the modularity objective using graph neural networks and\nscales linearly with the graph size. Our method does not require the number of\nclusters to be specified as a part of the input and can also leverage the\navailability of auxiliary node level information. We extensively test DGCluster\non several real-world datasets of varying sizes, across multiple popular\ncluster quality metrics. Our approach consistently outperforms the\nstate-of-the-art methods, demonstrating significant performance gains in almost\nall settings.\n","authors":["Aritra Bhowmick","Mert Kosan","Zexi Huang","Ambuj Singh","Sourav Medya"],"pdf_url":"https://arxiv.org/pdf/2312.12697v1.pdf","comment":"Accepted to AAAI'24"},{"id":"http://arxiv.org/abs/2312.12691v1","updated":"2023-12-20T01:29:11Z","published":"2023-12-20T01:29:11Z","title":"How Good Are Deep Generative Models for Solving Inverse Problems?","summary":" Deep generative models, such as diffusion models, GANs, and IMLE, have shown\nimpressive capability in tackling inverse problems. However, the validity of\nmodel-generated solutions w.r.t. the forward problem and the reliability of\nassociated uncertainty estimates remain understudied. This study evaluates\nrecent diffusion-based, GAN-based, and IMLE-based methods on three inverse\nproblems, i.e., $16\\times$ super-resolution, colourization, and image\ndecompression. We assess the validity of these models' outputs as solutions to\nthe inverse problems and conduct a thorough analysis of the reliability of the\nmodels' estimates of uncertainty over the solution. Overall, we find that the\nIMLE-based CHIMLE method outperforms other methods in terms of producing valid\nsolutions and reliable uncertainty estimates.\n","authors":["Shichong Peng","Alireza Moazeni","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2312.12691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12492v1","updated":"2023-12-20T01:20:24Z","published":"2023-12-20T01:20:24Z","title":"CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data\n and Language Models of Code","summary":" Motivated by recent work on lifelong learning applications for language\nmodels (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused\non code changes. Our contribution addresses a notable research gap marked by\nthe absence of a long-term temporal dimension in existing code change datasets,\nlimiting their suitability in lifelong learning scenarios. In contrast, our\ndataset aims to comprehensively capture code changes across the entire release\nhistory of open-source software repositories. In this work, we introduce an\ninitial version of CodeLL, comprising 71 machine-learning-based projects mined\nfrom Software Heritage. This dataset enables the extraction and in-depth\nanalysis of code changes spanning 2,483 releases at both the method and API\nlevels. CodeLL enables researchers studying the behaviour of LMs in lifelong\nfine-tuning settings for learning code changes. Additionally, the dataset can\nhelp studying data distribution shifts within software repositories and the\nevolution of API usages over time.\n","authors":["Martin Weyssow","Claudio Di Sipio","Davide Di Ruscio","Houari Sahraoui"],"pdf_url":"https://arxiv.org/pdf/2312.12492v1.pdf","comment":"4+1 pages"},{"id":"http://arxiv.org/abs/2306.03410v3","updated":"2023-12-20T00:46:16Z","published":"2023-06-06T05:17:02Z","title":"Learning to Simulate Tree-Branch Dynamics for Manipulation","summary":" We propose to use a simulation driven inverse inference approach to model the\ndynamics of tree branches under manipulation. Learning branch dynamics and\ngaining the ability to manipulate deformable vegetation can help with\nocclusion-prone tasks, such as fruit picking in dense foliage, as well as\nmoving overhanging vines and branches for navigation in dense vegetation. The\nunderlying deformable tree geometry is encapsulated as coarse spring\nabstractions executed on parallel, non-differentiable simulators. The implicit\nstatistical model defined by the simulator, reference trajectories obtained by\nactively probing the ground truth, and the Bayesian formalism, together guide\nthe spring parameter posterior density estimation. Our non-parametric inference\nalgorithm, based on Stein Variational Gradient Descent, incorporates\nbiologically motivated assumptions into the inference process as neural network\ndriven learnt joint priors; moreover, it leverages the finite difference scheme\nfor gradient approximations. Real and simulated experiments confirm that our\nmodel can predict deformation trajectories, quantify the estimation\nuncertainty, and it can perform better when base-lined against other inference\nalgorithms, particularly from the Monte Carlo family. The model displays strong\nrobustness properties in the presence of heteroscedastic sensor noise;\nfurthermore, it can generalise to unseen grasp locations.\n","authors":["Jayadeep Jacob","Tirthankar Bandyopadhyay","Jason Williams","Paulo Borges","Fabio Ramos"],"pdf_url":"https://arxiv.org/pdf/2306.03410v3.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.12679v1","updated":"2023-12-20T00:43:13Z","published":"2023-12-20T00:43:13Z","title":"Towards Efficient Verification of Quantized Neural Networks","summary":" Quantization replaces floating point arithmetic with integer arithmetic in\ndeep neural network models, providing more efficient on-device inference with\nless power and memory. In this work, we propose a framework for formally\nverifying properties of quantized neural networks. Our baseline technique is\nbased on integer linear programming which guarantees both soundness and\ncompleteness. We then show how efficiency can be improved by utilizing\ngradient-based heuristic search methods and also bound-propagation techniques.\nWe evaluate our approach on perception networks quantized with PyTorch. Our\nresults show that we can verify quantized networks with better scalability and\nefficiency than the previous state of the art.\n","authors":["Pei Huang","Haoze Wu","Yuting Yang","Ieva Daukantas","Min Wu","Yedi Zhang","Clark Barrett"],"pdf_url":"https://arxiv.org/pdf/2312.12679v1.pdf","comment":"This paper has accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12678v1","updated":"2023-12-20T00:33:26Z","published":"2023-12-20T00:33:26Z","title":"Causal Discovery for fMRI data: Challenges, Solutions, and a Case Study","summary":" Designing studies that apply causal discovery requires navigating many\nresearcher degrees of freedom. This complexity is exacerbated when the study\ninvolves fMRI data. In this paper we (i) describe nine challenges that occur\nwhen applying causal discovery to fMRI data, (ii) discuss the space of\ndecisions that need to be made, (iii) review how a recent case study made those\ndecisions, (iv) and identify existing gaps that could potentially be solved by\nthe development of new methods. Overall, causal discovery is a promising\napproach for analyzing fMRI data, and multiple successful applications have\nindicated that it is superior to traditional fMRI functional connectivity\nmethods, but current causal discovery methods for fMRI leave room for\nimprovement.\n","authors":["Eric Rawls","Bryan Andrews","Kelvin Lim","Erich Kummerfeld"],"pdf_url":"https://arxiv.org/pdf/2312.12678v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12676v1","updated":"2023-12-20T00:31:43Z","published":"2023-12-20T00:31:43Z","title":"Combinatorial Gaussian Process Bandits in Bayesian Settings: Theory and\n Application for Energy-Efficient Navigation","summary":" We consider a combinatorial Gaussian process semi-bandit problem with\ntime-varying arm availability. Each round, an agent is provided a set of\navailable base arms and must select a subset of them to maximize the long-term\ncumulative reward. Assuming the expected rewards are sampled from a Gaussian\nprocess (GP) over the arm space, the agent can efficiently learn. We study the\nBayesian setting and provide novel Bayesian regret bounds for three GP-based\nalgorithms: GP-UCB, Bayes-GP-UCB and GP-TS. Our bounds extend previous results\nfor GP-UCB and GP-TS to a combinatorial setting with varying arm availability\nand to the best of our knowledge, we provide the first Bayesian regret bound\nfor Bayes-GP-UCB. Time-varying arm availability encompasses other widely\nconsidered bandit problems such as contextual bandits. We formulate the online\nenergy-efficient navigation problem as a combinatorial and contextual bandit\nand provide a comprehensive experimental study on synthetic and real-world road\nnetworks with detailed simulations. The contextual GP model obtains lower\nregret and is less dependent on the informativeness of the prior compared to\nthe non-contextual Bayesian inference model. In addition, Thompson sampling\nobtains lower regret than Bayes-UCB for both the contextual and non-contextual\nmodel.\n","authors":["Jack Sandberg","Niklas Åkerblom","Morteza Haghir Chehreghani"],"pdf_url":"https://arxiv.org/pdf/2312.12676v1.pdf","comment":"39 pages, 10 figures"},{"id":"http://arxiv.org/abs/2206.14203v3","updated":"2023-12-20T23:53:54Z","published":"2022-06-28T17:54:17Z","title":"Latent Combinational Game Design","summary":" We present latent combinational game design -- an approach for generating\nplayable games that blend a given set of games in a desired combination using\ndeep generative latent variable models. We use Gaussian Mixture Variational\nAutoencoders (GMVAEs) which model the VAE latent space via a mixture of\nGaussian components. Through supervised training, each component encodes levels\nfrom one game and lets us define blended games as linear combinations of these\ncomponents. This enables generating new games that blend the input games as\nwell as controlling the relative proportions of each game in the blend. We also\nextend prior blending work using conditional VAEs and compare against the GMVAE\nand additionally introduce a hybrid conditional GMVAE (CGMVAE) architecture\nwhich lets us generate whole blended levels and layouts. Results show that\nthese approaches can generate playable games that blend the input games in\nspecified combinations. We use both platformers and dungeon-based games to\ndemonstrate our results.\n","authors":["Anurag Sarkar","Seth Cooper"],"pdf_url":"https://arxiv.org/pdf/2206.14203v3.pdf","comment":"10 pages, 9 figures, IEEE Transactions on Games"},{"id":"http://arxiv.org/abs/2202.00824v6","updated":"2023-12-20T23:49:03Z","published":"2022-02-02T00:33:09Z","title":"KSD Aggregated Goodness-of-fit Test","summary":" We investigate properties of goodness-of-fit tests based on the Kernel Stein\nDiscrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg,\nwhich aggregates multiple tests with different kernels. KSDAgg avoids splitting\nthe data to perform kernel selection (which leads to a loss in test power), and\nrather maximises the test power over a collection of kernels. We provide\nnon-asymptotic guarantees on the power of KSDAgg: we show it achieves the\nsmallest uniform separation rate of the collection, up to a logarithmic term.\nFor compactly supported densities with bounded model score function, we derive\nthe rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the\nminimax optimal rate over unrestricted Sobolev balls, up to an iterated\nlogarithmic term. KSDAgg can be computed exactly in practice as it relies\neither on a parametric bootstrap or on a wild bootstrap to estimate the\nquantiles and the level corrections. In particular, for the crucial choice of\nbandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such\nas median or standard deviation) or to data splitting. We find on both\nsynthetic and real-world data that KSDAgg outperforms other state-of-the-art\nquadratic-time adaptive KSD-based goodness-of-fit testing procedures.\n","authors":["Antonin Schrab","Benjamin Guedj","Arthur Gretton"],"pdf_url":"https://arxiv.org/pdf/2202.00824v6.pdf","comment":"27 pages, 3 figures, Appendices A.4 and I.4 updated"},{"id":"http://arxiv.org/abs/2312.13486v1","updated":"2023-12-20T23:45:06Z","published":"2023-12-20T23:45:06Z","title":"Meta-Learning with Versatile Loss Geometries for Fast Adaptation Using\n Mirror Descent","summary":" Utilizing task-invariant prior knowledge extracted from related tasks,\nmeta-learning is a principled framework that empowers learning a new task\nespecially when data records are limited. A fundamental challenge in\nmeta-learning is how to quickly \"adapt\" the extracted prior in order to train a\ntask-specific model within a few optimization steps. Existing approaches deal\nwith this challenge using a preconditioner that enhances convergence of the\nper-task training process. Though effective in representing locally a quadratic\ntraining loss, these simple linear preconditioners can hardly capture complex\nloss geometries. The present contribution addresses this limitation by learning\na nonlinear mirror map, which induces a versatile distance metric to enable\ncapturing and optimizing a wide range of loss geometries, hence facilitating\nthe per-task training. Numerical tests on few-shot learning datasets\ndemonstrate the superior expressiveness and convergence of the advocated\napproach.\n","authors":["Yilang Zhang","Bingcong Li","Georgios B. Giannakis"],"pdf_url":"https://arxiv.org/pdf/2312.13486v1.pdf","comment":"Accepted by 2024 IEEE International Conference on Acoustics, Speech\n and Signal Processing (ICASSP-24)"},{"id":"http://arxiv.org/abs/2312.13484v1","updated":"2023-12-20T23:38:17Z","published":"2023-12-20T23:38:17Z","title":"Bayesian Transfer Learning","summary":" Transfer learning is a burgeoning concept in statistical machine learning\nthat seeks to improve inference and/or predictive accuracy on a domain of\ninterest by leveraging data from related domains. While the term \"transfer\nlearning\" has garnered much recent interest, its foundational principles have\nexisted for years under various guises. Prior literature reviews in computer\nscience and electrical engineering have sought to bring these ideas into focus,\nprimarily surveying general methodologies and works from these disciplines.\nThis article highlights Bayesian approaches to transfer learning, which have\nreceived relatively limited attention despite their innate compatibility with\nthe notion of drawing upon prior knowledge to guide new learning tasks. Our\nsurvey encompasses a wide range of Bayesian transfer learning frameworks\napplicable to a variety of practical settings. We discuss how these methods\naddress the problem of finding the optimal information to transfer between\ndomains, which is a central question in transfer learning. We illustrate the\nutility of Bayesian transfer learning methods via a simulation study where we\ncompare performance against frequentist competitors.\n","authors":["Piotr M. Suder","Jason Xu","David B. Dunson"],"pdf_url":"https://arxiv.org/pdf/2312.13484v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13480v1","updated":"2023-12-20T23:20:09Z","published":"2023-12-20T23:20:09Z","title":"InvertibleNetworks.jl: A Julia package for scalable normalizing flows","summary":" InvertibleNetworks.jl is a Julia package designed for the scalable\nimplementation of normalizing flows, a method for density estimation and\nsampling in high-dimensional distributions. This package excels in memory\nefficiency by leveraging the inherent invertibility of normalizing flows, which\nsignificantly reduces memory requirements during backpropagation compared to\nexisting normalizing flow packages that rely on automatic differentiation\nframeworks. InvertibleNetworks.jl has been adapted for diverse applications,\nincluding seismic imaging, medical imaging, and CO2 monitoring, demonstrating\nits effectiveness in learning high-dimensional distributions.\n","authors":["Rafael Orozco","Philipp Witte","Mathias Louboutin","Ali Siahkoohi","Gabrio Rizzuti","Bas Peters","Felix J. Herrmann"],"pdf_url":"https://arxiv.org/pdf/2312.13480v1.pdf","comment":"Submitted to Journal of Open Source Software (JOSS)"},{"id":"http://arxiv.org/abs/2311.18260v3","updated":"2023-12-20T23:08:32Z","published":"2023-11-30T05:38:34Z","title":"Consensus, dissensus and synergy between clinicians and specialist\n foundation models in radiology report generation","summary":" Radiology reports are an instrumental part of modern medicine, informing key\nclinical decisions such as diagnosis and treatment. The worldwide shortage of\nradiologists, however, restricts access to expert care and imposes heavy\nworkloads, contributing to avoidable errors and delays in report delivery.\nWhile recent progress in automated report generation with vision-language\nmodels offer clear potential in ameliorating the situation, the path to\nreal-world adoption has been stymied by the challenge of evaluating the\nclinical quality of AI-generated reports. In this study, we build a\nstate-of-the-art report generation system for chest radiographs,\n$\\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation\nmodel on radiology data. To evaluate the quality of the AI-generated reports, a\ngroup of 16 certified radiologists provide detailed evaluations of AI-generated\nand human written reports for chest X-rays from an intensive care setting in\nthe United States and an inpatient setting in India. At least one radiologist\n(out of two per case) preferred the AI report to the ground truth report in\nover 60$\\%$ of cases for both datasets. Amongst the subset of AI-generated\nreports that contain errors, the most frequently cited reasons were related to\nthe location and finding, whereas for human written reports, most mistakes were\nrelated to severity and finding. This disparity suggested potential\ncomplementarity between our AI system and human experts, prompting us to\ndevelop an assistive scenario in which Flamingo-CXR generates a first-draft\nreport, which is subsequently revised by a clinician. This is the first\ndemonstration of clinician-AI collaboration for report writing, and the\nresultant reports are assessed to be equivalent or preferred by at least one\nradiologist to reports written by experts alone in 80$\\%$ of in-patient cases\nand 60$\\%$ of intensive care cases.\n","authors":["Ryutaro Tanno","David G. T. Barrett","Andrew Sellergren","Sumedh Ghaisas","Sumanth Dathathri","Abigail See","Johannes Welbl","Karan Singhal","Shekoofeh Azizi","Tao Tu","Mike Schaekermann","Rhys May","Roy Lee","SiWai Man","Zahra Ahmed","Sara Mahdavi","Yossi Matias","Joelle Barral","Ali Eslami","Danielle Belgrave","Vivek Natarajan","Shravya Shetty","Pushmeet Kohli","Po-Sen Huang","Alan Karthikesalingam","Ira Ktena"],"pdf_url":"https://arxiv.org/pdf/2311.18260v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05152v2","updated":"2023-12-20T23:06:09Z","published":"2023-11-09T05:24:20Z","title":"Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual\n Downstream Tasks","summary":" In recent years, the deployment of large-scale pre-trained models in\naudio-visual downstream tasks has yielded remarkable outcomes. However, these\nmodels, primarily trained on single-modality unconstrained datasets, still\nencounter challenges in feature extraction for multi-modal tasks, leading to\nsuboptimal performance. This limitation arises due to the introduction of\nirrelevant modality-specific information during encoding, which adversely\naffects the performance of downstream tasks. To address this challenge, this\npaper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention\nmechanism. This mechanism leverages audio and visual modalities as soft prompts\nto dynamically adjust the parameters of pre-trained models based on the current\nmulti-modal input features. Specifically, the DG-SCT module incorporates\ntrainable cross-modal interaction layers into pre-trained audio-visual\nencoders, allowing adaptive extraction of crucial information from the current\nmodality across spatial, channel, and temporal dimensions, while preserving the\nfrozen parameters of large-scale pre-trained models. Experimental evaluations\ndemonstrate that our proposed model achieves state-of-the-art results across\nmultiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our\nmodel exhibits promising performance in challenging few-shot and zero-shot\nscenarios. The source code and pre-trained models are available at\nhttps://github.com/haoyi-duan/DG-SCT.\n","authors":["Haoyi Duan","Yan Xia","Mingze Zhou","Li Tang","Jieming Zhu","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.05152v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.00586v2","updated":"2023-12-20T22:54:48Z","published":"2023-03-01T15:28:26Z","title":"FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling","summary":" Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way\nto improve top-line metrics and to outperform a larger single model. In this\nwork, we go beyond top-line metrics and instead explore the impact of\nensembling on subgroup performances. Surprisingly, we observe that even with a\nsimple homogeneous ensemble -- all the individual DNNs share the same training\nset, architecture, and design choices -- the minority group performance\ndisproportionately improves with the number of models compared to the majority\ngroup, i.e. fairness naturally emerges from ensembling. Even more surprising,\nwe find that this gain keeps occurring even when a large number of models is\nconsidered, e.g. $20$, despite the fact that the average performance of the\nensemble plateaus with fewer models. Our work establishes that simple DNN\nensembles can be a powerful tool for alleviating disparate impact from DNN\nclassifiers, thus curbing algorithmic harm. We also explore why this is the\ncase. We find that even in homogeneous ensembles, varying the sources of\nstochasticity through parameter initialization, mini-batch sampling, and\ndata-augmentation realizations, results in different fairness outcomes.\n","authors":["Wei-Yin Ko","Daniel D'souza","Karina Nguyen","Randall Balestriero","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2303.00586v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13473v1","updated":"2023-12-20T22:48:38Z","published":"2023-12-20T22:48:38Z","title":"Accuracy vs Memory Advantage in the Quantum Simulation of Stochastic\n Processes","summary":" Many inference scenarios rely on extracting relevant information from known\ndata in order to make future predictions. When the underlying stochastic\nprocess satisfies certain assumptions, there is a direct mapping between its\nexact classical and quantum simulators, with the latter asymptotically using\nless memory. Here we focus on studying whether such quantum advantage persists\nwhen those assumptions are not satisfied, and the model is doomed to have\nimperfect accuracy. By studying the trade-off between accuracy and memory\nrequirements, we show that quantum models can reach the same accuracy with less\nmemory, or alternatively, better accuracy with the same memory. Finally, we\ndiscuss the implications of this result for learning tasks.\n","authors":["Leonardo Banchi"],"pdf_url":"https://arxiv.org/pdf/2312.13473v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.14496v2","updated":"2023-12-20T22:39:23Z","published":"2023-03-25T15:06:47Z","title":"Learning with Explanation Constraints","summary":" As larger deep learning models are hard to interpret, there has been a recent\nfocus on generating explanations of these black-box models. In contrast, we may\nhave apriori explanations of how models should behave. In this paper, we\nformalize this notion as learning from explanation constraints and provide a\nlearning theoretic framework to analyze how such explanations can improve the\nlearning of our models. One may naturally ask, \"When would these explanations\nbe helpful?\" Our first key contribution addresses this question via a class of\nmodels that satisfies these explanation constraints in expectation over new\ndata. We provide a characterization of the benefits of these models (in terms\nof the reduction of their Rademacher complexities) for a canonical class of\nexplanations given by gradient information in the settings of both linear\nmodels and two layer neural networks. In addition, we provide an algorithmic\nsolution for our framework, via a variational approximation that achieves\nbetter performance and satisfies these constraints more frequently, when\ncompared to simpler augmented Lagrangian methods to incorporate these\nexplanations. We demonstrate the benefits of our approach over a large array of\nsynthetic and real-world experiments.\n","authors":["Rattana Pukdee","Dylan Sam","J. Zico Kolter","Maria-Florina Balcan","Pradeep Ravikumar"],"pdf_url":"https://arxiv.org/pdf/2303.14496v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13469v1","updated":"2023-12-20T22:36:37Z","published":"2023-12-20T22:36:37Z","title":"Neural feels with neural fields: Visuo-tactile perception for in-hand\n manipulation","summary":" To achieve human-level dexterity, robots must infer spatial awareness from\nmultimodal sensing to reason over contact interactions. During in-hand\nmanipulation of novel objects, such spatial awareness involves estimating the\nobject's pose and shape. The status quo for in-hand perception primarily\nemploys vision, and restricts to tracking a priori known objects. Moreover,\nvisual occlusion of objects in-hand is imminent during manipulation, preventing\ncurrent systems to push beyond tasks without occlusion. We combine vision and\ntouch sensing on a multi-fingered hand to estimate an object's pose and shape\nduring in-hand manipulation. Our method, NeuralFeels, encodes object geometry\nby learning a neural field online and jointly tracks it by optimizing a pose\ngraph problem. We study multimodal in-hand perception in simulation and the\nreal-world, interacting with different objects via a proprioception-driven\npolicy. Our experiments show final reconstruction F-scores of $81$% and average\npose drifts of $4.7\\,\\text{mm}$, further reduced to $2.3\\,\\text{mm}$ with known\nCAD models. Additionally, we observe that under heavy visual occlusion we can\nachieve up to $94$% improvements in tracking compared to vision-only methods.\nOur results demonstrate that touch, at the very least, refines and, at the very\nbest, disambiguates visual estimates during in-hand manipulation. We release\nour evaluation dataset of 70 experiments, FeelSight, as a step towards\nbenchmarking in this domain. Our neural representation driven by multimodal\nsensing can serve as a perception backbone towards advancing robot dexterity.\nVideos can be found on our project website\nhttps://suddhu.github.io/neural-feels/\n","authors":["Sudharshan Suresh","Haozhi Qi","Tingfan Wu","Taosha Fan","Luis Pineda","Mike Lambeta","Jitendra Malik","Mrinal Kalakrishnan","Roberto Calandra","Michael Kaess","Joseph Ortiz","Mustafa Mukadam"],"pdf_url":"https://arxiv.org/pdf/2312.13469v1.pdf","comment":"43 pages, 20 figures, 1 table; https://suddhu.github.io/neural-feels/"},{"id":"http://arxiv.org/abs/2310.03223v3","updated":"2023-12-20T22:30:33Z","published":"2023-10-05T00:45:04Z","title":"TacoGFN: Target Conditioned GFlowNet for Structure-Based Drug Design","summary":" We seek to automate the generation of drug-like compounds conditioned to\nspecific protein pocket targets. Most current methods approximate the\nprotein-molecule distribution of a finite dataset and, therefore struggle to\ngenerate molecules with significant binding improvement over the training\ndataset. We instead frame the pocket-conditioned molecular generation task as\nan RL problem and develop TacoGFN, a target conditional Generative Flow Network\nmodel. Our method is explicitly encouraged to generate molecules with desired\nproperties as opposed to fitting on a pre-existing data distribution. To this\nend, we develop transformer-based docking score prediction to speed up docking\nscore computation and propose TacoGFN to explore molecule space efficiently.\nFurthermore, we incorporate several rounds of active learning where generated\nsamples are queried using a docking oracle to improve the docking score\nprediction. This approach allows us to accurately explore as much of the\nmolecule landscape as we can afford computationally. Empirically, molecules\ngenerated using TacoGFN and its variants significantly outperform all baseline\nmethods across every property (Docking score, QED, SA, Lipinski), while being\norders of magnitude faster.\n","authors":["Tony Shen","Mohit Pandey","Jason Smith","Artem Cherkasov","Martin Ester"],"pdf_url":"https://arxiv.org/pdf/2310.03223v3.pdf","comment":"Accepted at NeurIPS 2023 AID3 and at NeurIPS 2023 GenBio as Spotlight"},{"id":"http://arxiv.org/abs/2312.13455v1","updated":"2023-12-20T22:15:10Z","published":"2023-12-20T22:15:10Z","title":"Revisiting Deep Generalized Canonical Correlation Analysis","summary":" Canonical correlation analysis (CCA) is a classic statistical method for\ndiscovering latent co-variation that underpins two or more observed random\nvectors. Several extensions and variations of CCA have been proposed that have\nstrengthened our capabilities in terms of revealing common random factors from\nmultiview datasets. In this work, we first revisit the most recent\ndeterministic extensions of deep CCA and highlight the strengths and\nlimitations of these state-of-the-art methods. Some methods allow trivial\nsolutions, while others can miss weak common factors. Others overload the\nproblem by also seeking to reveal what is not common among the views -- i.e.,\nthe private components that are needed to fully reconstruct each view. The\nlatter tends to overload the problem and its computational and sample\ncomplexities. Aiming to improve upon these limitations, we design a novel and\nefficient formulation that alleviates some of the current restrictions. The\nmain idea is to model the private components as conditionally independent given\nthe common ones, which enables the proposed compact formulation. In addition,\nwe also provide a sufficient condition for identifying the common random\nfactors. Judicious experiments with synthetic and real datasets showcase the\nvalidity of our claims and the effectiveness of the proposed approach.\n","authors":["Paris A. Karakasis","Nicholas D. Sidiropoulos"],"pdf_url":"https://arxiv.org/pdf/2312.13455v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13454v1","updated":"2023-12-20T22:13:45Z","published":"2023-12-20T22:13:45Z","title":"MixEHR-SurG: a joint proportional hazard and guided topic model for\n inferring mortality-associated topics from electronic health records","summary":" Objective: To improve survival analysis using EHR data, we aim to develop a\nsupervised topic model called MixEHR-SurG to simultaneously integrate\nheterogeneous EHR data and model survival hazard.\n Materials and Methods: Our technical contributions are three-folds: (1)\nintegrating EHR topic inference with Cox proportional hazards likelihood; (2)\ninferring patient-specific topic hyperparameters using the PheCode concepts\nsuch that each topic can be identified with exactly one PheCode-associated\nphenotype; (3) multi-modal survival topic inference. This leads to a highly\ninterpretable survival and guided topic model that can infer PheCode-specific\nphenotype topics associated with patient mortality. We evaluated MixEHR-G using\na simulated dataset and two real-world EHR datasets: the Quebec Congenital\nHeart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient\nclaim data of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458\nsubjects with multi-modal EHR records.\n Results: Compared to the baselines, MixEHR-G achieved a superior dynamic\nAUROC for mortality prediction, with a mean AUROC score of 0.89 in the\nsimulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively,\nMixEHR-G associates severe cardiac conditions with high mortality risk among\nthe CHD patients after the first heart failure hospitalization and critical\nbrain injuries with increased mortality among the MIMIC-III patients after\ntheir ICU discharge.\n Conclusion: The integration of the Cox proportional hazards model and EHR\ntopic inference in MixEHR-SurG led to not only competitive mortality prediction\nbut also meaningful phenotype topics for systematic survival analysis. The\nsoftware is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.\n","authors":["Yixuan Li","Ariane Marelli","Archer Y. Yang","Yue Li"],"pdf_url":"https://arxiv.org/pdf/2312.13454v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13451v1","updated":"2023-12-20T22:11:54Z","published":"2023-12-20T22:11:54Z","title":"Learning the Factors Controlling Mineralization for Geologic Carbon\n Sequestration","summary":" We perform a set of flow and reactive transport simulations within\nthree-dimensional fracture networks to learn the factors controlling mineral\nreactions. CO$_2$ mineralization requires CO$_2$-laden water, dissolution of a\nmineral that then leads to precipitation of a CO$_2$-bearing mineral. Our\ndiscrete fracture networks (DFN) are partially filled with quartz that\ngradually dissolves until it reaches a quasi-steady state. At the end of the\nsimulation, we measure the quartz remaining in each fracture within the domain.\nWe observe that a small backbone of fracture exists, where the quartz is fully\ndissolved which leads to increased flow and transport. However, depending on\nthe DFN topology and the rate of dissolution, we observe a large variability of\nthese changes, which indicates an interplay between the fracture network\nstructure and the impact of geochemical dissolution. In this work, we developed\na machine learning framework to extract the important features that support\nmineralization in the form of dissolution. In addition, we use structural and\ntopological features of the fracture network to predict the remaining quartz\nvolume in quasi-steady state conditions. As a first step to characterizing\ncarbon mineralization, we study dissolution with this framework. We studied a\nvariety of reaction and fracture parameters and their impact on the dissolution\nof quartz in fracture networks. We found that the dissolution reaction rate\nconstant of quartz and the distance to the flowing backbone in the fracture\nnetwork are the two most important features that control the amount of quartz\nleft in the system. For the first time, we use a combination of a finite-volume\nreservoir model and graph-based approach to study reactive transport in a\ncomplex fracture network to determine the key features that control\ndissolution.\n","authors":["Aleksandra Pachalieva","Jeffrey D. Hyman","Daniel O'Malley","Hari Viswanathan","Gowri Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2312.13451v1.pdf","comment":"23 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.05964v2","updated":"2023-12-20T22:10:27Z","published":"2023-12-10T18:43:37Z","title":"ConSequence: Synthesizing Logically Constrained Sequences for Electronic\n Health Record Generation","summary":" Generative models can produce synthetic patient records for analytical tasks\nwhen real data is unavailable or limited. However, current methods struggle\nwith adhering to domain-specific knowledge and removing invalid data. We\npresent ConSequence, an effective approach to integrating domain knowledge into\nsequential generative neural network outputs. Our rule-based formulation\nincludes temporal aggregation and antecedent evaluation modules, ensured by an\nefficient matrix multiplication formulation, to satisfy hard and soft logical\nconstraints across time steps. Existing constraint methods often fail to\nguarantee constraint satisfaction, lack the ability to handle temporal\nconstraints, and hinder the learning and computational efficiency of the model.\nIn contrast, our approach efficiently handles all types of constraints with\nguaranteed logical coherence. We demonstrate ConSequence's effectiveness in\ngenerating electronic health records, outperforming competitors in achieving\ncomplete temporal and spatial constraint satisfaction without compromising\nruntime performance or generative quality. Specifically, ConSequence\nsuccessfully prevents all rule violations while improving the model quality in\nreducing its test perplexity by 5% and incurring less than a 13% slowdown in\ngeneration speed compared to an unconstrained model.\n","authors":["Brandon Theodorou","Shrusti Jain","Cao Xiao","Jimeng Sun"],"pdf_url":"https://arxiv.org/pdf/2312.05964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02679v2","updated":"2023-12-20T22:09:19Z","published":"2023-10-04T09:39:05Z","title":"Diffusion Generative Flow Samplers: Improving learning signals through\n partial trajectory optimization","summary":" We tackle the problem of sampling from intractable high-dimensional density\nfunctions, a fundamental task that often appears in machine learning and\nstatistics. We extend recent sampling-based approaches that leverage controlled\nstochastic processes to model approximate samples from these target densities.\nThe main drawback of these approaches is that the training objective requires\nfull trajectories to compute, resulting in sluggish credit assignment issues\ndue to use of entire trajectories and a learning signal present only at the\nterminal time. In this work, we present Diffusion Generative Flow Samplers\n(DGFS), a sampling-based framework where the learning process can be tractably\nbroken down into short partial trajectory segments, via parameterizing an\nadditional \"flow function\". Our method takes inspiration from the theory\ndeveloped for generative flow networks (GFlowNets), allowing us to make use of\nintermediate learning signals. Through various challenging experiments, we\ndemonstrate that DGFS achieves more accurate estimates of the normalization\nconstant than closely-related prior methods.\n","authors":["Dinghuai Zhang","Ricky T. Q. Chen","Cheng-Hao Liu","Aaron Courville","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2310.02679v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.14404v5","updated":"2023-12-20T21:30:03Z","published":"2022-10-26T01:00:57Z","title":"Adversarial Purification with the Manifold Hypothesis","summary":" In this work, we formulate a novel framework for adversarial robustness using\nthe manifold hypothesis. This framework provides sufficient conditions for\ndefending against adversarial examples. We develop an adversarial purification\nmethod with this framework. Our method combines manifold learning with\nvariational inference to provide adversarial robustness without the need for\nexpensive adversarial training. Experimentally, our approach can provide\nadversarial robustness even if attackers are aware of the existence of the\ndefense. In addition, our method can also serve as a test-time defense\nmechanism for variational autoencoders.\n","authors":["Zhaoyuan Yang","Zhiwei Xu","Jing Zhang","Richard Hartley","Peter Tu"],"pdf_url":"https://arxiv.org/pdf/2210.14404v5.pdf","comment":"Extended version of paper accepted at AAAI 2024 with supplementary\n materials"},{"id":"http://arxiv.org/abs/2312.13438v1","updated":"2023-12-20T21:29:00Z","published":"2023-12-20T21:29:00Z","title":"Independent Mechanism Analysis and the Manifold Hypothesis","summary":" Independent Mechanism Analysis (IMA) seeks to address non-identifiability in\nnonlinear Independent Component Analysis (ICA) by assuming that the Jacobian of\nthe mixing function has orthogonal columns. As typical in ICA, previous work\nfocused on the case with an equal number of latent components and observed\nmixtures. Here, we extend IMA to settings with a larger number of mixtures that\nreside on a manifold embedded in a higher-dimensional than the latent space --\nin line with the manifold hypothesis in representation learning. For this\nsetting, we show that IMA still circumvents several non-identifiability issues,\nsuggesting that it can also be a beneficial principle for higher-dimensional\nobservations when the manifold hypothesis holds. Further, we prove that the IMA\nprinciple is approximately satisfied with high probability (increasing with the\nnumber of observed mixtures) when the directions along which the latent\ncomponents influence the observations are chosen independently at random. This\nprovides a new and rigorous statistical interpretation of IMA.\n","authors":["Shubhangi Ghosh","Luigi Gresele","Julius von Kügelgen","Michel Besserve","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2312.13438v1.pdf","comment":"6 pages, Accepted at Neurips Causal Representation Learning 2023"},{"id":"http://arxiv.org/abs/2312.13437v1","updated":"2023-12-20T21:28:35Z","published":"2023-12-20T21:28:35Z","title":"A General Model for Aggregating Annotations Across Simple, Complex, and\n Multi-Object Annotation Tasks","summary":" Human annotations are vital to supervised learning, yet annotators often\ndisagree on the correct label, especially as annotation tasks increase in\ncomplexity. A strategy to improve label quality is to ask multiple annotators\nto label the same item and aggregate their labels. Many aggregation models have\nbeen proposed for categorical or numerical annotation tasks, but far less work\nhas considered more complex annotation tasks involving open-ended,\nmultivariate, or structured responses. While a variety of bespoke models have\nbeen proposed for specific tasks, our work is the first to introduce\naggregation methods that generalize across many diverse complex tasks,\nincluding sequence labeling, translation, syntactic parsing, ranking, bounding\nboxes, and keypoints. This generality is achieved by devising a task-agnostic\nmethod to model distances between labels rather than the labels themselves.\n This article extends our prior work with investigation of three new research\nquestions. First, how do complex annotation properties impact aggregation\naccuracy? Second, how should a task owner navigate the many modeling choices to\nmaximize aggregation accuracy? Finally, what diagnoses can verify that\naggregation models are specified correctly for the given data? To understand\nhow various factors impact accuracy and to inform model selection, we conduct\nsimulation studies and experiments on real, complex datasets. Regarding\ntesting, we introduce unit tests for aggregation models and present a suite of\nsuch tests to ensure that a given model is not mis-specified and exhibits\nexpected behavior.\n Beyond investigating these research questions above, we discuss the\nfoundational concept of annotation complexity, present a new aggregation model\nas a bridge between traditional models and our own, and contribute a new\nsemi-supervised learning method for complex label aggregation that outperforms\nprior work.\n","authors":["Alexander Braylan","Madalyn Marabella","Omar Alonso","Matthew Lease"],"pdf_url":"https://arxiv.org/pdf/2312.13437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13426v1","updated":"2023-12-20T21:12:19Z","published":"2023-12-20T21:12:19Z","title":"Consistent Long-Term Forecasting of Ergodic Dynamical Systems","summary":" We study the evolution of distributions under the action of an ergodic\ndynamical system, which may be stochastic in nature. By employing tools from\nKoopman and transfer operator theory one can evolve any initial distribution of\nthe state forward in time, and we investigate how estimators of these operators\nperform on long-term forecasting. Motivated by the observation that standard\nestimators may fail at this task, we introduce a learning paradigm that neatly\ncombines classical techniques of eigenvalue deflation from operator theory and\nfeature centering from statistics. This paradigm applies to any operator\nestimator based on empirical risk minimization, making them satisfy learning\nbounds which hold uniformly on the entire trajectory of future distributions,\nand abide to the conservation of mass for each of the forecasted distributions.\nNumerical experiments illustrates the advantages of our approach in practice.\n","authors":["Prune Inzerilli","Vladimir Kostic","Karim Lounici","Pietro Novelli","Massimiliano Pontil"],"pdf_url":"https://arxiv.org/pdf/2312.13426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.10512v2","updated":"2023-12-20T20:56:14Z","published":"2023-03-18T22:36:25Z","title":"AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning","summary":" Fine-tuning large pre-trained language models on downstream tasks has become\nan important paradigm in NLP. However, common practice fine-tunes all of the\nparameters in a pre-trained model, which becomes prohibitive when a large\nnumber of downstream tasks are present. Therefore, many fine-tuning methods are\nproposed to learn incremental updates of pre-trained weights in a parameter\nefficient way, e.g., low-rank increments. These methods often evenly distribute\nthe budget of incremental updates across all pre-trained weight matrices, and\noverlook the varying importance of different weight parameters. As a\nconsequence, the fine-tuning performance is suboptimal. To bridge this gap, we\npropose AdaLoRA, which adaptively allocates the parameter budget among weight\nmatrices according to their importance score. In particular, AdaLoRA\nparameterizes the incremental updates in the form of singular value\ndecomposition. Such a novel approach allows us to effectively prune the\nsingular values of unimportant updates, which is essentially to reduce their\nparameter budget but circumvent intensive exact SVD computations. We conduct\nextensive experiments with several pre-trained models on natural language\nprocessing, question answering, and natural language generation to validate the\neffectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable\nimprovement over baselines, especially in the low budget settings. Our code is\npublicly available at https://github.com/QingruZhang/AdaLoRA .\n","authors":["Qingru Zhang","Minshuo Chen","Alexander Bukharin","Nikos Karampatziakis","Pengcheng He","Yu Cheng","Weizhu Chen","Tuo Zhao"],"pdf_url":"https://arxiv.org/pdf/2303.10512v2.pdf","comment":"The 11th International Conference on Learning Representations (ICLR\n 2023)"},{"id":"http://arxiv.org/abs/2312.13422v1","updated":"2023-12-20T20:52:01Z","published":"2023-12-20T20:52:01Z","title":"Texture Matching GAN for CT Image Enhancement","summary":" Deep neural networks (DNN) are commonly used to denoise and sharpen X-ray\ncomputed tomography (CT) images with the goal of reducing patient X-ray dosage\nwhile maintaining reconstruction quality. However, naive application of\nDNN-based methods can result in image texture that is undesirable in clinical\napplications. Alternatively, generative adversarial network (GAN) based methods\ncan produce appropriate texture, but naive application of GANs can introduce\ninaccurate or even unreal image detail. In this paper, we propose a texture\nmatching generative adversarial network (TMGAN) that enhances CT images while\ngenerating an image texture that can be matched to a target texture. We use\nparallel generators to separate anatomical features from the generated texture,\nwhich allows the GAN to be trained to match the desired texture without\ndirectly affecting the underlying CT image. We demonstrate that TMGAN generates\nenhanced image quality while also producing image texture that is desirable for\nclinical application.\n","authors":["Madhuri Nagare","Gregery T. Buzzard","Charles A. Bouman"],"pdf_url":"https://arxiv.org/pdf/2312.13422v1.pdf","comment":"Submitted to IEEE Transactions on Medical Imaging"},{"id":"http://arxiv.org/abs/2307.15043v2","updated":"2023-12-20T20:48:57Z","published":"2023-07-27T17:49:12Z","title":"Universal and Transferable Adversarial Attacks on Aligned Language\n Models","summary":" Because \"out-of-the-box\" large language models are capable of generating a\ngreat deal of objectionable content, recent work has focused on aligning these\nmodels in an attempt to prevent undesirable generation. While there has been\nsome success at circumventing these measures -- so-called \"jailbreaks\" against\nLLMs -- these attacks have required significant human ingenuity and are brittle\nin practice. In this paper, we propose a simple and effective attack method\nthat causes aligned language models to generate objectionable behaviors.\nSpecifically, our approach finds a suffix that, when attached to a wide range\nof queries for an LLM to produce objectionable content, aims to maximize the\nprobability that the model produces an affirmative response (rather than\nrefusing to answer). However, instead of relying on manual engineering, our\napproach automatically produces these adversarial suffixes by a combination of\ngreedy and gradient-based search techniques, and also improves over past\nautomatic prompt generation methods.\n Surprisingly, we find that the adversarial prompts generated by our approach\nare quite transferable, including to black-box, publicly released LLMs.\nSpecifically, we train an adversarial attack suffix on multiple prompts (i.e.,\nqueries asking for many different types of objectionable content), as well as\nmultiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting\nattack suffix is able to induce objectionable content in the public interfaces\nto ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,\nPythia, Falcon, and others. In total, this work significantly advances the\nstate-of-the-art in adversarial attacks against aligned language models,\nraising important questions about how such systems can be prevented from\nproducing objectionable information. Code is available at\ngithub.com/llm-attacks/llm-attacks.\n","authors":["Andy Zou","Zifan Wang","Nicholas Carlini","Milad Nasr","J. Zico Kolter","Matt Fredrikson"],"pdf_url":"https://arxiv.org/pdf/2307.15043v2.pdf","comment":"Website: http://llm-attacks.org/"},{"id":"http://arxiv.org/abs/2308.00788v3","updated":"2023-12-20T20:30:24Z","published":"2023-08-01T18:59:07Z","title":"An Introduction to Bi-level Optimization: Foundations and Applications\n in Signal Processing and Machine Learning","summary":" Recently, bi-level optimization (BLO) has taken center stage in some very\nexciting developments in the area of signal processing (SP) and machine\nlearning (ML). Roughly speaking, BLO is a classical optimization problem that\ninvolves two levels of hierarchy (i.e., upper and lower levels), wherein\nobtaining the solution to the upper-level problem requires solving the\nlower-level one. BLO has become popular largely because it is powerful in\nmodeling problems in SP and ML, among others, that involve optimizing nested\nobjective functions. Prominent applications of BLO range from resource\nallocation for wireless systems to adversarial machine learning. In this work,\nwe focus on a class of tractable BLO problems that often appear in SP and ML\napplications. We provide an overview of some basic concepts of this class of\nBLO problems, such as their optimality conditions, standard algorithms\n(including their optimization principles and practical implementations), as\nwell as how they can be leveraged to obtain state-of-the-art results for a\nnumber of key SP and ML applications. Further, we discuss some recent advances\nin BLO theory, its implications for applications, and point out some\nlimitations of the state-of-the-art that require significant future research\nefforts. Overall, we hope that this article can serve to accelerate the\nadoption of BLO as a generic tool to model, analyze, and innovate on a wide\narray of emerging SP and ML applications.\n","authors":["Yihua Zhang","Prashant Khanduri","Ioannis Tsaknakis","Yuguang Yao","Mingyi Hong","Sijia Liu"],"pdf_url":"https://arxiv.org/pdf/2308.00788v3.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2311.13073v2","updated":"2023-12-20T15:58:26Z","published":"2023-11-22T00:26:15Z","title":"FusionFrames: Efficient Architectural Aspects for Text-to-Video\n Generation Pipeline","summary":" Multimedia generation approaches occupy a prominent place in artificial\nintelligence research. Text-to-image models achieved high-quality results over\nthe last few years. However, video synthesis methods recently started to\ndevelop. This paper presents a new two-stage latent diffusion text-to-video\ngeneration architecture based on the text-to-image diffusion model. The first\nstage concerns keyframes synthesis to figure the storyline of a video, while\nthe second one is devoted to interpolation frames generation to make movements\nof the scene and objects smooth. We compare several temporal conditioning\napproaches for keyframes generation. The results show the advantage of using\nseparate temporal blocks over temporal layers in terms of metrics reflecting\nvideo generation quality aspects and human preference. The design of our\ninterpolation model significantly reduces computational costs compared to other\nmasked frame interpolation approaches. Furthermore, we evaluate different\nconfigurations of MoVQ-based video decoding scheme to improve consistency and\nachieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our\npipeline with existing solutions and achieve top-2 scores overall and top-1\namong open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:\nhttps://ai-forever.github.io/kandinsky-video/\n","authors":["Vladimir Arkhipkin","Zein Shaheen","Viacheslav Vasilev","Elizaveta Dakhova","Andrey Kuznetsov","Denis Dimitrov"],"pdf_url":"https://arxiv.org/pdf/2311.13073v2.pdf","comment":"Project page: https://ai-forever.github.io/kandinsky-video/"},{"id":"http://arxiv.org/abs/2312.12436v2","updated":"2023-12-20T12:40:47Z","published":"2023-12-19T18:59:22Z","title":"A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise","summary":" The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.\n","authors":["Chaoyou Fu","Renrui Zhang","Zihan Wang","Yubo Huang","Zhengye Zhang","Longtian Qiu","Gaoxiang Ye","Yunhang Shen","Mengdan Zhang","Peixian Chen","Sirui Zhao","Shaohui Lin","Deqiang Jiang","Di Yin","Peng Gao","Ke Li","Hongsheng Li","Xing Sun"],"pdf_url":"https://arxiv.org/pdf/2312.12436v2.pdf","comment":"Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models"},{"id":"http://arxiv.org/abs/2312.12680v1","updated":"2023-12-20T00:44:04Z","published":"2023-12-20T00:44:04Z","title":"Trajectory Approximation of Video Based on Phase Correlation for Forward\n Facing Camera","summary":" In this paper, we introduce an innovative approach for extracting\ntrajectories from a camera sensor in GPS-denied environments, leveraging visual\nodometry. The system takes video footage captured by a forward-facing camera\nmounted on a vehicle as input, with the output being a chain code representing\nthe camera's trajectory. The proposed methodology involves several key steps.\nFirstly, we employ phase correlation between consecutive frames of the video to\nextract essential information. Subsequently, we introduce a novel chain code\nmethod termed \"dynamic chain code,\" which is based on the x-shift values\nderived from the phase correlation. The third step involves determining\ndirectional changes (forward, left, right) by establishing thresholds and\nextracting the corresponding chain code. This extracted code is then stored in\na buffer for further processing. Notably, our system outperforms traditional\nmethods reliant on spatial features, exhibiting greater speed and robustness in\nnoisy environments. Importantly, our approach operates without external camera\ncalibration information. Moreover, by incorporating visual odometry, our system\nenhances its accuracy in estimating camera motion, providing a more\ncomprehensive understanding of trajectory dynamics. Finally, the system\nculminates in the visualization of the normalized camera motion trajectory.\n","authors":["Abdulkadhem A. Abdulkadhem"],"pdf_url":"https://arxiv.org/pdf/2312.12680v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.05152v2","updated":"2023-12-20T23:06:09Z","published":"2023-11-09T05:24:20Z","title":"Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual\n Downstream Tasks","summary":" In recent years, the deployment of large-scale pre-trained models in\naudio-visual downstream tasks has yielded remarkable outcomes. However, these\nmodels, primarily trained on single-modality unconstrained datasets, still\nencounter challenges in feature extraction for multi-modal tasks, leading to\nsuboptimal performance. This limitation arises due to the introduction of\nirrelevant modality-specific information during encoding, which adversely\naffects the performance of downstream tasks. To address this challenge, this\npaper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention\nmechanism. This mechanism leverages audio and visual modalities as soft prompts\nto dynamically adjust the parameters of pre-trained models based on the current\nmulti-modal input features. Specifically, the DG-SCT module incorporates\ntrainable cross-modal interaction layers into pre-trained audio-visual\nencoders, allowing adaptive extraction of crucial information from the current\nmodality across spatial, channel, and temporal dimensions, while preserving the\nfrozen parameters of large-scale pre-trained models. Experimental evaluations\ndemonstrate that our proposed model achieves state-of-the-art results across\nmultiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our\nmodel exhibits promising performance in challenging few-shot and zero-shot\nscenarios. The source code and pre-trained models are available at\nhttps://github.com/haoyi-duan/DG-SCT.\n","authors":["Haoyi Duan","Yan Xia","Mingze Zhou","Li Tang","Jieming Zhu","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2311.05152v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13470v1","updated":"2023-12-20T22:39:55Z","published":"2023-12-20T22:39:55Z","title":"Coffee: Cost-Effective Edge Caching for 360 Degree Live Video Streaming","summary":" While live 360 degree video streaming delivers immersive viewing experience,\nit poses significant bandwidth and latency challenges for content delivery\nnetworks. Edge servers are expected to play an important role in facilitating\nlive streaming of 360 degree videos. In this paper, we propose a novel\npredictive edge caching algorithm (Coffee) for live 360 degree video that\nemploy collaborative FoV prediction and predictive tile prefetching to reduce\nbandwidth consumption, streaming cost and improve the streaming quality and\nrobustness. Our light-weight caching algorithms exploit the unique tile\nconsumption patterns of live 360 degree video streaming to achieve high tile\ncaching gains. Through extensive experiments driven by real 360 degree video\nstreaming traces, we demonstrate that edge caching algorithms specifically\ndesigned for live 360 degree video streaming can achieve high streaming cost\nsavings with small edge cache space consumption. Coffee, guided by viewer FoV\npredictions, significantly reduces back-haul traffic up to 76% compared to\nstate-of-the-art edge caching algorithms. Furthermore, we develop a\ntranscoding-aware variant (TransCoffee) and evaluate it using comprehensive\nexperiments, which demonstrate that TransCoffee can achieve 63\\% lower cost\ncompared to state-of-the-art transcoding-aware approaches.\n","authors":["Chen Li","Tingwei Ye","Tongyu Zong","Liyang Sun","Houwei Cao","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13470v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08738v2","updated":"2023-12-20T22:20:46Z","published":"2023-09-15T19:56:15Z","title":"AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual\n Masked Autoencoder","summary":" Learning high-quality video representation has shown significant applications\nin computer vision and remains challenging. Previous work based on mask\nautoencoders such as ImageMAE and VideoMAE has proven the effectiveness of\nlearning representations in images and videos through reconstruction strategy\nin the visual modality. However, these models exhibit inherent limitations,\nparticularly in scenarios where extracting features solely from the visual\nmodality proves challenging, such as when dealing with low-resolution and\nblurry original videos. Based on this, we propose AV-MaskEnhancer for learning\nhigh-quality video representation by combining visual and audio information.\nOur approach addresses the challenge by demonstrating the complementary nature\nof audio and video features in cross-modality content. Moreover, our result of\nthe video classification task on the UCF101 dataset outperforms the existing\nwork and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a\ntop-5 accuracy of 99.9%.\n","authors":["Xingjian Diao","Ming Cheng","Shitong Cheng"],"pdf_url":"https://arxiv.org/pdf/2309.08738v2.pdf","comment":"2023 IEEE 35th International Conference on Tools with Artificial\n Intelligence (ICTAI)"},{"id":"http://arxiv.org/abs/2311.11059v2","updated":"2023-12-20T07:58:43Z","published":"2023-11-18T12:33:19Z","title":"HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment","summary":" We introduce HIDRO-VQA, a no-reference (NR) video quality assessment model\ndesigned to provide precise quality evaluations of High Dynamic Range (HDR)\nvideos. HDR videos exhibit a broader spectrum of luminance, detail, and color\nthan Standard Dynamic Range (SDR) videos. As HDR content becomes increasingly\npopular, there is a growing demand for video quality assessment (VQA)\nalgorithms that effectively address distortions unique to HDR content. To\naddress this challenge, we propose a self-supervised contrastive fine-tuning\napproach to transfer quality-aware features from the SDR to the HDR domain,\nutilizing unlabeled HDR videos. Our findings demonstrate that self-supervised\npre-trained neural networks on SDR content can be further fine-tuned in a\nself-supervised setting using limited unlabeled HDR videos to achieve\nstate-of-the-art performance on the only publicly available VQA database for\nHDR content, the LIVE-HDR VQA database. Moreover, our algorithm can be extended\nto the Full Reference VQA setting, also achieving state-of-the-art performance.\nOur code is available publicly at https://github.com/avinabsaha/HIDRO-VQA.\n","authors":["Shreshth Saini","Avinab Saha","Alan C. Bovik"],"pdf_url":"https://arxiv.org/pdf/2311.11059v2.pdf","comment":"WACV 2024 Workshop Paper. Shreshth Saini, Avinab Saha contributed\n equally to this work"}]},"2023-12-21T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2312.11462v2","updated":"2023-12-21T18:46:59Z","published":"2023-12-18T18:59:46Z","title":"Cascade Speculative Drafting for Even Faster LLM Inference","summary":" Speculative decoding enhances the efficiency of large language models (LLMs)\nby leveraging a draft model to draft for a larger target model to review.\nHowever, drafting in speculative decoding involves slow autoregressive\ngeneration and generating tokens of different importance with the same time\nallocation. These two inefficiencies lead to its suboptimal performance. To\naddress this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a\nnovel approach that employs two types of cascades. The Vertical Cascade\neliminates autoregressive generation from neural models. The Horizontal Cascade\nconstitutes efficient time allocation in drafting with its optimality supported\nby our theoretical analysis. Combining both cascades, our CS. Drafting\nalgorithm has achieved up to 72 percent additional speedup over speculative\ndecoding in our experiments while keeping the same output distribution.\n","authors":["Ziyi Chen","Xiaocong Yang","Jiacheng Lin","Chenkai Sun","Jie Huang","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2312.11462v2.pdf","comment":"Preprint in progress"},{"id":"http://arxiv.org/abs/2310.14859v3","updated":"2023-12-21T18:19:58Z","published":"2023-10-23T12:29:10Z","title":"3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for\n Embodied Turn-Taking Prediction","summary":" Predicting turn-taking in multiparty conversations has many practical\napplications in human-computer/robot interaction. However, the complexity of\nhuman communication makes it a challenging task. Recent advances have shown\nthat synchronous multi-perspective egocentric data can significantly improve\nturn-taking prediction compared to asynchronous, single-perspective\ntranscriptions. Building on this research, we propose a new multimodal\ntransformer-based architecture for predicting turn-taking in embodied,\nsynchronized multi-perspective data. Our experimental results on the recently\nintroduced EgoCom dataset show a substantial performance improvement of up to\n14.01% on average compared to existing baselines and alternative\ntransformer-based approaches. The source code, and the pre-trained models of\nour 3M-Transformer will be available upon acceptance.\n","authors":["Mehdi Fatan","Emanuele Mincato","Dimitra Pintzou","Mariella Dimiccoli"],"pdf_url":"https://arxiv.org/pdf/2310.14859v3.pdf","comment":"Accepted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.14069v1","updated":"2023-12-21T17:47:33Z","published":"2023-12-21T17:47:33Z","title":"EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in\n Speech-to-Speech Models","summary":" We introduce EmphAssess, a prosodic benchmark designed to evaluate the\ncapability of speech-to-speech models to encode and reproduce prosodic\nemphasis. We apply this to two tasks: speech resynthesis and speech-to-speech\ntranslation. In both cases, the benchmark evaluates the ability of the model to\nencode emphasis in the speech input and accurately reproduce it in the output,\npotentially across a change of speaker and language. As part of the evaluation\npipeline, we introduce EmphaClass, a new model that classifies emphasis at the\nframe or word level.\n","authors":["Maureen de Seyssel","Antony D'Avirro","Adina Williams","Emmanuel Dupoux"],"pdf_url":"https://arxiv.org/pdf/2312.14069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14033v1","updated":"2023-12-21T17:02:06Z","published":"2023-12-21T17:02:06Z","title":"T-Eval: Evaluating the Tool Utilization Capability Step by Step","summary":" Large language models (LLM) have achieved remarkable performance on various\nNLP tasks and are augmented by tools for broader applications. Yet, how to\nevaluate and analyze the tool-utilization capability of LLMs is still\nunder-explored. In contrast to previous works that evaluate models\nholistically, we comprehensively decompose the tool utilization into multiple\nsub-processes, including instruction following, planning, reasoning, retrieval,\nunderstanding, and review. Based on that, we further introduce \\shortname~to\nevaluate the tool utilization capability step by step. \\shortname~disentangles\nthe tool utilization evaluation into several sub-domains along model\ncapabilities, facilitating the inner understanding of both holistic and\nisolated competency of LLMs. We conduct extensive experiments on \\shortname~and\nin-depth analysis of various LLMs. \\shortname~ not only exhibits consistency\nwith the outcome-oriented evaluation but also provides a more fine-grained\nanalysis of the capabilities of LLMs, providing a new perspective in LLM\nevaluation on tool-utilization ability. The benchmark will be available at\n\\href{https://github.com/open-compass/T-Eval}{https://github.com/open-compass/T-Eval}.\n","authors":["Zehui Chen","Weihua Du","Wenwei Zhang","Kuikun Liu","Jiangning Liu","Miao Zheng","Jingming Zhuo","Songyang Zhang","Dahua Lin","Kai Chen","Feng Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.14033v1.pdf","comment":"Code: https://github.com/open-compass/T-Eval"},{"id":"http://arxiv.org/abs/2307.14367v2","updated":"2023-12-21T16:46:35Z","published":"2023-07-25T09:35:43Z","title":"Prot2Text: Multimodal Protein's Function Generation with GNNs and\n Transformers","summary":" The complex nature of big biological systems pushed some scientists to\nclassify its understanding under the inconceivable missions. Different leveled\nchallenges complicated this task, one of is the prediction of a protein's\nfunction. In recent years, significant progress has been made in this field\nthrough the development of various machine learning approaches. However, most\nexisting methods formulate the task as a multi-classification problem, i.e\nassigning predefined labels to proteins. In this work, we propose a novel\napproach, \\textbf{Prot2Text}, which predicts a protein function's in a free\ntext style, moving beyond the conventional binary or categorical\nclassifications. By combining Graph Neural Networks(GNNs) and Large Language\nModels(LLMs), in an encoder-decoder framework, our model effectively integrates\ndiverse data types including proteins' sequences, structures, and textual\nannotations. This multimodal approach allows for a holistic representation of\nproteins' functions, enabling the generation of detailed and accurate\ndescriptions. To evaluate our model, we extracted a multimodal protein dataset\nfrom SwissProt, and demonstrate empirically the effectiveness of Prot2Text.\nThese results highlight the transformative impact of multimodal models,\nspecifically the fusion of GNNs and LLMs, empowering researchers with powerful\ntools for more accurate prediction of proteins' functions. The code, the models\nand a demo will be publicly released.\n","authors":["Hadi Abdine","Michail Chatzianastasis","Costas Bouyioukos","Michalis Vazirgiannis"],"pdf_url":"https://arxiv.org/pdf/2307.14367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11032v2","updated":"2023-12-21T16:13:05Z","published":"2023-03-20T11:34:37Z","title":"DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4","summary":" The digitization of healthcare has facilitated the sharing and re-using of\nmedical data but has also raised concerns about confidentiality and privacy.\nHIPAA (Health Insurance Portability and Accountability Act) mandates removing\nre-identifying information before the dissemination of medical records. Thus,\neffective and efficient solutions for de-identifying medical data, especially\nthose in free-text forms, are highly needed. While various computer-assisted\nde-identification methods, including both rule-based and learning-based, have\nbeen developed and used in prior practice, such solutions still lack\ngeneralizability or need to be fine-tuned according to different scenarios,\nsignificantly imposing restrictions in wider use. The advancement of large\nlanguage models (LLM), such as ChatGPT and GPT-4, have shown great potential in\nprocessing text data in the medical domain with zero-shot in-context learning,\nespecially in the task of privacy protection, as these models can identify\nconfidential information by their powerful named entity recognition (NER)\ncapability. In this work, we developed a novel GPT4-enabled de-identification\nframework (``DeID-GPT\") to automatically identify and remove the identifying\ninformation. Compared to existing commonly used medical text data\nde-identification methods, our developed DeID-GPT showed the highest accuracy\nand remarkable reliability in masking private information from the unstructured\nmedical text while preserving the original structure and meaning of the text.\nThis study is one of the earliest to utilize ChatGPT and GPT-4 for medical text\ndata processing and de-identification, which provides insights for further\nresearch and solution development on the use of LLMs such as ChatGPT/GPT-4 in\nhealthcare. Codes and benchmarking data information are available at\nhttps://github.com/yhydhx/ChatGPT-API.\n","authors":["Zhengliang Liu","Yue Huang","Xiaowei Yu","Lu Zhang","Zihao Wu","Chao Cao","Haixing Dai","Lin Zhao","Yiwei Li","Peng Shu","Fang Zeng","Lichao Sun","Wei Liu","Dinggang Shen","Quanzheng Li","Tianming Liu","Dajiang Zhu","Xiang Li"],"pdf_url":"https://arxiv.org/pdf/2303.11032v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13961v1","updated":"2023-12-21T15:46:36Z","published":"2023-12-21T15:46:36Z","title":"ChatGPT as a commenter to the news: can LLMs generate human-like\n opinions?","summary":" ChatGPT, GPT-3.5, and other large language models (LLMs) have drawn\nsignificant attention since their release, and the abilities of these models\nhave been investigated for a wide variety of tasks. In this research we\ninvestigate to what extent GPT-3.5 can generate human-like comments on Dutch\nnews articles. We define human likeness as `not distinguishable from human\ncomments', approximated by the difficulty of automatic classification between\nhuman and GPT comments. We analyze human likeness across multiple prompting\ntechniques. In particular, we utilize zero-shot, few-shot and context prompts,\nfor two generated personas. We found that our fine-tuned BERT models can easily\ndistinguish human-written comments from GPT-3.5 generated comments, with none\nof the used prompting methods performing noticeably better. We further analyzed\nthat human comments consistently showed higher lexical diversity than\nGPT-generated comments. This indicates that although generative LLMs can\ngenerate fluent text, their capability to create human-like opinionated\ncomments is still limited.\n","authors":["Rayden Tseng","Suzan Verberne","Peter van der Putten"],"pdf_url":"https://arxiv.org/pdf/2312.13961v1.pdf","comment":"Published as Tseng, R., Verberne, S., van der Putten, P. (2023).\n ChatGPT as a Commenter to the News: Can LLMs Generate Human-Like Opinions?.\n In: Ceolin, D., Caselli, T., Tulin, M. (eds) Disinformation in Open Online\n Media. MISDOOM 2023. Lecture Notes in Computer Science, vol 14397. Springer,\n Cham"},{"id":"http://arxiv.org/abs/2312.13951v1","updated":"2023-12-21T15:38:41Z","published":"2023-12-21T15:38:41Z","title":"Typhoon: Thai Large Language Models","summary":" Typhoon is a series of Thai large language models (LLMs) developed\nspecifically for the Thai language. This technical report presents challenges\nand insights in developing Thai LLMs, including data preparation, pretraining,\ninstruction-tuning, and evaluation. As one of the challenges of low-resource\nlanguages is the amount of pretraining data, we apply continual training to\ntransfer existing world knowledge from a strong LLM. To evaluate the Thai\nknowledge encapsulated in each model from the pretraining stage, we develop\nThaiExam, a benchmark based on examinations for high-school students and\ninvestment professionals in Thailand. In addition, we fine-tune Typhoon to\nfollow Thai instructions, and we evaluate instruction-tuned models on Thai\ninstruction datasets as well as translation, summarization, and\nquestion-answering tasks. Experimental results on a suite of Thai benchmarks\nshow that Typhoon outperforms all open-source Thai language models, and its\nperformance is on par with GPT-3.5 in Thai while having only 7 billion\nparameters and being 2.62 times more efficient in tokenizing Thai text.\n","authors":["Kunat Pipatanakul","Phatrasek Jirabovonvisut","Potsawee Manakul","Sittipong Sripaisarnmongkol","Ruangsak Patomwong","Pathomporn Chokchainant","Kasima Tharnpipitchai"],"pdf_url":"https://arxiv.org/pdf/2312.13951v1.pdf","comment":"technical report, 12 pages"},{"id":"http://arxiv.org/abs/2312.13933v1","updated":"2023-12-21T15:28:02Z","published":"2023-12-21T15:28:02Z","title":"Structured Probabilistic Coding","summary":" This paper presents a new supervised representation learning framework,\nnamely Structured Probabilistic Coding (SPC), to learn compact and informative\nrepresentations from input related to the target task. SPC is an encoder-only\nprobabilistic coding technology with a structured regularization from the\ntarget label space. By extracting compact and informative representations from\ninput related to the target task, SPC can enhance the generalization ability of\npre-trained language models for better language understanding. Specifically,\nthe hidden representation is encoded into a Gaussian distribution space, while\nmaximizing the prior entropy of latent representations concerning label space.\nThis technique can simultaneously perform information encoding and task\nprediction in one module to more fully utilize the effective information from\ninput data, and use variational inference in the output space to reduce\nrandomness and uncertainty. To better control the probability distribution in\nthe latent space, a structured regularization is proposed to promote\nclass-level uniformity in the latent space. With the regularization term, SPC\ncan preserve the Gaussian distribution structure of latent code as well as\nbetter cover the hidden space with class uniformly. We conduct evaluations on\n12 natural language understanding tasks. The results show that our SPC can\neffectively improve the performance of pre-trained language models for various\nclassification and regression tasks. Experiments demonstrate that SPC can\nenhance the generalization capability, robustness to label noise, and\nclustering quality of output representations.\n","authors":["Dou Hu","Lingwei Wei","Yaxin Liu","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2312.13933v1.pdf","comment":"11 pages, accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2308.12466v2","updated":"2023-12-21T15:14:46Z","published":"2023-08-23T23:16:35Z","title":"Are ChatGPT and GPT-4 Good Poker Players? -- A Pre-Flop Analysis","summary":" Since the introduction of ChatGPT and GPT-4, these models have been tested\nacross a large number of tasks. Their adeptness across domains is evident, but\ntheir aptitude in playing games, and specifically their aptitude in the realm\nof poker has remained unexplored. Poker is a game that requires decision making\nunder uncertainty and incomplete information. In this paper, we put ChatGPT and\nGPT-4 through the poker test and evaluate their poker skills. Our findings\nreveal that while both models display an advanced understanding of poker,\nencompassing concepts like the valuation of starting hands, playing positions\nand other intricacies of game theory optimal (GTO) poker, both ChatGPT and\nGPT-4 are NOT game theory optimal poker players.\n Profitable strategies in poker are evaluated in expectations over large\nsamples. Through a series of experiments, we first discover the characteristics\nof optimal prompts and model parameters for playing poker with these models.\nOur observations then unveil the distinct playing personas of the two models.\nWe first conclude that GPT-4 is a more advanced poker player than ChatGPT. This\nexploration then sheds light on the divergent poker tactics of the two models:\nChatGPT's conservativeness juxtaposed against GPT-4's aggression. In poker\nvernacular, when tasked to play GTO poker, ChatGPT plays like a nit, which\nmeans that it has a propensity to only engage with premium hands and folds a\nmajority of hands. When subjected to the same directive, GPT-4 plays like a\nmaniac, showcasing a loose and aggressive style of play. Both strategies,\nalthough relatively advanced, are not game theory optimal.\n","authors":["Akshat Gupta"],"pdf_url":"https://arxiv.org/pdf/2308.12466v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13905v1","updated":"2023-12-21T14:51:04Z","published":"2023-12-21T14:51:04Z","title":"Domain-Specific Fine-Tuning of Large Language Models for Interactive\n Robot Programming","summary":" Industrial robots are applied in a widening range of industries, but robot\nprogramming mostly remains a task limited to programming experts. We propose a\nnatural language-based assistant for programming of advanced, industrial\nrobotic applications and investigate strategies for domain-specific fine-tuning\nof foundation models with limited data and compute.\n","authors":["Benjamin Alt","Urs Keßner","Aleksandar Taranovic","Darko Katic","Andreas Hermann","Rainer Jäkel","Gerhard Neumann"],"pdf_url":"https://arxiv.org/pdf/2312.13905v1.pdf","comment":"5 pages, 1 figure, accepted to the 2024 European Robotics Forum"},{"id":"http://arxiv.org/abs/2312.13881v1","updated":"2023-12-21T14:26:57Z","published":"2023-12-21T14:26:57Z","title":"Diversifying Knowledge Enhancement of Biomedical Language Models using\n Adapter Modules and Knowledge Graphs","summary":" Recent advances in natural language processing (NLP) owe their success to\npre-training language models on large amounts of unstructured data. Still,\nthere is an increasing effort to combine the unstructured nature of LMs with\nstructured knowledge and reasoning. Particularly in the rapidly evolving field\nof biomedical NLP, knowledge-enhanced language models (KELMs) have emerged as\npromising tools to bridge the gap between large language models and\ndomain-specific knowledge, considering the available biomedical knowledge\ngraphs (KGs) curated by experts over the decades. In this paper, we develop an\napproach that uses lightweight adapter modules to inject structured biomedical\nknowledge into pre-trained language models (PLMs). We use two large KGs, the\nbiomedical knowledge system UMLS and the novel biochemical ontology OntoChem,\nwith two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. The approach\nincludes partitioning knowledge graphs into smaller subgraphs, fine-tuning\nadapter modules for each subgraph, and combining the knowledge in a fusion\nlayer. We test the performance on three downstream tasks: document\nclassification,question answering, and natural language inference. We show that\nour methodology leads to performance improvements in several instances while\nkeeping requirements in computing power low. Finally, we provide a detailed\ninterpretation of the results and report valuable insights for future work.\n","authors":["Juraj Vladika","Alexander Fichtl","Florian Matthes"],"pdf_url":"https://arxiv.org/pdf/2312.13881v1.pdf","comment":"Accepted as Full Paper to ICAART 2024"},{"id":"http://arxiv.org/abs/2312.13876v1","updated":"2023-12-21T14:20:06Z","published":"2023-12-21T14:20:06Z","title":"Capture the Flag: Uncovering Data Insights with Large Language Models","summary":" The extraction of a small number of relevant insights from vast amounts of\ndata is a crucial component of data-driven decision-making. However,\naccomplishing this task requires considerable technical skills, domain\nexpertise, and human labor. This study explores the potential of using Large\nLanguage Models (LLMs) to automate the discovery of insights in data,\nleveraging recent advances in reasoning and code generation techniques. We\npropose a new evaluation methodology based on a \"capture the flag\" principle,\nmeasuring the ability of such models to recognize meaningful and pertinent\ninformation (flags) in a dataset. We further propose two proof-of-concept\nagents, with different inner workings, and compare their ability to capture\nsuch flags in a real-world sales dataset. While the work reported here is\npreliminary, our results are sufficiently interesting to mandate future\nexploration by the community.\n","authors":["Issam Laradji","Perouz Taslakian","Sai Rajeswar","Valentina Zantedeschi","Alexandre Lacoste","Nicolas Chapados","David Vazquez","Christopher Pal","Alexandre Drouin"],"pdf_url":"https://arxiv.org/pdf/2312.13876v1.pdf","comment":"14 pages, 1 figure, Foundation Models for Decision Making Workshop at\n NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13871v1","updated":"2023-12-21T14:15:46Z","published":"2023-12-21T14:15:46Z","title":"Evaluating Task-oriented Dialogue Systems: A Systematic Review of\n Measures, Constructs and their Operationalisations","summary":" This review gives an extensive overview of evaluation methods for\ntask-oriented dialogue systems, paying special attention to practical\napplications of dialogue systems, for example for customer service. The review\n(1) provides an overview of the used constructs and metrics in previous work,\n(2) discusses challenges in the context of dialogue system evaluation and (3)\ndevelops a research agenda for the future of dialogue system evaluation. We\nconducted a systematic review of four databases (ACL, ACM, IEEE and Web of\nScience), which after screening resulted in 122 studies. Those studies were\ncarefully analysed for the constructs and methods they proposed for evaluation.\nWe found a wide variety in both constructs and methods. Especially the\noperationalisation is not always clearly reported. We hope that future work\nwill take a more critical approach to the operationalisation and specification\nof the used constructs. To work towards this aim, this review ends with\nrecommendations for evaluation and suggestions for outstanding questions.\n","authors":["Anouck Braggaar","Christine Liebrecht","Emiel van Miltenburg","Emiel Krahmer"],"pdf_url":"https://arxiv.org/pdf/2312.13871v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13866v1","updated":"2023-12-21T14:03:30Z","published":"2023-12-21T14:03:30Z","title":"Understanding Inter-Session Intentions via Complex Logical Reasoning","summary":" Understanding user intentions is crucial for enhancing product\nrecommendations, navigation suggestions, and query reformulations. However,\nuser intentions can be complex, involving multiple sessions and attribute\nrequirements connected by logical operators such as And, Or, and Not. For\nexample, a user may search for Nike or Adidas running shoes across various\nsessions, with a preference for the color purple. In another case, a user may\nhave purchased a mattress in a previous session and is now seeking a\ncorresponding bed frame without intending to buy another mattress. Prior\nresearch on session understanding has not sufficiently addressed how to make\nproduct or attribute recommendations for such complex intentions. In this\npaper, we introduce the task of logical session complex query answering, where\nsessions are treated as hyperedges of items, and we formulate the problem of\ncomplex intention understanding as a task of logical session complex queries\nanswering (LS-CQA) on an aggregated hypergraph of sessions, items, and\nattributes. The proposed task is a special type of complex query answering task\nwith sessions as ordered hyperedges. We also propose a new model, the Logical\nSession Graph Transformer (LSGT), which captures interactions among items\nacross different sessions and their logical connections using a transformer\nstructure. We analyze the expressiveness of LSGT and prove the permutation\ninvariance of the inputs for the logical operators. We evaluate LSGT on three\ndatasets and demonstrate that it achieves state-of-the-art results.\n","authors":["Jiaxin Bai","Chen Luo","Zheng Li","Qingyu Yin","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2312.13866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11562v3","updated":"2023-12-21T13:21:59Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v3.pdf","comment":"20 Figures, 160 Pages, 750+ References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2312.13816v1","updated":"2023-12-21T13:08:09Z","published":"2023-12-21T13:08:09Z","title":"Team Flow at DRC2023: Building Common Ground and Text-based Turn-taking\n in a Travel Agent Spoken Dialogue System","summary":" At the Dialogue Robot Competition 2023 (DRC2023), which was held to improve\nthe capability of dialogue robots, our team developed a system that could build\ncommon ground and take more natural turns based on user utterance texts. Our\nsystem generated queries for sightseeing spot searches using the common ground\nand engaged in dialogue while waiting for user comprehension.\n","authors":["Ryu Hirai","Shinya Iizuka","Haruhisa Iseno","Ao Guo","Jingjing Jiang","Atsumoto Ohashi","Ryuichiro Higashinaka"],"pdf_url":"https://arxiv.org/pdf/2312.13816v1.pdf","comment":"This paper is part of the proceedings of the Dialogue Robot\n Competition 2023"},{"id":"http://arxiv.org/abs/2312.13772v1","updated":"2023-12-21T11:55:10Z","published":"2023-12-21T11:55:10Z","title":"On Task Performance and Model Calibration with Supervised and\n Self-Ensembled In-Context Learning","summary":" Following the standard supervised fine-tuning (SFT) paradigm, in-context\nlearning (ICL) has become an efficient approach propelled by the recent\nadvancements in large language models (LLMs), yielding promising performance\nacross various tasks in few-shot data setups. However, both paradigms are prone\nto suffer from the critical problem of overconfidence (i.e., miscalibration),\nespecially in such limited data setups. In this work, we deliver an in-depth\nanalysis of the behavior across different choices of learning methods from the\nperspective of both performance and calibration, as well as their interplay.\nThrough extensive controlled experiments, we find that simultaneous gains for\nboth task performance and calibration are difficult to achieve, and the problem\nof miscalibration exists across all learning methods in low-resource\nscenarios.To address this challenging trade-off between performance and\ncalibration, we then investigate the potential of self-ensembling techniques\napplied at different modeling stages (e.g., variations of in-context examples\nor variations in prompts or different ensembling strategies). We justify the\nfeasibility of self-ensembling on SFT in addition to ICL, to make the\npredictions more calibrated and have comparable or even better performance. Our\nwork sheds light on which learning paradigm to choose and how to enhance both\ntask performance and calibration of LLMs.\n","authors":["Chengzu Li","Han Zhou","Goran Glavaš","Anna Korhonen","Ivan Vulić"],"pdf_url":"https://arxiv.org/pdf/2312.13772v1.pdf","comment":"9 pages, 4 figures, 5 tables (20 pages, 5 figures, 13 tables\n including references and appendices)"},{"id":"http://arxiv.org/abs/2312.11779v2","updated":"2023-12-21T11:45:55Z","published":"2023-12-19T01:28:46Z","title":"Are you talking to ['xem'] or ['x', 'em']? On Tokenization and\n Addressing Misgendering in LLMs with Pronoun Tokenization Parity","summary":" A large body of NLP research has documented the ways gender biases manifest\nand amplify within large language models (LLMs), though this research has\npredominantly operated within a gender binary-centric context. A growing body\nof work has identified the harmful limitations of this gender-exclusive\nframing; many LLMs cannot correctly and consistently refer to persons outside\nthe gender binary, especially if they use neopronouns. While data scarcity has\nbeen identified as a possible culprit, the precise mechanisms through which it\ninfluences LLM misgendering remain underexplored. Our work addresses this gap\nby studying data scarcity's role in subword tokenization and, consequently, the\nformation of LLM word representations. We uncover how the Byte-Pair Encoding\n(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun\nmisgendering through out-of-vocabulary behavior. We introduce pronoun\ntokenization parity (PTP), a novel approach to reduce LLM neopronoun\nmisgendering by preserving a token's functional structure. We evaluate PTP's\nefficacy using pronoun consistency-based metrics and a novel syntax-based\nmetric. Through several controlled experiments, finetuning LLMs with PTP\nimproves neopronoun consistency from 14.5% to 58.4%, highlighting the\nsignificant role tokenization plays in LLM pronoun consistency.\n","authors":["Anaelia Ovalle","Ninareh Mehrabi","Palash Goyal","Jwala Dhamala","Kai-Wei Chang","Richard Zemel","Aram Galstyan","Rahul Gupta"],"pdf_url":"https://arxiv.org/pdf/2312.11779v2.pdf","comment":"Accepted to 2023 Neurips Queer in AI workshop"},{"id":"http://arxiv.org/abs/2312.13766v1","updated":"2023-12-21T11:45:28Z","published":"2023-12-21T11:45:28Z","title":"Exploiting Contextual Target Attributes for Target Sentiment\n Classification","summary":" Existing PTLM-based models for TSC can be categorized into two groups: 1)\nfine-tuning-based models that adopt PTLM as the context encoder; 2)\nprompting-based models that transfer the classification task to the text/word\ngeneration task. In this paper, we present a new perspective of leveraging PTLM\nfor TSC: simultaneously leveraging the merits of both language modeling and\nexplicit target-context interactions via contextual target attributes.\nSpecifically, we design the domain- and target-constrained cloze test, which\ncan leverage the PTLMs' strong language modeling ability to generate the given\ntarget's attributes pertaining to the review context. The attributes contain\nthe background and property information of the target, which can help to enrich\nthe semantics of the review context and the target. To exploit the attributes\nfor tackling TSC, we first construct a heterogeneous information graph by\ntreating the attributes as nodes and combining them with (1) the syntax graph\nautomatically produced by the off-the-shelf dependency parser and (2) the\nsemantics graph of the review context, which is derived from the self-attention\nmechanism. Then we propose a heterogeneous information gated graph\nconvolutional network to model the interactions among the attribute\ninformation, the syntactic information, and the contextual information. The\nexperimental results on three benchmark datasets demonstrate the superiority of\nour model, which achieves new state-of-the-art performance.\n","authors":["Bowen Xing","Ivor W. Tsang"],"pdf_url":"https://arxiv.org/pdf/2312.13766v1.pdf","comment":"Accepted by Journal of Artificial Intelligence Research (JAIR)"},{"id":"http://arxiv.org/abs/2312.13764v1","updated":"2023-12-21T11:43:41Z","published":"2023-12-21T11:43:41Z","title":"A Semantic Space is Worth 256 Language Descriptions: Make Stronger\n Segmentation Models with Descriptive Properties","summary":" This paper introduces ProLab, a novel approach using property-level label\nspace for creating strong interpretable segmentation models. Instead of relying\nsolely on category-specific annotations, ProLab uses descriptive properties\ngrounded in common sense knowledge for supervising segmentation models. It is\nbased on two core designs. First, we employ Large Language Models (LLMs) and\ncarefully crafted prompts to generate descriptions of all involved categories\nthat carry meaningful common sense knowledge and follow a structured format.\nSecond, we introduce a description embedding model preserving semantic\ncorrelation across descriptions and then cluster them into a set of descriptive\nproperties (e.g., 256) using K-Means. These properties are based on\ninterpretable common sense knowledge consistent with theories of human\nrecognition. We empirically show that our approach makes segmentation models\nperform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal\nContext, Cityscapes, and BDD). Our method also shows better scalability with\nextended training steps than category-level supervision. Our interpretable\nsegmentation framework also emerges with the generalization ability to segment\nout-of-domain or unknown categories using only in-domain descriptive\nproperties. Code is available at https://github.com/lambert-x/ProLab.\n","authors":["Junfei Xiao","Ziqi Zhou","Wenxuan Li","Shiyi Lan","Jieru Mei","Zhiding Yu","Alan Yuille","Yuyin Zhou","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13764v1.pdf","comment":"Preprint. Code is available at https://github.com/lambert-x/ProLab"},{"id":"http://arxiv.org/abs/2205.02047v2","updated":"2023-12-21T11:30:54Z","published":"2022-05-04T13:13:52Z","title":"Hyperbolic Relevance Matching for Neural Keyphrase Extraction","summary":" Keyphrase extraction is a fundamental task in natural language processing and\ninformation retrieval that aims to extract a set of phrases with important\ninformation from a source document. Identifying important keyphrase is the\ncentral component of the keyphrase extraction task, and its main challenge is\nhow to represent information comprehensively and discriminate importance\naccurately. In this paper, to address these issues, we design a new hyperbolic\nmatching model (HyperMatch) to represent phrases and documents in the same\nhyperbolic space and explicitly estimate the phrase-document relevance via the\nPoincar\\'e distance as the important score of each phrase. Specifically, to\ncapture the hierarchical syntactic and semantic structure information,\nHyperMatch takes advantage of the hidden representations in multiple layers of\nRoBERTa and integrates them as the word embeddings via an adaptive mixing\nlayer. Meanwhile, considering the hierarchical structure hidden in the\ndocument, HyperMatch embeds both phrases and documents in the same hyperbolic\nspace via a hyperbolic phrase encoder and a hyperbolic document encoder. This\nstrategy can further enhance the estimation of phrase-document relevance due to\nthe good properties of hyperbolic space. In this setting, the keyphrase\nextraction can be taken as a matching problem and effectively implemented by\nminimizing a hyperbolic margin-based triplet loss. Extensive experiments are\nconducted on six benchmarks and demonstrate that HyperMatch outperforms the\nstate-of-the-art baselines.\n","authors":["Mingyang Song","Yi Feng","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2205.02047v2.pdf","comment":"12 pages, 3 figures, Accepted by NAACL2022"},{"id":"http://arxiv.org/abs/2110.09749v5","updated":"2023-12-21T10:56:50Z","published":"2021-10-19T05:48:22Z","title":"Importance Estimation from Multiple Perspectives for Keyphrase\n Extraction","summary":" Keyphrase extraction is a fundamental task in Natural Language Processing,\nwhich usually contains two main parts: candidate keyphrase extraction and\nkeyphrase importance estimation. From the view of human understanding\ndocuments, we typically measure the importance of phrase according to its\nsyntactic accuracy, information saliency, and concept consistency\nsimultaneously. However, most existing keyphrase extraction approaches only\nfocus on the part of them, which leads to biased results. In this paper, we\npropose a new approach to estimate the importance of keyphrase from multiple\nperspectives (called as \\textit{KIEMP}) and further improve the performance of\nkeyphrase extraction. Specifically, \\textit{KIEMP} estimates the importance of\nphrase with three modules: a chunking module to measure its syntactic accuracy,\na ranking module to check its information saliency, and a matching module to\njudge the concept (i.e., topic) consistency between phrase and the whole\ndocument. These three modules are seamlessly jointed together via an end-to-end\nmulti-task learning model, which is helpful for three parts to enhance each\nother and balance the effects of three perspectives. Experimental results on\nsix benchmark datasets show that \\textit{KIEMP} outperforms the existing\nstate-of-the-art keyphrase extraction approaches in most cases.\n","authors":["Mingyang Song","Liping Jing","Lin Xiao"],"pdf_url":"https://arxiv.org/pdf/2110.09749v5.pdf","comment":"11 pages, 2 figures, Accepted by EMNLP2021"},{"id":"http://arxiv.org/abs/2311.07919v2","updated":"2023-12-21T10:20:42Z","published":"2023-11-14T05:34:50Z","title":"Qwen-Audio: Advancing Universal Audio Understanding via Unified\n Large-Scale Audio-Language Models","summary":" Recently, instruction-following audio-language models have received broad\nattention for audio interaction with humans. However, the absence of\npre-trained audio models capable of handling diverse audio types and tasks has\nhindered progress in this field. Consequently, most existing works have only\nbeen able to support a limited range of interaction capabilities. In this\npaper, we develop the Qwen-Audio model and address this limitation by scaling\nup audio-language pre-training to cover over 30 tasks and various audio types,\nsuch as human speech, natural sounds, music, and songs, to facilitate universal\naudio understanding abilities. However, directly co-training all tasks and\ndatasets can lead to interference issues, as the textual labels associated with\ndifferent datasets exhibit considerable variations due to differences in task\nfocus, language, granularity of annotation, and text structure. To overcome the\none-to-many interference, we carefully design a multi-task training framework\nby conditioning on a sequence of hierarchical tags to the decoder for\nencouraging knowledge sharing and avoiding interference through shared and\nspecified tags respectively. Remarkably, Qwen-Audio achieves impressive\nperformance across diverse benchmark tasks without requiring any task-specific\nfine-tuning, surpassing its counterparts. Building upon the capabilities of\nQwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from\nvarious audios and text inputs, enabling multi-turn dialogues and supporting\nvarious audio-central scenarios.\n","authors":["Yunfei Chu","Jin Xu","Xiaohuan Zhou","Qian Yang","Shiliang Zhang","Zhijie Yan","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2311.07919v2.pdf","comment":"The code, checkpoints and demo are released at\n https://github.com/QwenLM/Qwen-Audio"},{"id":"http://arxiv.org/abs/2312.07069v2","updated":"2023-12-21T09:47:19Z","published":"2023-12-12T08:43:20Z","title":"Context Matters: Data-Efficient Augmentation of Large Language Models\n for Scientific Applications","summary":" In this paper, we explore the challenges inherent to Large Language Models\n(LLMs) like GPT-4, particularly their propensity for hallucinations, logic\nmistakes, and incorrect conclusions when tasked with answering complex\nquestions. The capacity of LLMs to present erroneous answers in a coherent and\nsemantically rigorous manner further complicates the detection of factual\ninaccuracies. This issue is especially pronounced in fields that require\nspecialized expertise. Our work delves into these challenges, aiming to enhance\nthe understanding and mitigation of such errors, thereby contributing to the\nimprovement of LLM accuracy and reliability in scientific and other specialized\ndomains. Our findings reveal a non-linear relationship between the context's\nrelevancy and the answers' measured quality. In addition, we demonstrate that\nwith the correct calibration, it is possible to automate the grading procedure\n-- a finding suggesting that, at least to some degree, the LLMs can be used to\nself-examine the quality of their own performance. Finally, we describe an\nexperimental platform that can be seen as a proof-of-concept of the techniques\ndescribed in this work.\n","authors":["Xiang Li","Haoran Tang","Siyu Chen","Ziwei Wang","Anurag Maravi","Marcin Abram"],"pdf_url":"https://arxiv.org/pdf/2312.07069v2.pdf","comment":"11 pages, 6 figures, 4 tables, 3 pages of supplementary material"},{"id":"http://arxiv.org/abs/2312.13694v1","updated":"2023-12-21T09:45:13Z","published":"2023-12-21T09:45:13Z","title":"Data Transformation to Construct a Dataset for Generating\n Entity-Relationship Model from Natural Language","summary":" In order to reduce the manual cost of designing ER models, recent approaches\nhave been proposed to address the task of NL2ERM, i.e., automatically\ngenerating entity-relationship (ER) models from natural language (NL)\nutterances such as software requirements. These approaches are typically\nrule-based ones, which rely on rigid heuristic rules; these approaches cannot\ngeneralize well to various linguistic ways of describing the same requirement.\nDespite having better generalization capability than rule-based approaches,\ndeep-learning-based models are lacking for NL2ERM due to lacking a large-scale\ndataset. To address this issue, in this paper, we report our insight that there\nexists a high similarity between the task of NL2ERM and the increasingly\npopular task of text-to-SQL, and propose a data transformation algorithm that\ntransforms the existing data of text-to-SQL into the data of NL2ERM. We apply\nour data transformation algorithm on Spider, one of the most popular\ntext-to-SQL datasets, and we also collect some data entries with different NL\ntypes, to obtain a large-scale NL2ERM dataset. Because NL2ERM can be seen as a\nspecial information extraction (IE) task, we train two state-of-the-art IE\nmodels on our dataset. The experimental results show that both the two models\nachieve high performance and outperform existing baselines.\n","authors":["Zhenwen Li","Jian-Guang Lou","Tao Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05203v2","updated":"2023-12-21T09:32:57Z","published":"2023-09-11T02:35:36Z","title":"From Artificially Real to Real: Leveraging Pseudo Data from Large\n Language Models for Low-Resource Molecule Discovery","summary":" Molecule discovery serves as a cornerstone in numerous scientific domains,\nfueling the development of new materials and innovative drug designs. Recent\ndevelopments of in-silico molecule discovery have highlighted the promising\nresults of cross-modal techniques, which bridge molecular structures with their\ndescriptive annotations. However, these cross-modal methods frequently\nencounter the issue of data scarcity, hampering their performance and\napplication. In this paper, we address the low-resource challenge by utilizing\nartificially-real data generated by Large Language Models (LLMs). We first\nintroduce a retrieval-based prompting strategy to construct high-quality pseudo\ndata, then explore the optimal method to effectively leverage this pseudo data.\nExperiments show that using pseudo data for domain adaptation outperforms all\nexisting methods, while also requiring a smaller model scale, reduced data size\nand lower training cost, highlighting its efficiency. Furthermore, our method\nshows a sustained improvement as the volume of pseudo data increases, revealing\nthe great potential of pseudo data in advancing low-resource cross-modal\nmolecule discovery. Our code and data are available at\nhttps://github.com/SCIR-HI/ArtificiallyR2R.\n","authors":["Yuhan Chen","Nuwa Xi","Yanrui Du","Haochun Wang","Chen Jianyu","Sendong Zhao","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2309.05203v2.pdf","comment":"Accepted to AAAI2024"},{"id":"http://arxiv.org/abs/2312.13671v1","updated":"2023-12-21T08:50:41Z","published":"2023-12-21T08:50:41Z","title":"Text2Analysis: A Benchmark of Table Question Answering with Advanced\n Data Analysis and Unclear Queries","summary":" Tabular data analysis is crucial in various fields, and large language models\nshow promise in this area. However, current research mostly focuses on\nrudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like\nforecasting and chart generation. To address this gap, we developed the\nText2Analysis benchmark, incorporating advanced analysis tasks that go beyond\nthe SQL-compatible operations and require more in-depth analysis. We also\ndevelop five innovative and effective annotation methods, harnessing the\ncapabilities of large language models to enhance data quality and quantity.\nAdditionally, we include unclear queries that resemble real-world user\nquestions to test how well models can understand and tackle such challenges.\nFinally, we collect 2249 query-result pairs with 347 tables. We evaluate five\nstate-of-the-art models using three different metrics and the results show that\nour benchmark presents introduces considerable challenge in the field of\ntabular data analysis, paving the way for more advanced research opportunities.\n","authors":["Xinyi He","Mengyu Zhou","Xinrun Xu","Xiaojun Ma","Rui Ding","Lun Du","Yan Gao","Ran Jia","Xu Chen","Shi Han","Zejian Yuan","Dongmei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13671v1.pdf","comment":"Accepted by AAAI'2024"},{"id":"http://arxiv.org/abs/2309.08173v2","updated":"2023-12-21T08:47:33Z","published":"2023-09-15T05:45:44Z","title":"FedJudge: Federated Legal Large Language Model","summary":" Large Language Models (LLMs) have gained prominence in the field of Legal\nIntelligence, offering potential applications in assisting legal professionals\nand laymen. However, the centralized training of these Legal LLMs raises data\nprivacy concerns, as legal data is distributed among various institutions\ncontaining sensitive individual information. This paper addresses this\nchallenge by exploring the integration of Legal LLMs with Federated Learning\n(FL) methodologies. By employing FL, Legal LLMs can be fine-tuned locally on\ndevices or clients, and their parameters are aggregated and distributed on a\ncentral server, ensuring data privacy without directly sharing raw data.\nHowever, computation and communication overheads hinder the full fine-tuning of\nLLMs under the FL setting. Moreover, the distribution shift of legal data\nreduces the effectiveness of FL methods. To this end, in this paper, we propose\nthe first Federated Legal Large Language Model (FedJudge) framework, which\nfine-tunes Legal LLMs efficiently and effectively. Specifically, FedJudge\nutilizes parameter-efficient fine-tuning methods to update only a few\nadditional parameters during the FL training. Besides, we explore the continual\nlearning methods to preserve the global model's important parameters when\ntraining local clients to mitigate the problem of data shifts. Extensive\nexperimental results on three real-world datasets clearly validate the\neffectiveness of FedJudge. Code is released at\nhttps://github.com/yuelinan/FedJudge.\n","authors":["Linan Yue","Qi Liu","Yichao Du","Weibo Gao","Ye Liu","Fangzhou Yao"],"pdf_url":"https://arxiv.org/pdf/2309.08173v2.pdf","comment":"Submitted to DASFAA 2024"},{"id":"http://arxiv.org/abs/2312.13655v1","updated":"2023-12-21T08:29:41Z","published":"2023-12-21T08:29:41Z","title":"Compositional Zero-Shot Learning for Attribute-Based Object Reference in\n Human-Robot Interaction","summary":" Language-enabled robots have been widely studied over the past years to\nenable natural human-robot interaction and teaming in various real-world\napplications. Language-enabled robots must be able to comprehend referring\nexpressions to identify a particular object from visual perception using a set\nof referring attributes extracted from natural language. However, visual\nobservations of an object may not be available when it is referred to, and the\nnumber of objects and attributes may also be unbounded in open worlds. To\naddress the challenges, we implement an attribute-based compositional zero-shot\nlearning method that uses a list of attributes to perform referring expression\ncomprehension in open worlds. We evaluate the approach on two datasets\nincluding the MIT-States and the Clothing 16K. The preliminary experimental\nresults show that our implemented approach allows a robot to correctly identify\nthe objects referred to by human commands.\n","authors":["Peng Gao","Ahmed Jaafar","Brian Reily","Christopher Reardon","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13655v1.pdf","comment":"Equal contribution from the first two authors"},{"id":"http://arxiv.org/abs/2307.05722v2","updated":"2023-12-21T08:20:40Z","published":"2023-07-10T11:29:41Z","title":"Exploring Large Language Model for Graph Data Understanding in Online\n Job Recommendations","summary":" Large Language Models (LLMs) have revolutionized natural language processing\ntasks, demonstrating their exceptional capabilities in various domains.\nHowever, their potential for behavior graph understanding in job\nrecommendations remains largely unexplored. This paper focuses on unveiling the\ncapability of large language models in understanding behavior graphs and\nleveraging this understanding to enhance recommendations in online recruitment,\nincluding the promotion of out-of-distribution (OOD) application. We present a\nnovel framework that harnesses the rich contextual information and semantic\nrepresentations provided by large language models to analyze behavior graphs\nand uncover underlying patterns and relationships. Specifically, we propose a\nmeta-path prompt constructor that leverages LLM recommender to understand\nbehavior graphs for the first time and design a corresponding path augmentation\nmodule to alleviate the prompt bias introduced by path-based sequence input. By\nleveraging this capability, our framework enables personalized and accurate job\nrecommendations for individual users. We evaluate the effectiveness of our\napproach on a comprehensive dataset and demonstrate its ability to improve the\nrelevance and quality of recommended quality. This research not only sheds\nlight on the untapped potential of large language models but also provides\nvaluable insights for developing advanced recommendation systems in the\nrecruitment market. The findings contribute to the growing field of natural\nlanguage processing and offer practical implications for enhancing job search\nexperiences. We release the code at https://github.com/WLiK/GLRec.\n","authors":["Likang Wu","Zhaopeng Qiu","Zhi Zheng","Hengshu Zhu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2307.05722v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12999v2","updated":"2023-12-21T07:45:43Z","published":"2023-12-20T12:59:31Z","title":"Machine Mindset: An MBTI Exploration of Large Language Models","summary":" We present a novel approach for integrating Myers-Briggs Type Indicator\n(MBTI) personality traits into large language models (LLMs), addressing the\nchallenges of personality consistency in personalized AI. Our method, \"Machine\nMindset,\" involves a two-phase fine-tuning and Direct Preference Optimization\n(DPO) to embed MBTI traits into LLMs. This approach ensures that models\ninternalize these traits, offering a stable and consistent personality profile.\nWe demonstrate the effectiveness of our models across various domains, showing\nalignment between model performance and their respective MBTI traits. The paper\nhighlights significant contributions in the development of personality datasets\nand a new training methodology for personality integration in LLMs, enhancing\nthe potential for personalized AI applications. We also open-sourced our model\nand part of the data at \\url{https://github.com/PKU-YuanGroup/Machine-Mindset}.\n","authors":["Jiaxi Cui","Liuzhenghao Lv","Jing Wen","Jing Tang","YongHong Tian","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.12999v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10799v2","updated":"2023-12-21T07:38:56Z","published":"2023-07-20T12:01:40Z","title":"Layer-wise Representation Fusion for Compositional Generalization","summary":" Existing neural models are demonstrated to struggle with compositional\ngeneralization (CG), i.e., the ability to systematically generalize to unseen\ncompositions of seen components. A key reason for failure on CG is that the\nsyntactic and semantic representations of sequences in both the uppermost layer\nof the encoder and decoder are entangled. However, previous work concentrates\non separating the learning of syntax and semantics instead of exploring the\nreasons behind the representation entanglement (RE) problem to solve it. We\nexplain why it exists by analyzing the representation evolving mechanism from\nthe bottom to the top of the Transformer layers. We find that the ``shallow''\nresidual connections within each layer fail to fuse previous layers'\ninformation effectively, leading to information forgetting between layers and\nfurther the RE problems. Inspired by this, we propose LRF, a novel\n\\textbf{L}ayer-wise \\textbf{R}epresentation \\textbf{F}usion framework for CG,\nwhich learns to fuse previous layers' information back into the encoding and\ndecoding process effectively through introducing a \\emph{fuse-attention module}\nat each encoder and decoder layer. LRF achieves promising results on two\nrealistic benchmarks, empirically demonstrating the effectiveness of our\nproposal.\n","authors":["Yafang Zheng","Lei Lin","Shuangtao Li","Yuxuan Yuan","Zhaohong Lai","Shan Liu","Biao Fu","Yidong Chen","Xiaodong Shi"],"pdf_url":"https://arxiv.org/pdf/2307.10799v2.pdf","comment":"accepted by aaai24. arXiv admin note: substantial text overlap with\n arXiv:2305.12169"},{"id":"http://arxiv.org/abs/2303.02846v3","updated":"2023-12-21T07:35:18Z","published":"2023-03-06T02:52:37Z","title":"Contrastive variational information bottleneck for aspect-based\n sentiment analysis","summary":" Deep learning techniques have dominated the literature on aspect-based\nsentiment analysis (ABSA), achieving state-of-the-art performance. However,\ndeep models generally suffer from spurious correlations between input features\nand output labels, which hurts the robustness and generalization capability by\na large margin. In this paper, we propose to reduce spurious correlations for\nABSA, via a novel Contrastive Variational Information Bottleneck framework\n(called CVIB). The proposed CVIB framework is composed of an original network\nand a self-pruned network, and these two networks are optimized simultaneously\nvia contrastive learning. Concretely, we employ the Variational Information\nBottleneck (VIB) principle to learn an informative and compressed network\n(self-pruned network) from the original network, which discards the superfluous\npatterns or spurious correlations between input features and prediction labels.\nThen, self-pruning contrastive learning is devised to pull together\nsemantically similar positive pairs and push away dissimilar pairs, where the\nrepresentations of the anchor learned by the original and self-pruned networks\nrespectively are regarded as a positive pair while the representations of two\ndifferent sentences within a mini-batch are treated as a negative pair. To\nverify the effectiveness of our CVIB method, we conduct extensive experiments\non five benchmark ABSA datasets and the experimental results show that our\napproach achieves better performance than the strong competitors in terms of\noverall prediction performance, robustness, and generalization. Code and data\nto reproduce the results in this paper is available at:\nhttps://github.com/shesshan/CVIB.\n","authors":["Mingshan Chang","Min Yang","Qingshan Jiang","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2303.02846v3.pdf","comment":"Accepted by Knowledge-Based Systems (KBS)"},{"id":"http://arxiv.org/abs/2308.14034v2","updated":"2023-12-21T07:30:31Z","published":"2023-08-27T07:53:00Z","title":"Confucius: Iterative Tool Learning from Introspection Feedback by\n Easy-to-Difficult Curriculum","summary":" Augmenting large language models (LLMs) with external tools has emerged as a\npromising approach to extending the capability of LLMs. Although some works\nemploy open-source LLMs for the tool learning task, most of them are trained in\na controlled environment in which LLMs only learn to execute the human-provided\ntools. However, selecting proper tools from the large toolset is also a crucial\nability for the tool learning model to be applied in real-world applications.\nExisting methods usually directly employ self-instruction methods to train the\nmodel, which ignores differences in tool complexity. In this paper, we propose\nthe Confucius, a novel tool learning framework to train LLM to use complicated\ntools in real-world scenarios, which contains two main phases: (1) We first\npropose a multi-stage learning method to teach the LLM to use various tools\nfrom an easy-to-difficult curriculum; (2) thenceforth, we propose the Iterative\nSelf-instruct from Introspective Feedback (ISIF) to dynamically construct the\ndataset to improve the ability to use the complicated tool. Extensive\nexperiments conducted on both controlled and real-world settings demonstrate\nthe superiority of our tool learning framework in the real-world application\nscenarios compared to both tuning-free (e.g. ChatGPT, Claude) and tuning-based\nbaselines (e.g. GPT4Tools).\n","authors":["Shen Gao","Zhengliang Shi","Minghang Zhu","Bowen Fang","Xin Xin","Pengjie Ren","Zhumin Chen","Jun Ma","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2308.14034v2.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13614v1","updated":"2023-12-21T07:03:15Z","published":"2023-12-21T07:03:15Z","title":"Structure-Aware Path Inference for Neural Finite State Transducers","summary":" Neural finite-state transducers (NFSTs) form an expressive family of\nneurosymbolic sequence transduction models. An NFST models each string pair as\nhaving been generated by a latent path in a finite-state transducer. As they\nare deep generative models, both training and inference of NFSTs require\ninference networks that approximate posterior distributions over such latent\nvariables. In this paper, we focus on the resulting challenge of imputing the\nlatent alignment path that explains a given pair of input and output strings\n(e.g., during training). We train three autoregressive approximate models for\namortized inference of the path, which can then be used as proposal\ndistributions for importance sampling. All three models perform lookahead. Our\nmost sophisticated (and novel) model leverages the FST structure to consider\nthe graph of future paths; unfortunately, we find that it loses out to the\nsimpler approaches -- except on an artificial task that we concocted to confuse\nthe simpler approaches.\n","authors":["Weiting Tan","Chu-cheng Lin","Jason Eisner"],"pdf_url":"https://arxiv.org/pdf/2312.13614v1.pdf","comment":"In Proceedings of ICBINB Workshop at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13608v1","updated":"2023-12-21T06:51:34Z","published":"2023-12-21T06:51:34Z","title":"Argue with Me Tersely: Towards Sentence-Level Counter-Argument\n Generation","summary":" Counter-argument generation -- a captivating area in computational\nlinguistics -- seeks to craft statements that offer opposing views. While most\nresearch has ventured into paragraph-level generation, sentence-level\ncounter-argument generation beckons with its unique constraints and\nbrevity-focused challenges. Furthermore, the diverse nature of\ncounter-arguments poses challenges for evaluating model performance solely\nbased on n-gram-based metrics. In this paper, we present the ArgTersely\nbenchmark for sentence-level counter-argument generation, drawing from a\nmanually annotated dataset from the ChangeMyView debate forum. We also propose\nArg-LlaMA for generating high-quality counter-argument. For better evaluation,\nwe trained a BERT-based evaluator Arg-Judge with human preference data. We\nconducted comparative experiments involving various baselines such as LlaMA,\nAlpaca, GPT-3, and others. The results show the competitiveness of our proposed\nframework and evaluator in counter-argument generation tasks. Code and data are\navailable at https://github.com/amazingljy1206/ArgTersely.\n","authors":["Jiayu Lin","Rong Ye","Meng Han","Qi Zhang","Ruofei Lai","Xinyu Zhang","Zhao Cao","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2312.13608v1.pdf","comment":"EMNLP2023, main conference"},{"id":"http://arxiv.org/abs/2303.17564v3","updated":"2023-12-21T06:21:11Z","published":"2023-03-30T17:30:36Z","title":"BloombergGPT: A Large Language Model for Finance","summary":" The use of NLP in the realm of financial technology is broad and complex,\nwith applications ranging from sentiment analysis and named entity recognition\nto question answering. Large Language Models (LLMs) have been shown to be\neffective on a variety of tasks; however, no LLM specialized for the financial\ndomain has been reported in literature. In this work, we present BloombergGPT,\na 50 billion parameter language model that is trained on a wide range of\nfinancial data. We construct a 363 billion token dataset based on Bloomberg's\nextensive data sources, perhaps the largest domain-specific dataset yet,\naugmented with 345 billion tokens from general purpose datasets. We validate\nBloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite\nof internal benchmarks that most accurately reflect our intended usage. Our\nmixed dataset training leads to a model that outperforms existing models on\nfinancial tasks by significant margins without sacrificing performance on\ngeneral LLM benchmarks. Additionally, we explain our modeling choices, training\nprocess, and evaluation methodology. We release Training Chronicles (Appendix\nC) detailing our experience in training BloombergGPT.\n","authors":["Shijie Wu","Ozan Irsoy","Steven Lu","Vadim Dabravolski","Mark Dredze","Sebastian Gehrmann","Prabhanjan Kambadur","David Rosenberg","Gideon Mann"],"pdf_url":"https://arxiv.org/pdf/2303.17564v3.pdf","comment":"Updated to include Training Chronicles (Appendix C)"},{"id":"http://arxiv.org/abs/2312.13594v1","updated":"2023-12-21T05:51:55Z","published":"2023-12-21T05:51:55Z","title":"Towards More Faithful Natural Language Explanation Using Multi-Level\n Contrastive Learning in VQA","summary":" Natural language explanation in visual question answer (VQA-NLE) aims to\nexplain the decision-making process of models by generating natural language\nsentences to increase users' trust in the black-box systems. Existing post-hoc\nmethods have achieved significant progress in obtaining a plausible\nexplanation. However, such post-hoc explanations are not always aligned with\nhuman logical inference, suffering from the issues on: 1) Deductive\nunsatisfiability, the generated explanations do not logically lead to the\nanswer; 2) Factual inconsistency, the model falsifies its counterfactual\nexplanation for answers without considering the facts in images; and 3)\nSemantic perturbation insensitivity, the model can not recognize the semantic\nchanges caused by small perturbations. These problems reduce the faithfulness\nof explanations generated by models. To address the above issues, we propose a\nnovel self-supervised \\textbf{M}ulti-level \\textbf{C}ontrastive\n\\textbf{L}earning based natural language \\textbf{E}xplanation model (MCLE) for\nVQA with semantic-level, image-level, and instance-level factual and\ncounterfactual samples. MCLE extracts discriminative features and aligns the\nfeature spaces from explanations with visual question and answer to generate\nmore consistent explanations. We conduct extensive experiments, ablation\nanalysis, and case study to demonstrate the effectiveness of our method on two\nVQA-NLE benchmarks.\n","authors":["Chengen Lai","Shengli Song","Shiqi Meng","Jingyang Li","Sitong Yan","Guangneng Hu"],"pdf_url":"https://arxiv.org/pdf/2312.13594v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13585v1","updated":"2023-12-21T05:32:49Z","published":"2023-12-21T05:32:49Z","title":"Speech Translation with Large Language Models: An Industrial Practice","summary":" Given the great success of large language models (LLMs) across various tasks,\nin this paper, we introduce LLM-ST, a novel and effective speech translation\nmodel constructed upon a pre-trained LLM. By integrating the large language\nmodel (LLM) with a speech encoder and employing multi-task instruction tuning,\nLLM-ST can produce accurate timestamped transcriptions and translations, even\nfrom long audio inputs. Furthermore, our findings indicate that the\nimplementation of Chain-of-Thought (CoT) prompting can yield advantages in the\ncontext of LLM-ST. Through rigorous experimentation on English and Chinese\ndatasets, we showcase the exceptional performance of LLM-ST, establishing a new\nbenchmark in the field of speech translation. Demo:\nhttps://speechtranslation.github.io/llm-st/.\n","authors":["Zhichao Huang","Rong Ye","Tom Ko","Qianqian Dong","Shanbo Cheng","Mingxuan Wang","Hang Li"],"pdf_url":"https://arxiv.org/pdf/2312.13585v1.pdf","comment":"Technical report. 13 pages. Demo:\n https://speechtranslation.github.io/llm-st/"},{"id":"http://arxiv.org/abs/2312.12655v2","updated":"2023-12-21T04:29:24Z","published":"2023-12-19T22:57:13Z","title":"Can Transformers Learn Sequential Function Classes In Context?","summary":" In-context learning (ICL) has revolutionized the capabilities of transformer\nmodels in NLP. In our project, we extend the understanding of the mechanisms\nunderpinning ICL by exploring whether transformers can learn from sequential,\nnon-textual function class data distributions. We introduce a novel sliding\nwindow sequential function class and employ toy-sized transformers with a GPT-2\narchitecture to conduct our experiments. Our analysis indicates that these\nmodels can indeed leverage ICL when trained on non-textual sequential function\nclasses. Additionally, our experiments with randomized y-label sequences\nhighlights that transformers retain some ICL capabilities even when the label\nassociations are obfuscated. We provide evidence that transformers can reason\nwith and understand sequentiality encoded within function classes, as reflected\nby the effective learning of our proposed tasks. Our results also show that the\nperformance deteriorated with increasing randomness in the labels, though not\nto the extent one might expect, implying a potential robustness of learned\nsequentiality against label noise. Future research may want to look into how\nprevious explanations of transformers, such as induction heads and task\nvectors, relate to sequentiality in ICL in these toy examples. Our\ninvestigation lays the groundwork for further research into how transformers\nprocess and perceive sequential data.\n","authors":["Ryan Campbell","Emma Guo","Evan Hu","Reya Vir","Ethan Hsiao"],"pdf_url":"https://arxiv.org/pdf/2312.12655v2.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2308.10045v2","updated":"2023-12-21T04:01:11Z","published":"2023-08-19T15:08:10Z","title":"An Empirical Study of CLIP for Text-based Person Search","summary":" Text-based Person Search (TBPS) aims to retrieve the person images using\nnatural language descriptions. Recently, Contrastive Language Image Pretraining\n(CLIP), a universal large cross-modal vision-language pre-training model, has\nremarkably performed over various cross-modal downstream tasks due to its\npowerful cross-modal semantic learning capacity. TPBS, as a fine-grained\ncross-modal retrieval task, is also facing the rise of research on the\nCLIP-based TBPS. In order to explore the potential of the visual-language\npre-training model for downstream TBPS tasks, this paper makes the first\nattempt to conduct a comprehensive empirical study of CLIP for TBPS and thus\ncontribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the\nTBPS community. We revisit critical design considerations under CLIP, including\ndata augmentation and loss function. The model, with the aforementioned designs\nand practical training tricks, can attain satisfactory performance without any\nsophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in\nmodel generalization and model compression, demonstrating the effectiveness of\nTBPS-CLIP from various aspects. This work is expected to provide empirical\ninsights and highlight future CLIP-based TBPS research.\n","authors":["Min Cao","Yang Bai","Ziyin Zeng","Mang Ye","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.10045v2.pdf","comment":"Accepted by AAAI 2024. Code is available at\n https://github.com/Flame-Chasers/TBPS-CLIP"},{"id":"http://arxiv.org/abs/2312.13558v1","updated":"2023-12-21T03:51:08Z","published":"2023-12-21T03:51:08Z","title":"The Truth is in There: Improving Reasoning in Language Models with\n Layer-Selective Rank Reduction","summary":" Transformer-based Large Language Models (LLMs) have become a fixture in\nmodern machine learning. Correspondingly, significant resources are allocated\ntowards research that aims to further advance this technology, typically\nresulting in models of increasing size that are trained on increasing amounts\nof data. This work, however, demonstrates the surprising result that it is\noften possible to significantly improve the performance of LLMs by selectively\nremoving higher-order components of their weight matrices. This simple\nintervention, which we call LAyer-SElective Rank reduction (LASER), can be done\non a model after training has completed, and requires no additional parameters\nor data. We show extensive experiments demonstrating the generality of this\nfinding across language models and datasets, and provide in-depth analyses\noffering insights into both when LASER is effective and the mechanism by which\nit operates.\n","authors":["Pratyusha Sharma","Jordan T. Ash","Dipendra Misra"],"pdf_url":"https://arxiv.org/pdf/2312.13558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13547v1","updated":"2023-12-21T03:11:30Z","published":"2023-12-21T03:11:30Z","title":"How to Prune Your Language Model: Recovering Accuracy on the \"Sparsity\n May Cry'' Benchmark","summary":" Pruning large language models (LLMs) from the BERT family has emerged as a\nstandard compression benchmark, and several pruning methods have been proposed\nfor this task. The recent ``Sparsity May Cry'' (SMC) benchmark put into\nquestion the validity of all existing methods, exhibiting a more complex setup\nwhere many known pruning methods appear to fail. We revisit the question of\naccurate BERT-pruning during fine-tuning on downstream datasets, and propose a\nset of general guidelines for successful pruning, even on the challenging SMC\nbenchmark. First, we perform a cost-vs-benefits analysis of pruning model\ncomponents, such as the embeddings and the classification head; second, we\nprovide a simple-yet-general way of scaling training, sparsification and\nlearning rate schedules relative to the desired target sparsity; finally, we\ninvestigate the importance of proper parametrization for Knowledge Distillation\nin the context of LLMs. Our simple insights lead to state-of-the-art results,\nboth on classic BERT-pruning benchmarks, as well as on the SMC benchmark,\nshowing that even classic gradual magnitude pruning (GMP) can yield competitive\nresults, with the right approach.\n","authors":["Eldar Kurtic","Torsten Hoefler","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2312.13547v1.pdf","comment":"Accepted as oral to CPAL 2024"},{"id":"http://arxiv.org/abs/2312.13545v1","updated":"2023-12-21T03:09:38Z","published":"2023-12-21T03:09:38Z","title":"Developing Interactive Tourism Planning: A Dialogue Robot System Powered\n by a Large Language Mode","summary":" In recent years, large language models (LLMs) have rapidly proliferated and\nhave been utilized in various tasks, including research in dialogue systems. We\naimed to construct a system that not only leverages the flexible conversational\nabilities of LLMs but also their advanced planning capabilities to reduce the\nspeaking load on human interlocutors and efficiently plan trips. Furthermore,\nwe propose a method that divides the complex task of a travel agency into\nmultiple subtasks, managing each as a separate phase to effectively accomplish\nthe task. Our proposed system confirmed a certain level of success by achieving\nfourth place in the Dialogue Robot Competition 2023 preliminaries rounds. We\nreport on the challenges identified through the competition.\n","authors":["Katsumasa Yoshikawa","Takato Yamazaki","Masaya Ohagi","Tomoya Mizumoto","Keiya Sato"],"pdf_url":"https://arxiv.org/pdf/2312.13545v1.pdf","comment":"This paper is part of the proceedings of the Dialogue Robot\n Competition 2023"},{"id":"http://arxiv.org/abs/2312.12464v2","updated":"2023-12-21T02:43:26Z","published":"2023-12-18T21:11:17Z","title":"Towards Better Serialization of Tabular Data for Few-shot Classification\n with Large Language Models","summary":" We present a study on the integration of Large Language Models (LLMs) in\ntabular data classification, emphasizing an efficient framework. Building upon\nexisting work done in TabLLM (arXiv:2210.10723), we introduce three novel\nserialization techniques, including the standout LaTeX serialization method.\nThis method significantly boosts the performance of LLMs in processing\ndomain-specific datasets, Our method stands out for its memory efficiency and\nability to fully utilize complex data structures. Through extensive\nexperimentation, including various serialization approaches like feature\ncombination and importance, we demonstrate our work's superiority in accuracy\nand efficiency over traditional models.\n","authors":["Sukriti Jaitly","Tanay Shah","Ashish Shugani","Razik Singh Grewal"],"pdf_url":"https://arxiv.org/pdf/2312.12464v2.pdf","comment":"4 pages, 2 figures"},{"id":"http://arxiv.org/abs/2312.13533v1","updated":"2023-12-21T02:28:29Z","published":"2023-12-21T02:28:29Z","title":"Automated Clinical Coding for Outpatient Departments","summary":" Computerised clinical coding approaches aim to automate the process of\nassigning a set of codes to medical records. While there is active research\npushing the state of the art on clinical coding for hospitalized patients, the\noutpatient setting -- where doctors tend to non-hospitalised patients -- is\noverlooked. Although both settings can be formalised as a multi-label\nclassification task, they present unique and distinct challenges, which raises\nthe question of whether the success of inpatient clinical coding approaches\ntranslates to the outpatient setting. This paper is the first to investigate\nhow well state-of-the-art deep learning-based clinical coding approaches work\nin the outpatient setting at hospital scale. To this end, we collect a large\noutpatient dataset comprising over 7 million notes documenting over half a\nmillion patients. We adapt four state-of-the-art clinical coding approaches to\nthis setting and evaluate their potential to assist coders. We find evidence\nthat clinical coding in outpatient settings can benefit from more innovations\nin popular inpatient coding benchmarks. A deeper analysis of the factors\ncontributing to the success -- amount and form of data and choice of document\nrepresentation -- reveals the presence of easy-to-solve examples, the coding of\nwhich can be completely automated with a low error rate.\n","authors":["Viktor Schlegel","Abhinav Ramesh Kashyap","Thanh-Tung Nguyen","Tsung-Han Yang","Vijay Prakash Dwivedi","Wei-Hsian Yin","Jeng Wei","Stefan Winkle"],"pdf_url":"https://arxiv.org/pdf/2312.13533v1.pdf","comment":"9 pages, preprint under review"},{"id":"http://arxiv.org/abs/2312.12918v2","updated":"2023-12-21T02:09:52Z","published":"2023-12-20T10:53:53Z","title":"Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors","summary":" To combat the potential misuse of Natural Language Generation (NLG)\ntechnology, a variety of algorithms have been developed for the detection of\nAI-generated texts. Traditionally, this task is treated as a binary\nclassification problem. Although supervised learning has demonstrated promising\nresults, acquiring labeled data for detection purposes poses real-world\nchallenges and the risk of overfitting. In an effort to address these issues,\nwe delve into the realm of zero-shot machine-generated text detection. Existing\nzero-shot detectors, typically designed for specific tasks or topics, often\nassume uniform testing scenarios, limiting their practicality. In our research,\nwe explore various advanced Large Language Models (LLMs) and their specialized\nvariants, contributing to this field in several ways. In empirical studies, we\nuncover a significant correlation between topics and detection performance.\nSecondly, we delve into the influence of topic shifts on zero-shot detectors.\nThese investigations shed light on the adaptability and robustness of these\ndetection methods across diverse topics. The code is available at\n\\url{https://github.com/yfzhang114/robustness-detection}.\n","authors":["Yi-Fan Zhang","Zhang Zhang","Liang Wang","Tieniu Tan","Rong Jin"],"pdf_url":"https://arxiv.org/pdf/2312.12918v2.pdf","comment":"8 pages, 3 figures, AAAI 2024 Workshop on Responsible Language Models"},{"id":"http://arxiv.org/abs/2312.01057v2","updated":"2023-12-21T01:30:38Z","published":"2023-12-02T08:04:29Z","title":"RLHF and IIA: Perverse Incentives","summary":" Existing algorithms for reinforcement learning from human feedback (RLHF) can\nincentivize responses at odds with preferences because they are based on models\nthat assume independence of irrelevant alternatives (IIA). The perverse\nincentives induced by IIA give rise to egregious behavior when innovating on\nquery formats or learning algorithms.\n","authors":["Wanqiao Xu","Shi Dong","Xiuyuan Lu","Grace Lam","Zheng Wen","Benjamin Van Roy"],"pdf_url":"https://arxiv.org/pdf/2312.01057v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06762v3","updated":"2023-12-21T00:18:48Z","published":"2023-04-13T18:04:19Z","title":"Shall We Pretrain Autoregressive Language Models with Retrieval? A\n Comprehensive Study","summary":" Large decoder-only language models (LMs) can be largely improved in terms of\nperplexity by retrieval (e.g., RETRO), but its impact on text generation\nquality and downstream task accuracy is unclear. Thus, it is still an open\nquestion: shall we pretrain large autoregressive LMs with retrieval? To answer\nit, we perform a comprehensive study on a scalable pre-trained\nretrieval-augmented LM (i.e., RETRO) compared with standard GPT and\nretrieval-augmented GPT incorporated at fine-tuning or inference stages. We\nfirst provide the recipe to reproduce RETRO up to 9.5B parameters while\nretrieving a text corpus with 330B tokens. Based on that, we have the following\nnovel findings: i) RETRO outperforms GPT on text generation with much less\ndegeneration (i.e., repetition), moderately higher factual accuracy, and\nslightly lower toxicity with a nontoxic retrieval database. ii) On the LM\nEvaluation Harness benchmark, RETRO largely outperforms GPT on\nknowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,\nwe introduce a simple variant of the model, RETRO++, which largely improves\nopen-domain QA results of original RETRO (e.g., EM score +8.6 on Natural\nQuestion) and significantly outperforms retrieval-augmented GPT in both\nfine-tuning and zero-shot evaluation settings. Our findings highlight the\npromising direction of pretraining autoregressive LMs with retrieval as future\nfoundation models. We release our code and model at:\nhttps://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md\n","authors":["Boxin Wang","Wei Ping","Peng Xu","Lawrence McAfee","Zihan Liu","Mohammad Shoeybi","Yi Dong","Oleksii Kuchaiev","Bo Li","Chaowei Xiao","Anima Anandkumar","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2304.06762v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.13495v1","updated":"2023-12-21T00:16:21Z","published":"2023-12-21T00:16:21Z","title":"Decoupling Representation and Knowledge for Few-Shot Intent\n Classification and Slot Filling","summary":" Few-shot intent classification and slot filling are important but challenging\ntasks due to the scarcity of finely labeled data. Therefore, current works\nfirst train a model on source domains with sufficiently labeled data, and then\ntransfer the model to target domains where only rarely labeled data is\navailable. However, experience transferring as a whole usually suffers from\ngaps that exist among source domains and target domains. For instance,\ntransferring domain-specific-knowledge-related experience is difficult. To\ntackle this problem, we propose a new method that explicitly decouples the\ntransferring of general-semantic-representation-related experience and the\ndomain-specific-knowledge-related experience. Specifically, for\ndomain-specific-knowledge-related experience, we design two modules to capture\nintent-slot relation and slot-slot relation respectively. Extensive experiments\non Snips and FewJoint datasets show that our method achieves state-of-the-art\nperformance. The method improves the joint accuracy metric from 27.72% to\n42.20% in the 1-shot setting, and from 46.54% to 60.79% in the 5-shot setting.\n","authors":["Jie Han","Yixiong Zou","Haozhao Wang","Jun Wang","Wei Liu","Yao Wu","Tao Zhang","Ruixuan Li"],"pdf_url":"https://arxiv.org/pdf/2312.13495v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.14335v1","updated":"2023-12-21T23:42:13Z","published":"2023-12-21T23:42:13Z","title":"Context-aware Decoding Reduces Hallucination in Query-focused\n Summarization","summary":" Query-focused summarization (QFS) aims to provide a summary of a single\ndocument/multi documents that can satisfy the information needs of a given\nquery. It is useful for various real-world applications, such as abstractive\nsnippet generation or more recent retrieval augmented generation (RAG). A\nprototypical QFS pipeline consists of a retriever (sparse or dense retrieval)\nand a generator (usually a large language model). However, applying large\nlanguage models (LLM) potentially leads to hallucinations, especially when the\nevidence contradicts the prior belief of LLMs. There has been growing interest\nin developing new decoding methods to improve generation quality and reduce\nhallucination. In this work, we conduct a large-scale reproducibility on one\nrecently proposed decoding method -- Context-aware Decoding (CAD). In addition\nto replicating CAD's experiments on news summarization datasets, we include\nexperiments on QFS datasets, and conduct more rigorous analysis on\ncomputational complexity and hyperparameter sensitivity. Experiments with eight\ndifferent language models show that performance-wise, CAD improves QFS quality\nby (1) reducing factuality errors/hallucinations while (2) mostly retaining the\nmatch of lexical patterns, measured by ROUGE scores, while also at a cost of\nincreased inference-time FLOPs and reduced decoding speed. The code\nimplementation based on Huggingface Library is made available\nhttps://github.com/zhichaoxu-shufe/context-aware-decoding-qfs\n","authors":["Zhichao Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14335v1.pdf","comment":"technical report"},{"id":"http://arxiv.org/abs/2312.14327v1","updated":"2023-12-21T22:52:44Z","published":"2023-12-21T22:52:44Z","title":"Parameter Efficient Tuning Allows Scalable Personalization of LLMs for\n Text Entry: A Case Study on Abbreviation Expansion","summary":" Abbreviation expansion is a strategy used to speed up communication by\nlimiting the amount of typing and using a language model to suggest expansions.\nHere we look at personalizing a Large Language Model's (LLM) suggestions based\non prior conversations to enhance the relevance of predictions, particularly\nwhen the user data is small (~1000 samples). Specifically, we compare\nfine-tuning, prompt-tuning, and retrieval augmented generation of expanded text\nsuggestions for abbreviated inputs. Our case study with a deployed 8B parameter\nLLM on a real user living with ALS, and experiments on movie character\npersonalization indicates that (1) customization may be necessary in some\nscenarios and prompt-tuning generalizes well to those, (2) fine-tuning on\nin-domain data (with as few as 600 samples) still shows some gains, however (3)\nretrieval augmented few-shot selection also outperforms fine-tuning. (4)\nParameter efficient tuning allows for efficient and scalable personalization.\nFor prompt-tuning, we also find that initializing the learned \"soft-prompts\" to\nuser relevant concept tokens leads to higher accuracy than random\ninitialization.\n","authors":["Katrin Tomanek","Shanqing Cai","Subhashini Venugopalan"],"pdf_url":"https://arxiv.org/pdf/2312.14327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14302v1","updated":"2023-12-21T21:22:41Z","published":"2023-12-21T21:22:41Z","title":"Exploiting Novel GPT-4 APIs","summary":" Language model attacks typically assume one of two extreme threat models:\nfull white-box access to model weights, or black-box access limited to a text\ngeneration API. However, real-world APIs are often more flexible than just text\ngeneration: these APIs expose ``gray-box'' access leading to new threat\nvectors. To explore this, we red-team three new functionalities exposed in the\nGPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that\nfine-tuning a model on as few as 15 harmful examples or 100 benign examples can\nremove core safeguards from GPT-4, enabling a range of harmful outputs.\nFurthermore, we find that GPT-4 Assistants readily divulge the function call\nschema and can be made to execute arbitrary function calls. Finally, we find\nthat knowledge retrieval can be hijacked by injecting instructions into\nretrieval documents. These vulnerabilities highlight that any additions to the\nfunctionality exposed by an API can create new vulnerabilities.\n","authors":["Kellin Pelrine","Mohammad Taufeeque","Michał Zając","Euan McLean","Adam Gleave"],"pdf_url":"https://arxiv.org/pdf/2312.14302v1.pdf","comment":"10 pages, 1 figure, 4 tables"},{"id":"http://arxiv.org/abs/2312.14279v1","updated":"2023-12-21T20:17:01Z","published":"2023-12-21T20:17:01Z","title":"Characterizing and Classifying Developer Forum Posts with their\n Intentions","summary":" With the rapid growth of the developer community, the amount of posts on\nonline technical forums has been growing rapidly, which poses difficulties for\nusers to filter useful posts and find important information. Tags provide a\nconcise feature dimension for users to locate their interested posts and for\nsearch engines to index the most relevant posts according to the queries.\nHowever, most tags are only focused on the technical perspective (e.g., program\nlanguage, platform, tool). In most cases, forum posts in online developer\ncommunities reveal the author's intentions to solve a problem, ask for advice,\nshare information, etc. The modeling of the intentions of posts can provide an\nextra dimension to the current tag taxonomy. By referencing previous studies\nand learning from industrial perspectives, we create a refined taxonomy for the\nintentions of technical forum posts. Through manual labeling and analysis on a\nsampled post dataset extracted from online forums, we understand the relevance\nbetween the constitution of posts (code, error messages) and their intentions.\nFurthermore, inspired by our manual study, we design a pre-trained\ntransformer-based model to automatically predict post intentions. The best\nvariant of our intention prediction framework, which achieves a Micro F1-score\nof 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787,\noutperforms the state-of-the-art baseline approach. Our characterization and\nautomated classification of forum posts regarding their intentions may help\nforum maintainers or third-party tool developers improve the organization and\nretrieval of posts on technical forums. We have released our annotated dataset\nand codes in our supplementary material package.\n","authors":["Xingfang Wu","Eric Laufer","Heng Li","Foutse Khomh","Santhosh Srinivasan","Jayden Luo"],"pdf_url":"https://arxiv.org/pdf/2312.14279v1.pdf","comment":"39 pages"},{"id":"http://arxiv.org/abs/2312.14226v1","updated":"2023-12-21T16:44:39Z","published":"2023-12-21T16:44:39Z","title":"Deep de Finetti: Recovering Topic Distributions from Large Language\n Models","summary":" Large language models (LLMs) can produce long, coherent passages of text,\nsuggesting that LLMs, although trained on next-word prediction, must represent\nthe latent structure that characterizes a document. Prior work has found that\ninternal representations of LLMs encode one aspect of latent structure, namely\nsyntax; here we investigate a complementary aspect, namely the document's topic\nstructure. We motivate the hypothesis that LLMs capture topic structure by\nconnecting LLM optimization to implicit Bayesian inference. De Finetti's\ntheorem shows that exchangeable probability distributions can be represented as\na mixture with respect to a latent generating distribution. Although text is\nnot exchangeable at the level of syntax, exchangeability is a reasonable\nstarting assumption for topic structure. We thus hypothesize that predicting\nthe next token in text will lead LLMs to recover latent topic distributions. We\nexamine this hypothesis using Latent Dirichlet Allocation (LDA), an\nexchangeable probabilistic topic model, as a target, and we show that the\nrepresentations formed by LLMs encode both the topics used to generate\nsynthetic data and those used to explain natural corpus data.\n","authors":["Liyi Zhang","R. Thomas McCoy","Theodore R. Sumers","Jian-Qiao Zhu","Thomas L. Griffiths"],"pdf_url":"https://arxiv.org/pdf/2312.14226v1.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.07661v2","updated":"2023-12-21T12:08:55Z","published":"2023-12-12T19:00:04Z","title":"CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor","summary":" Existing open-vocabulary image segmentation methods require a fine-tuning\nstep on mask annotations and/or image-text datasets. Mask labels are\nlabor-intensive, which limits the number of categories in segmentation\ndatasets. As a result, the open-vocabulary capacity of pre-trained VLMs is\nseverely reduced after fine-tuning. However, without fine-tuning, VLMs trained\nunder weak image-text supervision tend to make suboptimal mask predictions when\nthere are text queries referring to non-existing concepts in the image. To\nalleviate these issues, we introduce a novel recurrent framework that\nprogressively filters out irrelevant texts and enhances mask quality without\ntraining efforts. The recurrent unit is a two-stage segmenter built upon a VLM\nwith frozen weights. Thus, our model retains the VLM's broad vocabulary space\nand strengthens its segmentation capability. Experimental results show that our\nmethod outperforms not only the training-free counterparts, but also those\nfine-tuned with millions of additional data samples, and sets new\nstate-of-the-art records for both zero-shot semantic and referring image\nsegmentation tasks. Specifically, we improve the current record by 28.8, 16.0,\nand 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.\n","authors":["Shuyang Sun","Runjia Li","Philip Torr","Xiuye Gu","Siyang Li"],"pdf_url":"https://arxiv.org/pdf/2312.07661v2.pdf","comment":"Project page: https://torrvision.com/clip_as_rnn/"},{"id":"http://arxiv.org/abs/2312.14215v1","updated":"2023-12-21T12:05:19Z","published":"2023-12-21T12:05:19Z","title":"SimLM: Can Language Models Infer Parameters of Physical Systems?","summary":" Recent developments in large-scale machine learning models for\ngeneral-purpose understanding, translation and generation of language are\ndriving impact across a variety of sectors including medicine, robotics, and\nscientific discovery. The strength of such Large Language Models (LLMs) stems\nfrom the large corpora that they are trained with. While this imbues them with\na breadth of capabilities, they have been found unsuitable for some specific\ntypes of problems such as advanced mathematics. In this paper, we highlight the\ninability of LLMs to reason about physics tasks. We demonstrate that their\nability to infer parameters of physical systems can be improved, without\nretraining, by augmenting their context with feedback from physical simulation.\n","authors":["Sean Memery","Mirella Lapata","Kartic Subr"],"pdf_url":"https://arxiv.org/pdf/2312.14215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14211v1","updated":"2023-12-21T10:19:58Z","published":"2023-12-21T10:19:58Z","title":"Experimenting with Large Language Models and vector embeddings in NASA\n SciX","summary":" Open-source Large Language Models enable projects such as NASA SciX (i.e.,\nNASA ADS) to think out of the box and try alternative approaches for\ninformation retrieval and data augmentation, while respecting data copyright\nand users' privacy. However, when large language models are directly prompted\nwith questions without any context, they are prone to hallucination. At NASA\nSciX we have developed an experiment where we created semantic vectors for our\nlarge collection of abstracts and full-text content, and we designed a prompt\nsystem to ask questions using contextual chunks from our system. Based on a\nnon-systematic human evaluation, the experiment shows a lower degree of\nhallucination and better responses when using Retrieval Augmented Generation.\nFurther exploration is required to design new features and data augmentation\nprocesses at NASA SciX that leverages this technology while respecting the high\nlevel of trust and quality that the project holds.\n","authors":["Sergi Blanco-Cuaresma","Ioana Ciucă","Alberto Accomazzi","Michael J. Kurtz","Edwin A. Henneken","Kelly E. Lockhart","Felix Grezes","Thomas Allen","Golnaz Shapurian","Carolyn S. Grant","Donna M. Thompson","Timothy W. Hostetler","Matthew R. Templeton","Shinyi Chen","Jennifer Koch","Taylor Jacovich","Daniel Chivvis","Fernanda de Macedo Alves","Jean-Claude Paquin","Jennifer Bartlett","Mugdha Polimera","Stephanie Jarmak"],"pdf_url":"https://arxiv.org/pdf/2312.14211v1.pdf","comment":"To appear in the proceedings of the 33th annual international\n Astronomical Data Analysis Software & Systems (ADASS XXXIII)"},{"id":"http://arxiv.org/abs/2312.14203v1","updated":"2023-12-21T05:08:57Z","published":"2023-12-21T05:08:57Z","title":"Shai: A large language model for asset management","summary":" This paper introduces \"Shai\" a 10B level large language model specifically\ndesigned for the asset management industry, built upon an open-source\nfoundational model. With continuous pre-training and fine-tuning using a\ntargeted corpus, Shai demonstrates enhanced performance in tasks relevant to\nits domain, outperforming baseline models. Our research includes the\ndevelopment of an innovative evaluation framework, which integrates\nprofessional qualification exams, tailored tasks, open-ended question\nanswering, and safety assessments, to comprehensively assess Shai's\ncapabilities. Furthermore, we discuss the challenges and implications of\nutilizing large language models like GPT-4 for performance assessment in asset\nmanagement, suggesting a combination of automated evaluation and human\njudgment. Shai's development, showcasing the potential and versatility of\n10B-level large language models in the financial sector with significant\nperformance and modest computational requirements, hopes to provide practical\ninsights and methodologies to assist industry peers in their similar endeavors.\n","authors":["Zhongyang Guo","Guanran Jiang","Zhongdan Zhang","Peng Li","Zhefeng Wang","Yinchun Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14203v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14202v1","updated":"2023-12-21T04:57:21Z","published":"2023-12-21T04:57:21Z","title":"Illuminating the Black Box: A Psychometric Investigation into the\n Multifaceted Nature of Large Language Models","summary":" This study explores the idea of AI Personality or AInality suggesting that\nLarge Language Models (LLMs) exhibit patterns similar to human personalities.\nAssuming that LLMs share these patterns with humans, we investigate using\nhuman-centered psychometric tests such as the Myers-Briggs Type Indicator\n(MBTI), Big Five Inventory (BFI), and Short Dark Triad (SD3) to identify and\nconfirm LLM personality types. By introducing role-play prompts, we demonstrate\nthe adaptability of LLMs, showing their ability to switch dynamically between\ndifferent personality types. Using projective tests, such as the Washington\nUniversity Sentence Completion Test (WUSCT), we uncover hidden aspects of LLM\npersonalities that are not easily accessible through direct questioning.\nProjective tests allowed for a deep exploration of LLMs cognitive processes and\nthought patterns and gave us a multidimensional view of AInality. Our machine\nlearning analysis revealed that LLMs exhibit distinct AInality traits and\nmanifest diverse personality types, demonstrating dynamic shifts in response to\nexternal instructions. This study pioneers the application of projective tests\non LLMs, shedding light on their diverse and adaptable AInality traits.\n","authors":["Yang Lu","Jordan Yu","Shou-Hsuan Stephen Huang"],"pdf_url":"https://arxiv.org/pdf/2312.14202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16502v3","updated":"2023-12-21T04:06:49Z","published":"2023-11-27T17:33:21Z","title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning\n Benchmark for Expert AGI","summary":" We introduce MMMU: a new benchmark designed to evaluate multimodal models on\nmassive multi-discipline tasks demanding college-level subject knowledge and\ndeliberate reasoning. MMMU includes 11.5K meticulously collected multimodal\nquestions from college exams, quizzes, and textbooks, covering six core\ndisciplines: Art & Design, Business, Science, Health & Medicine, Humanities &\nSocial Science, and Tech & Engineering. These questions span 30 subjects and\n183 subfields, comprising 30 highly heterogeneous image types, such as charts,\ndiagrams, maps, tables, music sheets, and chemical structures. Unlike existing\nbenchmarks, MMMU focuses on advanced perception and reasoning with\ndomain-specific knowledge, challenging models to perform tasks akin to those\nfaced by experts. The evaluation of 14 open-source LMMs as well as the\nproprietary GPT-4V(ision) and Gemini highlights the substantial challenges\nposed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve\naccuracies of 56% and 59% respectively, indicating significant room for\nimprovement. We believe MMMU will stimulate the community to build\nnext-generation multimodal foundation models towards expert artificial general\nintelligence.\n","authors":["Xiang Yue","Yuansheng Ni","Kai Zhang","Tianyu Zheng","Ruoqi Liu","Ge Zhang","Samuel Stevens","Dongfu Jiang","Weiming Ren","Yuxuan Sun","Cong Wei","Botao Yu","Ruibin Yuan","Renliang Sun","Ming Yin","Boyuan Zheng","Zhenzhu Yang","Yibo Liu","Wenhao Huang","Huan Sun","Yu Su","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2311.16502v3.pdf","comment":"117 pages, 99 figures"},{"id":"http://arxiv.org/abs/2312.14197v1","updated":"2023-12-21T01:08:39Z","published":"2023-12-21T01:08:39Z","title":"Benchmarking and Defending Against Indirect Prompt Injection Attacks on\n Large Language Models","summary":" Recent remarkable advancements in large language models (LLMs) have led to\ntheir widespread adoption in various applications. A key feature of these\napplications is the combination of LLMs with external content, where user\ninstructions and third-party content are combined to create prompts for LLM\nprocessing. These applications, however, are vulnerable to indirect prompt\ninjection attacks, where malicious instructions embedded within external\ncontent compromise LLM's output, causing their responses to deviate from user\nexpectations. Despite the discovery of this security issue, no comprehensive\nanalysis of indirect prompt injection attacks on different LLMs is available\ndue to the lack of a benchmark. Furthermore, no effective defense has been\nproposed.\n In this work, we introduce the first benchmark, BIPIA, to measure the\nrobustness of various LLMs and defenses against indirect prompt injection\nattacks. Our experiments reveal that LLMs with greater capabilities exhibit\nmore vulnerable to indirect prompt injection attacks for text tasks, resulting\nin a higher ASR. We hypothesize that indirect prompt injection attacks are\nmainly due to the LLMs' inability to distinguish between instructions and\nexternal content. Based on this conjecture, we propose four black-box methods\nbased on prompt learning and a white-box defense methods based on fine-tuning\nwith adversarial training to enable LLMs to distinguish between instructions\nand external content and ignore instructions in the external content. Our\nexperimental results show that our black-box defense methods can effectively\nreduce ASR but cannot completely thwart indirect prompt injection attacks,\nwhile our white-box defense method can reduce ASR to nearly zero with little\nadverse impact on the LLM's performance on general tasks. We hope that our\nbenchmark and defenses can inspire future work in this important area.\n","authors":["Jingwei Yi","Yueqi Xie","Bin Zhu","Keegan Hines","Emre Kiciman","Guangzhong Sun","Xing Xie","Fangzhao Wu"],"pdf_url":"https://arxiv.org/pdf/2312.14197v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.14157v1","updated":"2023-12-21T18:59:57Z","published":"2023-12-21T18:59:57Z","title":"3D Pose Estimation of Two Interacting Hands from a Monocular Event\n Camera","summary":" 3D hand tracking from a monocular video is a very challenging problem due to\nhand interactions, occlusions, left-right hand ambiguity, and fast motion. Most\nexisting methods rely on RGB inputs, which have severe limitations under\nlow-light conditions and suffer from motion blur. In contrast, event cameras\ncapture local brightness changes instead of full image frames and do not suffer\nfrom the described effects. Unfortunately, existing image-based techniques\ncannot be directly applied to events due to significant differences in the data\nmodalities. In response to these challenges, this paper introduces the first\nframework for 3D tracking of two fast-moving and interacting hands from a\nsingle monocular event camera. Our approach tackles the left-right hand\nambiguity with a novel semi-supervised feature-wise attention mechanism and\nintegrates an intersection loss to fix hand collisions. To facilitate advances\nin this research domain, we release a new synthetic large-scale dataset of two\ninteracting hands, Ev2Hands-S, and a new real benchmark with real event streams\nand ground-truth 3D annotations, Ev2Hands-R. Our approach outperforms existing\nmethods in terms of the 3D reconstruction accuracy and generalises to real data\nunder severe light conditions.\n","authors":["Christen Millerdurai","Diogo Luvizon","Viktor Rudnev","André Jonas","Jiayi Wang","Christian Theobalt","Vladislav Golyanik"],"pdf_url":"https://arxiv.org/pdf/2312.14157v1.pdf","comment":"17 pages, 12 figures, 7 tables; project page:\n https://4dqv.mpi-inf.mpg.de/Ev2Hands/"},{"id":"http://arxiv.org/abs/2312.14154v1","updated":"2023-12-21T18:59:30Z","published":"2023-12-21T18:59:30Z","title":"Virtual Pets: Animatable Animal Generation in 3D Scenes","summary":" Toward unlocking the potential of generative models in immersive 4D\nexperiences, we introduce Virtual Pet, a novel pipeline to model realistic and\ndiverse motions for target animal species within a 3D environment. To\ncircumvent the limited availability of 3D motion data aligned with\nenvironmental geometry, we leverage monocular internet videos and extract\ndeformable NeRF representations for the foreground and static NeRF\nrepresentations for the background. For this, we develop a reconstruction\nstrategy, encompassing species-level shared template learning and per-video\nfine-tuning. Utilizing the reconstructed data, we then train a conditional 3D\nmotion model to learn the trajectory and articulation of foreground animals in\nthe context of 3D backgrounds. We showcase the efficacy of our pipeline with\ncomprehensive qualitative and quantitative evaluations using cat videos. We\nalso demonstrate versatility across unseen cats and indoor environments,\nproducing temporally coherent 4D outputs for enriched virtual experiences.\n","authors":["Yen-Chi Cheng","Chieh Hubert Lin","Chaoyang Wang","Yash Kant","Sergey Tulyakov","Alexander Schwing","Liangyan Gui","Hsin-Ying Lee"],"pdf_url":"https://arxiv.org/pdf/2312.14154v1.pdf","comment":"Preprint. Project page: https://yccyenchicheng.github.io/VirtualPets/"},{"id":"http://arxiv.org/abs/2312.14150v1","updated":"2023-12-21T18:59:12Z","published":"2023-12-21T18:59:12Z","title":"DriveLM: Driving with Graph Visual Question Answering","summary":" We study how vision-language models (VLMs) trained on web-scale data can be\nintegrated into end-to-end driving systems to boost generalization and enable\ninteractivity with human users. While recent approaches adapt VLMs to driving\nvia single-round visual question answering (VQA), human drivers reason about\ndecisions in multiple steps. Starting from the localization of key objects,\nhumans estimate object interactions before taking actions. The key insight is\nthat with our proposed task, Graph VQA, where we model graph-structured\nreasoning through perception, prediction and planning question-answer pairs, we\nobtain a suitable proxy task to mimic the human reasoning process. We\ninstantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose\na VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA\nand end-to-end driving. The experiments demonstrate that Graph VQA provides a\nsimple, principled framework for reasoning about a driving scene, and\nDriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent\nbaseline performs end-to-end autonomous driving competitively in comparison to\nstate-of-the-art driving-specific architectures. Notably, its benefits are\npronounced when it is evaluated zero-shot on unseen objects or sensor\nconfigurations. We hope this work can be the starting point to shed new light\non how to apply VLMs for autonomous driving. To facilitate future research, all\ncode, data, and models are available to the public.\n","authors":["Chonghao Sima","Katrin Renz","Kashyap Chitta","Li Chen","Hanxue Zhang","Chengen Xie","Ping Luo","Andreas Geiger","Hongyang Li"],"pdf_url":"https://arxiv.org/pdf/2312.14150v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14149v1","updated":"2023-12-21T18:59:06Z","published":"2023-12-21T18:59:06Z","title":"TagAlign: Improving Vision-Language Alignment with Multi-Tag\n Classification","summary":" The crux of learning vision-language models is to extract semantically\naligned information from visual and linguistic data. Existing attempts usually\nface the problem of coarse alignment, \\textit{e.g.}, the vision encoder\nstruggles in localizing an attribute-specified object. In this work, we propose\nan embarrassingly simple approach to better align image and text features with\nno need of additional data formats other than image-text pairs. Concretely,\ngiven an image and its paired text, we manage to parse objects (\\textit{e.g.},\ncat) and attributes (\\textit{e.g.}, black) from the description, which are\nhighly likely to exist in the image. It is noteworthy that the parsing pipeline\nis fully automatic and thus enjoys good scalability. With these parsed\nsemantics as supervision signals, we can complement the commonly used\nimage-text contrastive loss with the multi-tag classification loss. Extensive\nexperimental results on a broad suite of semantic segmentation datasets\nsubstantiate the average 3.65\\% improvement of our framework over existing\nalternatives. Furthermore, the visualization results indicate that attribute\nsupervision makes vision-language models accurately localize\nattribute-specified objects. Project page can be found at\nhttps://qinying-liu.github.io/Tag-Align/\n","authors":["Qinying Liu","Kecheng Zheng","Wu Wei","Zhan Tong","Yu Liu","Wei Chen","Zilei Wang","Yujun Shen"],"pdf_url":"https://arxiv.org/pdf/2312.14149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14140v1","updated":"2023-12-21T18:57:52Z","published":"2023-12-21T18:57:52Z","title":"HeadCraft: Modeling High-Detail Shape Variations for Animated 3DMMs","summary":" Current advances in human head modeling allow to generate plausible-looking\n3D head models via neural representations. Nevertheless, constructing complete\nhigh-fidelity head models with explicitly controlled animation remains an\nissue. Furthermore, completing the head geometry based on a partial\nobservation, e.g. coming from a depth sensor, while preserving details is often\nproblematic for the existing methods. We introduce a generative model for\ndetailed 3D head meshes on top of an articulated 3DMM which allows explicit\nanimation and high-detail preservation at the same time. Our method is trained\nin two stages. First, we register a parametric head model with vertex\ndisplacements to each mesh of the recently introduced NPHM dataset of accurate\n3D head scans. The estimated displacements are baked into a hand-crafted UV\nlayout. Second, we train a StyleGAN model in order to generalize over the UV\nmaps of displacements. The decomposition of the parametric model and\nhigh-quality vertex displacements allows us to animate the model and modify it\nsemantically. We demonstrate the results of unconditional generation and\nfitting to the full or partial observation. The project page is available at\nhttps://seva100.github.io/headcraft.\n","authors":["Artem Sevastopolsky","Philip-William Grassal","Simon Giebenhain","ShahRukh Athar","Luisa Verdoliva","Matthias Niessner"],"pdf_url":"https://arxiv.org/pdf/2312.14140v1.pdf","comment":"Project page: https://seva100.github.io/headcraft. Video:\n https://youtu.be/uBeBT2f1CL0. 23 pages, 19 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.14138v1","updated":"2023-12-21T18:57:12Z","published":"2023-12-21T18:57:12Z","title":"Revisiting Foreground and Background Separation in Weakly-supervised\n Temporal Action Localization: A Clustering-based Approach","summary":" Weakly-supervised temporal action localization aims to localize action\ninstances in videos with only video-level action labels. Existing methods\nmainly embrace a localization-by-classification pipeline that optimizes the\nsnippet-level prediction with a video classification loss. However, this\nformulation suffers from the discrepancy between classification and detection,\nresulting in inaccurate separation of foreground and background (F\\&B)\nsnippets. To alleviate this problem, we propose to explore the underlying\nstructure among the snippets by resorting to unsupervised snippet clustering,\nrather than heavily relying on the video classification loss. Specifically, we\npropose a novel clustering-based F\\&B separation algorithm. It comprises two\ncore components: a snippet clustering component that groups the snippets into\nmultiple latent clusters and a cluster classification component that further\nclassifies the cluster as foreground or background. As there are no\nground-truth labels to train these two components, we introduce a unified\nself-labeling mechanism based on optimal transport to produce high-quality\npseudo-labels that match several plausible prior distributions. This ensures\nthat the cluster assignments of the snippets can be accurately associated with\ntheir F\\&B labels, thereby boosting the F\\&B separation. We evaluate our method\non three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves\npromising performance on all three benchmarks while being significantly more\nlightweight than previous methods. Code is available at\nhttps://github.com/Qinying-Liu/CASE\n","authors":["Qinying Liu","Zilei Wang","Shenghai Rong","Junjie Li","Yixin Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.14138v1.pdf","comment":"ICCV2023"},{"id":"http://arxiv.org/abs/2312.14135v1","updated":"2023-12-21T18:55:06Z","published":"2023-12-21T18:55:06Z","title":"$\\textit{V}^*$: Guided Visual Search as a Core Mechanism in Multimodal\n LLMs","summary":" When we look around and perform complex tasks, how we see and selectively\nprocess what we see is crucial. However, the lack of this visual search\nmechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on\nimportant visual details, especially when handling high-resolution and visually\ncrowded images. To address this, we introduce $\\textit{V}^*$, an LLM-guided\nvisual search mechanism that employs the world knowledge in LLMs for efficient\nvisual querying. When combined with an MLLM, this mechanism enhances\ncollaborative reasoning, contextual understanding, and precise targeting of\nspecific visual elements. This integration results in a new MLLM\nmeta-architecture, named $\\textbf{S}$how, s$\\textbf{EA}$rch, and\nTel$\\textbf{L}$ (SEAL). We further create $\\textit{V}^*$Bench, a benchmark\nspecifically designed to evaluate MLLMs in their ability to process\nhigh-resolution images and focus on visual details. Our study highlights the\nnecessity of incorporating visual search capabilities into multimodal systems.\nThe code is available https://github.com/penghao-wu/vstar.\n","authors":["Penghao Wu","Saining Xie"],"pdf_url":"https://arxiv.org/pdf/2312.14135v1.pdf","comment":"Project page: https://vstar-seal.github.io/"},{"id":"http://arxiv.org/abs/2312.14134v1","updated":"2023-12-21T18:55:05Z","published":"2023-12-21T18:55:05Z","title":"Diffusion Reward: Learning Rewards via Conditional Video Diffusion","summary":" Learning rewards from expert videos offers an affordable and effective\nsolution to specify the intended behaviors for reinforcement learning tasks. In\nthis work, we propose Diffusion Reward, a novel framework that learns rewards\nfrom expert videos via conditional video diffusion models for solving complex\nvisual RL problems. Our key insight is that lower generative diversity is\nobserved when conditioned on expert trajectories. Diffusion Reward is\naccordingly formalized by the negative of conditional entropy that encourages\nproductive exploration of expert-like behaviors. We show the efficacy of our\nmethod over 10 robotic manipulation tasks from MetaWorld and Adroit with visual\ninput and sparse reward. Moreover, Diffusion Reward could even solve unseen\ntasks successfully and effectively, largely surpassing baseline methods.\nProject page and code: https://diffusion-reward.github.io/.\n","authors":["Tao Huang","Guangqi Jiang","Yanjie Ze","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14134v1.pdf","comment":"Project page and code: https://diffusion-reward.github.io/"},{"id":"http://arxiv.org/abs/2312.14132v1","updated":"2023-12-21T18:52:14Z","published":"2023-12-21T18:52:14Z","title":"DUSt3R: Geometric 3D Vision Made Easy","summary":" Multi-view stereo reconstruction (MVS) in the wild requires to first estimate\nthe camera parameters e.g. intrinsic and extrinsic parameters. These are\nusually tedious and cumbersome to obtain, yet they are mandatory to triangulate\ncorresponding pixels in 3D space, which is the core of all best performing MVS\nalgorithms. In this work, we take an opposite stance and introduce DUSt3R, a\nradically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction\nof arbitrary image collections, i.e. operating without prior information about\ncamera calibration nor viewpoint poses. We cast the pairwise reconstruction\nproblem as a regression of pointmaps, relaxing the hard constraints of usual\nprojective camera models. We show that this formulation smoothly unifies the\nmonocular and binocular reconstruction cases. In the case where more than two\nimages are provided, we further propose a simple yet effective global alignment\nstrategy that expresses all pairwise pointmaps in a common reference frame. We\nbase our network architecture on standard Transformer encoders and decoders,\nallowing us to leverage powerful pretrained models. Our formulation directly\nprovides a 3D model of the scene as well as depth information, but\ninterestingly, we can seamlessly recover from it, pixel matches, relative and\nabsolute camera. Exhaustive experiments on all these tasks showcase that the\nproposed DUSt3R can unify various 3D vision tasks and set new SoTAs on\nmonocular/multi-view depth estimation as well as relative pose estimation. In\nsummary, DUSt3R makes many geometric 3D vision tasks easy.\n","authors":["Shuzhe Wang","Vincent Leroy","Yohann Cabon","Boris Chidlovskii","Jerome Revaud"],"pdf_url":"https://arxiv.org/pdf/2312.14132v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14126v1","updated":"2023-12-21T18:47:12Z","published":"2023-12-21T18:47:12Z","title":"Entropic Open-set Active Learning","summary":" Active Learning (AL) aims to enhance the performance of deep models by\nselecting the most informative samples for annotation from a pool of unlabeled\ndata. Despite impressive performance in closed-set settings, most AL methods\nfail in real-world scenarios where the unlabeled data contains unknown\ncategories. Recently, a few studies have attempted to tackle the AL problem for\nthe open-set setting. However, these methods focus more on selecting known\nsamples and do not efficiently utilize unknown samples obtained during AL\nrounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which\nleverages both known and unknown distributions effectively to select\ninformative samples during AL rounds. Specifically, our approach employs two\ndifferent entropy scores. One measures the uncertainty of a sample with respect\nto the known-class distributions. The other measures the uncertainty of the\nsample with respect to the unknown-class distributions. By utilizing these two\nentropy scores we effectively separate the known and unknown samples from the\nunlabeled data resulting in better sampling. Through extensive experiments, we\nshow that the proposed method outperforms existing state-of-the-art methods on\nCIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at\n\\url{https://github.com/bardisafa/EOAL}.\n","authors":["Bardia Safaei","Vibashan VS","Celso M. de Melo","Vishal M. Patel"],"pdf_url":"https://arxiv.org/pdf/2312.14126v1.pdf","comment":"Accepted in AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14125v1","updated":"2023-12-21T18:46:41Z","published":"2023-12-21T18:46:41Z","title":"VideoPoet: A Large Language Model for Zero-Shot Video Generation","summary":" We present VideoPoet, a language model capable of synthesizing high-quality\nvideo, with matching audio, from a large variety of conditioning signals.\nVideoPoet employs a decoder-only transformer architecture that processes\nmultimodal inputs -- including images, videos, text, and audio. The training\nprotocol follows that of Large Language Models (LLMs), consisting of two\nstages: pretraining and task-specific adaptation. During pretraining, VideoPoet\nincorporates a mixture of multimodal generative objectives within an\nautoregressive Transformer framework. The pretrained LLM serves as a foundation\nthat can be adapted for a range of video generation tasks. We present empirical\nresults demonstrating the model's state-of-the-art capabilities in zero-shot\nvideo generation, specifically highlighting VideoPoet's ability to generate\nhigh-fidelity motions. Project page: http://sites.research.google/videopoet/\n","authors":["Dan Kondratyuk","Lijun Yu","Xiuye Gu","José Lezama","Jonathan Huang","Rachel Hornung","Hartwig Adam","Hassan Akbari","Yair Alon","Vighnesh Birodkar","Yong Cheng","Ming-Chang Chiu","Josh Dillon","Irfan Essa","Agrim Gupta","Meera Hahn","Anja Hauth","David Hendon","Alonso Martinez","David Minnen","David Ross","Grant Schindler","Mikhail Sirotenko","Kihyuk Sohn","Krishna Somandepalli","Huisheng Wang","Jimmy Yan","Ming-Hsuan Yang","Xuan Yang","Bryan Seybold","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.14125v1.pdf","comment":"Project page: http://sites.research.google/videopoet/"},{"id":"http://arxiv.org/abs/2312.14124v1","updated":"2023-12-21T18:46:27Z","published":"2023-12-21T18:46:27Z","title":"Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance\n Generation","summary":" Controllable generation of 3D assets is important for many practical\napplications like content creation in movies, games and engineering, as well as\nin AR/VR. Recently, diffusion models have shown remarkable results in\ngeneration quality of 3D objects. However, none of the existing models enable\ndisentangled generation to control the shape and appearance separately. For the\nfirst time, we present a suitable representation for 3D diffusion models to\nenable such disentanglement by introducing a hybrid point cloud and neural\nradiance field approach. We model a diffusion process over point positions\njointly with a high-dimensional feature space for a local density and radiance\ndecoder. While the point positions represent the coarse shape of the object,\nthe point features allow modeling the geometry and appearance details. This\ndisentanglement enables us to sample both independently and therefore to\ncontrol both separately. Our approach sets a new state of the art in generation\ncompared to previous disentanglement-capable methods by reduced FID scores of\n30-90% and is on-par with other non disentanglement-capable state-of-the art\nmethods.\n","authors":["Philipp Schröppel","Christopher Wewer","Jan Eric Lenssen","Eddy Ilg","Thomas Brox"],"pdf_url":"https://arxiv.org/pdf/2312.14124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14115v1","updated":"2023-12-21T18:40:34Z","published":"2023-12-21T18:40:34Z","title":"LingoQA: Video Question Answering for Autonomous Driving","summary":" Autonomous driving has long faced a challenge with public acceptance due to\nthe lack of explainability in the decision-making process. Video\nquestion-answering (QA) in natural language provides the opportunity for\nbridging this gap. Nonetheless, evaluating the performance of Video QA models\nhas proved particularly tough due to the absence of comprehensive benchmarks.\nTo fill this gap, we introduce LingoQA, a benchmark specifically for autonomous\ndriving Video QA. The LingoQA trainable metric demonstrates a 0.95 Spearman\ncorrelation coefficient with human evaluations. We introduce a Video QA dataset\nof central London consisting of 419k samples that we release with the paper. We\nestablish a baseline vision-language model and run extensive ablation studies\nto understand its performance.\n","authors":["Ana-Maria Marcu","Long Chen","Jan Hünermann","Alice Karnsund","Benoit Hanotte","Prajwal Chidananda","Saurabh Nair","Vijay Badrinarayanan","Alex Kendall","Jamie Shotton","Oleg Sinavski"],"pdf_url":"https://arxiv.org/pdf/2312.14115v1.pdf","comment":"Benchmark and dataset are available at\n https://github.com/wayveai/LingoQA/"},{"id":"http://arxiv.org/abs/2307.00764v2","updated":"2023-12-21T18:28:31Z","published":"2023-07-03T06:02:15Z","title":"Hierarchical Open-vocabulary Universal Image Segmentation","summary":" Open-vocabulary image segmentation aims to partition an image into semantic\nregions according to arbitrary text descriptions. However, complex visual\nscenes can be naturally decomposed into simpler parts and abstracted at\nmultiple levels of granularity, introducing inherent segmentation ambiguity.\nUnlike existing methods that typically sidestep this ambiguity and treat it as\nan external factor, our approach actively incorporates a hierarchical\nrepresentation encompassing different semantic-levels into the learning\nprocess. We propose a decoupled text-image fusion mechanism and representation\nlearning modules for both \"things\" and \"stuff\". Additionally, we systematically\nexamine the differences that exist in the textual and visual features between\nthese types of categories. Our resulting model, named HIPIE, tackles\nHIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a\nunified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,\nPascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the\nstate-of-the-art results at various levels of image comprehension, including\nsemantic-level (e.g., semantic segmentation), instance-level (e.g.,\npanoptic/referring segmentation and object detection), as well as part-level\n(e.g., part/subpart segmentation) tasks. Our code is released at\nhttps://github.com/berkeley-hipie/HIPIE.\n","authors":["Xudong Wang","Shufan Li","Konstantinos Kallidromitis","Yusuke Kato","Kazuki Kozuka","Trevor Darrell"],"pdf_url":"https://arxiv.org/pdf/2307.00764v2.pdf","comment":"Project web-page:\n http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/; NeurIPS 2023\n Camera-ready"},{"id":"http://arxiv.org/abs/2312.13016v2","updated":"2023-12-21T18:26:21Z","published":"2023-12-20T13:31:11Z","title":"DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View\n Synthesis","summary":" We present DiffPortrait3D, a conditional diffusion model that is capable of\nsynthesizing 3D-consistent photo-realistic novel views from as few as a single\nin-the-wild portrait. Specifically, given a single RGB input, we aim to\nsynthesize plausible but consistent facial details rendered from novel camera\nviews with retained both identity and facial expression. In lieu of\ntime-consuming optimization and fine-tuning, our zero-shot method generalizes\nwell to arbitrary face portraits with unposed camera views, extreme facial\nexpressions, and diverse artistic depictions. At its core, we leverage the\ngenerative prior of 2D diffusion models pre-trained on large-scale image\ndatasets as our rendering backbone, while the denoising is guided with\ndisentangled attentive control of appearance and camera pose. To achieve this,\nwe first inject the appearance context from the reference image into the\nself-attention layers of the frozen UNets. The rendering view is then\nmanipulated with a novel conditional control module that interprets the camera\npose by watching a condition image of a crossed subject from the same view.\nFurthermore, we insert a trainable cross-view attention module to enhance view\nconsistency, which is further strengthened with a novel 3D-aware noise\ngeneration process during inference. We demonstrate state-of-the-art results\nboth qualitatively and quantitatively on our challenging in-the-wild and\nmulti-view benchmarks.\n","authors":["Yuming Gu","Hongyi Xu","You Xie","Guoxian Song","Yichun Shi","Di Chang","Jing Yang","Linjie Luo"],"pdf_url":"https://arxiv.org/pdf/2312.13016v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07915v5","updated":"2023-12-21T18:24:08Z","published":"2023-06-13T17:18:01Z","title":"Image Captioners Are Scalable Vision Learners Too","summary":" Contrastive pretraining on image-text pairs from the web is one of the most\npopular large-scale pretraining strategies for vision backbones, especially in\nthe context of large multimodal models. At the same time, image captioning on\nthis type of data is commonly considered an inferior pretraining strategy. In\nthis paper, we perform a fair comparison of these two pretraining strategies,\ncarefully matching training data, compute, and model capacity. Using a standard\nencoder-decoder transformer, we find that captioning alone is surprisingly\neffective: on classification tasks, captioning produces vision encoders\ncompetitive with contrastively pretrained encoders, while surpassing them on\nvision & language tasks. We further analyze the effect of the model\narchitecture and scale, as well as the pretraining data on the representation\nquality, and find that captioning exhibits the same or better scaling behavior\nalong these axes. Overall our results show that plain image captioning is a\nmore powerful pretraining strategy than was previously believed.\n","authors":["Michael Tschannen","Manoj Kumar","Andreas Steiner","Xiaohua Zhai","Neil Houlsby","Lucas Beyer"],"pdf_url":"https://arxiv.org/pdf/2306.07915v5.pdf","comment":"Accepted at NeurIPS 2023. v2 adds SugarCrepe results and more\n ablations, v3 has minor fixes. v4 adds a code link (\n https://github.com/google-research/big_vision ). v5 has minor fixes"},{"id":"http://arxiv.org/abs/2310.14859v3","updated":"2023-12-21T18:19:58Z","published":"2023-10-23T12:29:10Z","title":"3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for\n Embodied Turn-Taking Prediction","summary":" Predicting turn-taking in multiparty conversations has many practical\napplications in human-computer/robot interaction. However, the complexity of\nhuman communication makes it a challenging task. Recent advances have shown\nthat synchronous multi-perspective egocentric data can significantly improve\nturn-taking prediction compared to asynchronous, single-perspective\ntranscriptions. Building on this research, we propose a new multimodal\ntransformer-based architecture for predicting turn-taking in embodied,\nsynchronized multi-perspective data. Our experimental results on the recently\nintroduced EgoCom dataset show a substantial performance improvement of up to\n14.01% on average compared to existing baselines and alternative\ntransformer-based approaches. The source code, and the pre-trained models of\nour 3M-Transformer will be available upon acceptance.\n","authors":["Mehdi Fatan","Emanuele Mincato","Dimitra Pintzou","Mariella Dimiccoli"],"pdf_url":"https://arxiv.org/pdf/2310.14859v3.pdf","comment":"Accepted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2305.16150v3","updated":"2023-12-21T18:16:33Z","published":"2023-05-25T15:20:10Z","title":"Unifying GANs and Score-Based Diffusion as Generative Particle Models","summary":" Particle-based deep generative models, such as gradient flows and score-based\ndiffusion models, have recently gained traction thanks to their striking\nperformance. Their principle of displacing particle distributions using\ndifferential equations is conventionally seen as opposed to the previously\nwidespread generative adversarial networks (GANs), which involve training a\npushforward generator network. In this paper we challenge this interpretation,\nand propose a novel framework that unifies particle and adversarial generative\nmodels by framing generator training as a generalization of particle models.\nThis suggests that a generator is an optional addition to any such generative\nmodel. Consequently, integrating a generator into a score-based diffusion model\nand training a GAN without a generator naturally emerge from our framework. We\nempirically test the viability of these original models as proofs of concepts\nof potential applications of our framework.\n","authors":["Jean-Yves Franceschi","Mike Gartrell","Ludovic Dos Santos","Thibaut Issenhuth","Emmanuel de Bézenac","Mickaël Chen","Alain Rakotomamonjy"],"pdf_url":"https://arxiv.org/pdf/2305.16150v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14091v1","updated":"2023-12-21T18:09:30Z","published":"2023-12-21T18:09:30Z","title":"HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image\n Inpainting with Diffusion Models","summary":" Recent progress in text-guided image inpainting, based on the unprecedented\nsuccess of text-to-image diffusion models, has led to exceptionally realistic\nand visually plausible results. However, there is still significant potential\nfor improvement in current text-to-image inpainting models, particularly in\nbetter aligning the inpainted area with user prompts and performing\nhigh-resolution inpainting. Therefore, in this paper we introduce HD-Painter, a\ncompletely training-free approach that accurately follows to prompts and\ncoherently scales to high-resolution image inpainting. To this end, we design\nthe Prompt-Aware Introverted Attention (PAIntA) layer enhancing self-attention\nscores by prompt information and resulting in better text alignment\ngenerations. To further improve the prompt coherence we introduce the\nReweighting Attention Score Guidance (RASG) mechanism seamlessly integrating a\npost-hoc sampling strategy into general form of DDIM to prevent\nout-of-distribution latent shifts. Moreover, HD-Painter allows extension to\nlarger scales by introducing a specialized super-resolution technique\ncustomized for inpainting, enabling the completion of missing regions in images\nof up to 2K resolution. Our experiments demonstrate that HD-Painter surpasses\nexisting state-of-the-art approaches qualitatively and quantitatively,\nachieving an impressive generation accuracy improvement of 61.4% vs 51.9%. We\nwill make the codes publicly available at:\nhttps://github.com/Picsart-AI-Research/HD-Painter\n","authors":["Hayk Manukyan","Andranik Sargsyan","Barsegh Atanyan","Zhangyang Wang","Shant Navasardyan","Humphrey Shi"],"pdf_url":"https://arxiv.org/pdf/2312.14091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14074v1","updated":"2023-12-21T17:52:12Z","published":"2023-12-21T17:52:12Z","title":"LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR\n Understanding","summary":" Recently, Large Language Models (LLMs) and Multimodal Large Language Models\n(MLLMs) have shown promise in instruction following and 2D image understanding.\nWhile these models are powerful, they have not yet been developed to comprehend\nthe more challenging 3D physical scenes, especially when it comes to the sparse\noutdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw\nLiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs\nto gain a comprehensive understanding of outdoor 3D scenes. The central insight\nof our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a\nlanguage modeling problem, encompassing tasks such as 3D captioning, 3D\ngrounding, 3D question answering, etc. Specifically, due to the scarcity of 3D\nLiDAR-text pairing data, we introduce a three-stage training strategy and\ngenerate relevant datasets, progressively aligning the 3D modality with the\nlanguage embedding space of LLM. Furthermore, we design a View-Aware\nTransformer (VAT) to connect the 3D encoder with the LLM, which effectively\nbridges the modality gap and enhances the LLM's spatial orientation\ncomprehension of visual features. Our experiments show that LiDAR-LLM possesses\nfavorable capabilities to comprehend various instructions regarding 3D scenes\nand engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the\n3D captioning task and achieves a 63.1\\% classification accuracy and a 14.3\\%\nBEV mIoU on the 3D grounding task. Web page:\nhttps://sites.google.com/view/lidar-llm\n","authors":["Senqiao Yang","Jiaming Liu","Ray Zhang","Mingjie Pan","Zoey Guo","Xiaoqi Li","Zehui Chen","Peng Gao","Yandong Guo","Shanghang Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.14074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.02998v3","updated":"2023-12-21T17:41:55Z","published":"2022-10-06T15:38:02Z","title":"ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical\n Prior Probability Maps for Thoracic Disease Classification","summary":" Objective: Computer-aided disease diagnosis and prognosis based on medical\nimages is a rapidly emerging field. Many Convolutional Neural Network (CNN)\narchitectures have been developed by researchers for disease classification and\nlocalization from chest X-ray images. It is known that different thoracic\ndisease lesions are more likely to occur in specific anatomical regions\ncompared to others. This article aims to incorporate this disease and\nregion-dependent prior probability distribution within a deep learning\nframework. Methods: We present the ThoraX-PriorNet, a novel attention-based CNN\nmodel for thoracic disease classification. We first estimate a\ndisease-dependent spatial probability, i.e., an anatomical prior, that\nindicates the probability of occurrence of a disease in a specific region in a\nchest X-ray image. Next, we develop a novel attention-based classification\nmodel that combines information from the estimated anatomical prior and\nautomatically extracted chest region of interest (ROI) masks to provide\nattention to the feature maps generated from a deep convolution network. Unlike\nprevious works that utilize various self-attention mechanisms, the proposed\nmethod leverages the extracted chest ROI masks along with the probabilistic\nanatomical prior information, which selects the region of interest for\ndifferent diseases to provide attention. Results: The proposed method shows\nsuperior performance in disease classification on the NIH ChestX-ray14 dataset\ncompared to existing state-of-the-art methods while reaching an area under the\nROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior\nattention method shows competitive performance compared to state-of-the-art\nmethods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04\nwith an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5,\n0.6, and 0.7, respectively.\n","authors":["Md. Iqbal Hossain","Mohammad Zunaed","Md. Kawsar Ahmed","S. M. Jawwad Hossain","Anwarul Hasan","Taufiq Hasan"],"pdf_url":"https://arxiv.org/pdf/2210.02998v3.pdf","comment":"Accepted to IEEE ACCESS"},{"id":"http://arxiv.org/abs/2312.14055v1","updated":"2023-12-21T17:28:09Z","published":"2023-12-21T17:28:09Z","title":"A Strong Baseline for Temporal Video-Text Alignment","summary":" In this paper, we consider the problem of temporally aligning the video and\ntexts from instructional videos, specifically, given a long-term video, and\nassociated text sentences, our goal is to determine their corresponding\ntimestamps in the video. To this end, we establish a simple, yet strong model\nthat adopts a Transformer-based architecture with all texts as queries,\niteratively attending to the visual features, to infer the optimal timestamp.\nWe conduct thorough experiments to investigate: (i) the effect of upgrading ASR\nsystems to reduce errors from speech recognition, (ii) the effect of various\nvisual-textual backbones, ranging from CLIP to S3D, to the more recent\nInternVideo, (iii) the effect of transforming noisy ASR transcripts into\ndescriptive steps by prompting a large language model (LLM), to summarize the\ncore activities within the ASR transcript as a new training dataset. As a\nresult, our proposed simple model demonstrates superior performance on both\nnarration alignment and procedural step grounding tasks, surpassing existing\nstate-of-the-art methods by a significant margin on three public benchmarks,\nnamely, 9.3% on HT-Step, 3.4% on HTM-Align and 4.7% on CrossTask. We believe\nthe proposed model and dataset with descriptive steps can be treated as a\nstrong baseline for future research in temporal video-text alignment. All\ncodes, models, and the resulting dataset will be publicly released to the\nresearch community.\n","authors":["Zeqian Li","Qirui Chen","Tengda Han","Ya Zhang","Yanfeng Wang","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2312.14055v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14053v1","updated":"2023-12-21T17:23:49Z","published":"2023-12-21T17:23:49Z","title":"Dual Attention U-Net with Feature Infusion: Pushing the Boundaries of\n Multiclass Defect Segmentation","summary":" The proposed architecture, Dual Attentive U-Net with Feature Infusion (DAU-FI\nNet), addresses challenges in semantic segmentation, particularly on multiclass\nimbalanced datasets with limited samples. DAU-FI Net integrates multiscale\nspatial-channel attention mechanisms and feature injection to enhance precision\nin object localization. The core employs a multiscale depth-separable\nconvolution block, capturing localized patterns across scales. This block is\ncomplemented by a spatial-channel squeeze and excitation (scSE) attention unit,\nmodeling inter-dependencies between channels and spatial regions in feature\nmaps. Additionally, additive attention gates refine segmentation by connecting\nencoder-decoder pathways.\n To augment the model, engineered features using Gabor filters for textural\nanalysis, Sobel and Canny filters for edge detection are injected guided by\nsemantic masks to expand the feature space strategically. Comprehensive\nexperiments on a challenging sewer pipe and culvert defect dataset and a\nbenchmark dataset validate DAU-FI Net's capabilities. Ablation studies\nhighlight incremental benefits from attention blocks and feature injection.\nDAU-FI Net achieves state-of-the-art mean Intersection over Union (IoU) of\n95.6% and 98.8% on the defect test set and benchmark respectively, surpassing\nprior methods by 8.9% and 12.6%, respectively. Ablation studies highlight\nincremental benefits from attention blocks and feature injection. The proposed\narchitecture provides a robust solution, advancing semantic segmentation for\nmulticlass problems with limited training data. Our sewer-culvert defects\ndataset, featuring pixel-level annotations, opens avenues for further research\nin this crucial domain. Overall, this work delivers key innovations in\narchitecture, attention, and feature engineering to elevate semantic\nsegmentation efficacy.\n","authors":["Rasha Alshawi","Md Tamjidul Hoque","Md Meftahul Ferdaus","Mahdi Abdelguerfi","Kendall Niles","Ken Prathak","Joe Tom","Jordan Klein","Murtada Mousa","Johny Javier Lopez"],"pdf_url":"https://arxiv.org/pdf/2312.14053v1.pdf","comment":"under review in IEEE Transactions on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2306.09077v2","updated":"2023-12-21T17:07:20Z","published":"2023-06-15T12:10:27Z","title":"Estimating Generic 3D Room Structures from 2D Annotations","summary":" Indoor rooms are among the most common use cases in 3D scene understanding.\nCurrent state-of-the-art methods for this task are driven by large annotated\ndatasets. Room layouts are especially important, consisting of structural\nelements in 3D, such as wall, floor, and ceiling. However, they are difficult\nto annotate, especially on pure RGB video. We propose a novel method to produce\ngeneric 3D room layouts just from 2D segmentation masks, which are easy to\nannotate for humans. Based on these 2D annotations, we automatically\nreconstruct 3D plane equations for the structural elements and their spatial\nextent in the scene, and connect adjacent elements at the appropriate contact\nedges. We annotate and publicly release 2246 3D room layouts on the\nRealEstate10k dataset, containing YouTube videos. We demonstrate the high\nquality of these 3D layouts annotations with extensive experiments.\n","authors":["Denys Rozumnyi","Stefan Popov","Kevis-Kokitsi Maninis","Matthias Nießner","Vittorio Ferrari"],"pdf_url":"https://arxiv.org/pdf/2306.09077v2.pdf","comment":"https://github.com/google-research/cad-estate Accepted at 37th\n Conference on Neural Information Processing Systems (NeurIPS 2023) Track on\n Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2312.14024v1","updated":"2023-12-21T16:54:09Z","published":"2023-12-21T16:54:09Z","title":"Geometric Awareness in Neural Fields for 3D Human Registration","summary":" Aligning a template to 3D human point clouds is a long-standing problem\ncrucial for tasks like animation, reconstruction, and enabling supervised\nlearning pipelines. Recent data-driven methods leverage predicted surface\ncorrespondences; however, they are not robust to varied poses or distributions.\nIn contrast, industrial solutions often rely on expensive manual annotations or\nmulti-view capturing systems. Recently, neural fields have shown promising\nresults, but their purely data-driven nature lacks geometric awareness, often\nresulting in a trivial misalignment of the template registration. In this work,\nwe propose two solutions: LoVD, a novel neural field model that predicts the\ndirection towards the localized SMPL vertices on the target surface; and INT,\nthe first self-supervised task dedicated to neural fields that, at test time,\nrefines the backbone, exploiting the target geometry. We combine them into\nINLoVD, a robust 3D Human body registration pipeline trained on a large MoCap\ndataset. INLoVD is efficient (takes less than a minute), solidly achieves the\nstate of the art over public benchmarks, and provides unprecedented\ngeneralization on out-of-distribution data. We will release code and\ncheckpoints in \\url{url}.\n","authors":["Riccardo Marin","Enric Corona","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2312.14024v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.14981v2","updated":"2023-12-21T16:45:12Z","published":"2023-11-25T09:53:42Z","title":"Multi-task Planar Reconstruction with Feature Warping Guidance","summary":" Piece-wise planar 3D reconstruction simultaneously segments plane instances\nand recovers their 3D plane parameters from an image, which is particularly\nuseful for indoor or man-made environments. Efficient reconstruction of 3D\nplanes coupled with semantic predictions offers advantages for a wide range of\napplications requiring scene understanding and concurrent spatial mapping.\nHowever, most existing planar reconstruction models either neglect semantic\npredictions or do not run efficiently enough for real-time applications. We\nintroduce SOLOPlanes, a real-time planar reconstruction model based on a\nmodified instance segmentation architecture which simultaneously predicts\nsemantics for each plane instance, along with plane parameters and piece-wise\nplane instance masks. We achieve an improvement in instance mask segmentation\nby including multi-view guidance for plane predictions in the training process.\nThis cross-task improvement, training for plane prediction but improving the\nmask segmentation, is due to the nature of feature sharing in multi-task\nlearning. Our model simultaneously predicts semantics using single images at\ninference time, while achieving real-time predictions at 43 FPS.\n","authors":["Luan Wei","Anna Hilsmann","Peter Eisert"],"pdf_url":"https://arxiv.org/pdf/2311.14981v2.pdf","comment":"For code, see https://github.com/fraunhoferhhi/SOLOPlanes"},{"id":"http://arxiv.org/abs/2312.14001v1","updated":"2023-12-21T16:35:11Z","published":"2023-12-21T16:35:11Z","title":"Deep Learning Based Face Recognition Method using Siamese Network","summary":" Achieving state-of-the-art results in face verification systems typically\nhinges on the availability of labeled face training data, a resource that often\nproves challenging to acquire in substantial quantities. In this research\nendeavor, we proposed employing Siamese networks for face recognition,\neliminating the need for labeled face images. We achieve this by strategically\nleveraging negative samples alongside nearest neighbor counterparts, thereby\nestablishing positive and negative pairs through an unsupervised methodology.\nThe architectural framework adopts a VGG encoder, trained as a double branch\nsiamese network. Our primary aim is to circumvent the necessity for labeled\nface image data, thus proposing the generation of training pairs in an entirely\nunsupervised manner. Positive training data are selected within a dataset based\non their highest cosine similarity scores with a designated anchor, while\nnegative training data are culled in a parallel fashion, though drawn from an\nalternate dataset. During training, the proposed siamese network conducts\nbinary classification via cross-entropy loss. Subsequently, during the testing\nphase, we directly extract face verification scores from the network's output\nlayer. Experimental results reveal that the proposed unsupervised system\ndelivers a performance on par with a similar but fully supervised baseline.\n","authors":["Enoch Solomon","Abraham Woubie","Eyael Solomon Emiru"],"pdf_url":"https://arxiv.org/pdf/2312.14001v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13993v1","updated":"2023-12-21T16:28:08Z","published":"2023-12-21T16:28:08Z","title":"Open-Set: ID Card Presentation Attack Detection using Neural Transfer\n Style","summary":" The accurate detection of ID card Presentation Attacks (PA) is becoming\nincreasingly important due to the rising number of online/remote services that\nrequire the presentation of digital photographs of ID cards for digital\nonboarding or authentication. Furthermore, cybercriminals are continuously\nsearching for innovative ways to fool authentication systems to gain\nunauthorized access to these services. Although advances in neural network\ndesign and training have pushed image classification to the state of the art,\none of the main challenges faced by the development of fraud detection systems\nis the curation of representative datasets for training and evaluation. The\nhandcrafted creation of representative presentation attack samples often\nrequires expertise and is very time-consuming, thus an automatic process of\nobtaining high-quality data is highly desirable. This work explores ID card\nPresentation Attack Instruments (PAI) in order to improve the generation of\nsamples with four Generative Adversarial Networks (GANs) based image\ntranslation models and analyses the effectiveness of the generated data for\ntraining fraud detection systems. Using open-source data, we show that\nsynthetic attack presentations are an adequate complement for additional real\nattack presentations, where we obtain an EER performance increase of 0.63%\npoints for print attacks and a loss of 0.29% for screen capture attacks.\n","authors":["Reuben Markham","Juan M. Espin","Mario Nieto-Hidalgo","Juan E. Tapia"],"pdf_url":"https://arxiv.org/pdf/2312.13993v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12559v4","updated":"2023-12-21T16:24:00Z","published":"2023-09-22T01:06:16Z","title":"Invariant Learning via Probability of Sufficient and Necessary Causes","summary":" Out-of-distribution (OOD) generalization is indispensable for learning models\nin the wild, where testing distribution typically unknown and different from\nthe training. Recent methods derived from causality have shown great potential\nin achieving OOD generalization. However, existing methods mainly focus on the\ninvariance property of causes, while largely overlooking the property of\n\\textit{sufficiency} and \\textit{necessity} conditions. Namely, a necessary but\ninsufficient cause (feature) is invariant to distribution shift, yet it may not\nhave required accuracy. By contrast, a sufficient yet unnecessary cause\n(feature) tends to fit specific data well but may have a risk of adapting to a\nnew domain. To capture the information of sufficient and necessary causes, we\nemploy a classical concept, the probability of sufficiency and necessary causes\n(PNS), which indicates the probability of whether one is the necessary and\nsufficient cause. To associate PNS with OOD generalization, we propose PNS risk\nand formulate an algorithm to learn representation with a high PNS value. We\ntheoretically analyze and prove the generalizability of the PNS risk.\nExperiments on both synthetic and real-world benchmarks demonstrate the\neffectiveness of the proposed method. The details of the implementation can be\nfound at the GitHub repository: https://github.com/ymy4323460/CaSN.\n","authors":["Mengyue Yang","Zhen Fang","Yonggang Zhang","Yali Du","Furui Liu","Jean-Francois Ton","Jianhong Wang","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2309.12559v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08638v2","updated":"2023-12-21T16:22:44Z","published":"2023-08-16T19:20:06Z","title":"Fair GANs through model rebalancing for extremely imbalanced class\n distributions","summary":" Deep generative models require large amounts of training data. This often\nposes a problem as the collection of datasets can be expensive and difficult,\nin particular datasets that are representative of the appropriate underlying\ndistribution (e.g. demographic). This introduces biases in datasets which are\nfurther propagated in the models. We present an approach to construct an\nunbiased generative adversarial network (GAN) from an existing biased GAN by\nrebalancing the model distribution. We do so by generating balanced data from\nan existing imbalanced deep generative model using an evolutionary algorithm\nand then using this data to train a balanced generative model. Additionally, we\npropose a bias mitigation loss function that minimizes the deviation of the\nlearned class distribution from being equiprobable. We show results for the\nStyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset\nfor racial fairness and see that the proposed approach improves on the fairness\nmetric by almost 5 times, whilst maintaining image quality. We further validate\nour approach by applying it to an imbalanced CIFAR10 dataset where we show that\nwe can obtain comparable fairness and image quality as when training on a\nbalanced CIFAR10 dataset which is also twice as large. Lastly, we argue that\nthe traditionally used image quality metrics such as Frechet inception distance\n(FID) are unsuitable for scenarios where the class distributions are imbalanced\nand a balanced reference set is not available.\n","authors":["Anubhav Jain","Nasir Memon","Julian Togelius"],"pdf_url":"https://arxiv.org/pdf/2308.08638v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13980v1","updated":"2023-12-21T16:10:33Z","published":"2023-12-21T16:10:33Z","title":"Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion\n Models with RL Finetuning","summary":" Recent advancements in the text-to-3D task leverage finetuned text-to-image\ndiffusion models to generate multi-view images, followed by NeRF\nreconstruction. Yet, existing supervised finetuned (SFT) diffusion models still\nsuffer from multi-view inconsistency and the resulting NeRF artifacts. Although\ntraining longer with SFT improves consistency, it also causes distribution\nshift, which reduces diversity and realistic details. We argue that the SFT of\nmulti-view diffusion models resembles the instruction finetuning stage of the\nLLM alignment pipeline and can benefit from RL finetuning (RLFT) methods.\nEssentially, RLFT methods optimize models beyond their SFT data distribution by\nusing their own outputs, effectively mitigating distribution shift. To this\nend, we introduce Carve3D, a RLFT method coupled with the Multi-view\nReconstruction Consistency (MRC) metric, to improve the consistency of\nmulti-view diffusion models. To compute MRC on a set of multi-view images, we\ncompare them with their corresponding renderings of the reconstructed NeRF at\nthe same viewpoints. We validate the robustness of MRC with extensive\nexperiments conducted under controlled inconsistency levels. We enhance the\nbase RLFT algorithm to stabilize the training process, reduce distribution\nshift, and identify scaling laws. Through qualitative and quantitative\nexperiments, along with a user study, we demonstrate Carve3D's improved\nmulti-view consistency, the resulting superior NeRF reconstruction quality, and\nminimal distribution shift compared to longer SFT. Project webpage:\nhttps://desaixie.github.io/carve-3d.\n","authors":["Desai Xie","Jiahao Li","Hao Tan","Xin Sun","Zhixin Shu","Yi Zhou","Sai Bi","Sören Pirk","Arie E. Kaufman"],"pdf_url":"https://arxiv.org/pdf/2312.13980v1.pdf","comment":"Project webpage: https://desaixie.github.io/carve-3d"},{"id":"http://arxiv.org/abs/2312.13977v1","updated":"2023-12-21T16:04:45Z","published":"2023-12-21T16:04:45Z","title":"NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse\n Input Views","summary":" Recently, neural implicit functions have demonstrated remarkable results in\nthe field of multi-view reconstruction. However, most existing methods are\ntailored for dense views and exhibit unsatisfactory performance when dealing\nwith sparse views. Several latest methods have been proposed for generalizing\nimplicit reconstruction to address the sparse view reconstruction task, but\nthey still suffer from high training costs and are merely valid under carefully\nselected perspectives. In this paper, we propose a novel sparse view\nreconstruction framework that leverages on-surface priors to achieve highly\nfaithful surface reconstruction. Specifically, we design several constraints on\nglobal geometry alignment and local geometry refinement for jointly optimizing\ncoarse shapes and fine details. To achieve this, we train a neural network to\nlearn a global implicit field from the on-surface points obtained from SfM and\nthen leverage it as a coarse geometric constraint. To exploit local geometric\nconsistency, we project on-surface points onto seen and unseen views, treating\nthe consistent loss of projected features as a fine geometric constraint. The\nexperimental results with DTU and BlendedMVS datasets in two prevalent sparse\nsettings demonstrate significant improvements over the state-of-the-art\nmethods.\n","authors":["Han Huang","Yulun Wu","Junsheng Zhou","Ge Gao","Ming Gu","Yushen Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13977v1.pdf","comment":"Accepted by AAAI 2024. Project page:\n https://alvin528.github.io/NeuSurf/"},{"id":"http://arxiv.org/abs/2312.13964v1","updated":"2023-12-21T15:51:12Z","published":"2023-12-21T15:51:12Z","title":"PIA: Your Personalized Image Animator via Plug-and-Play Modules in\n Text-to-Image Models","summary":" Recent advancements in personalized text-to-image (T2I) models have\nrevolutionized content creation, empowering non-experts to generate stunning\nimages with unique styles. While promising, adding realistic motions into these\npersonalized images by text poses significant challenges in preserving distinct\nstyles, high-fidelity details, and achieving motion controllability by text. In\nthis paper, we present PIA, a Personalized Image Animator that excels in\naligning with condition images, achieving motion controllability by text, and\nthe compatibility with various personalized T2I models without specific tuning.\nTo achieve these goals, PIA builds upon a base T2I model with well-trained\ntemporal alignment layers, allowing for the seamless transformation of any\npersonalized T2I model into an image animation model. A key component of PIA is\nthe introduction of the condition module, which utilizes the condition frame\nand inter-frame affinity as input to transfer appearance information guided by\nthe affinity hint for individual frame synthesis in the latent space. This\ndesign mitigates the challenges of appearance-related image alignment within\nand allows for a stronger focus on aligning with motion-related guidance.\n","authors":["Yiming Zhang","Zhening Xing","Yanhong Zeng","Youqing Fang","Kai Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13964v1.pdf","comment":"Project page: https://pi-animator.github.io/"},{"id":"http://arxiv.org/abs/2312.13941v1","updated":"2023-12-21T15:32:49Z","published":"2023-12-21T15:32:49Z","title":"Controllable 3D Face Generation with Conditional Style Code Diffusion","summary":" Generating photorealistic 3D faces from given conditions is a challenging\ntask. Existing methods often rely on time-consuming one-by-one optimization\napproaches, which are not efficient for modeling the same distribution content,\ne.g., faces. Additionally, an ideal controllable 3D face generation model\nshould consider both facial attributes and expressions. Thus we propose a novel\napproach called TEx-Face(TExt & Expression-to-Face) that addresses these\nchallenges by dividing the task into three components, i.e., 3D GAN Inversion,\nConditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion,\nwe introduce two methods which aim to enhance the representation of style codes\nand alleviate 3D inconsistencies. Furthermore, we design a style code denoiser\nto incorporate multiple conditions into the style code and propose a data\naugmentation strategy to address the issue of insufficient paired\nvisual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and\nCelebA-Dialog demonstrate the promising performance of our TEx-Face in\nachieving the efficient and controllable generation of photorealistic 3D faces.\nThe code will be available at https://github.com/sxl142/TEx-Face.\n","authors":["Xiaolong Shen","Jianxin Ma","Chang Zhou","Zongxin Yang"],"pdf_url":"https://arxiv.org/pdf/2312.13941v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2309.07277v2","updated":"2023-12-21T15:26:22Z","published":"2023-09-13T19:33:26Z","title":"Limitations of Face Image Generation","summary":" Text-to-image diffusion models have achieved widespread popularity due to\ntheir unprecedented image generation capability. In particular, their ability\nto synthesize and modify human faces has spurred research into using generated\nface images in both training data augmentation and model performance\nassessments. In this paper, we study the efficacy and shortcomings of\ngenerative models in the context of face generation. Utilizing a combination of\nqualitative and quantitative measures, including embedding-based metrics and\nuser studies, we present a framework to audit the characteristics of generated\nfaces conditioned on a set of social attributes. We applied our framework on\nfaces generated through state-of-the-art text-to-image diffusion models. We\nidentify several limitations of face image generation that include faithfulness\nto the text prompt, demographic disparities, and distributional shifts.\nFurthermore, we present an analytical model that provides insights into how\ntraining data selection contributes to the performance of generative models.\n","authors":["Harrison Rosenberg","Shimaa Ahmed","Guruprasad V Ramesh","Ramya Korlakai Vinayak","Kassem Fawaz"],"pdf_url":"https://arxiv.org/pdf/2309.07277v2.pdf","comment":"Accepted to The 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI 2024)"},{"id":"http://arxiv.org/abs/2311.03830v2","updated":"2023-12-21T15:18:34Z","published":"2023-11-07T09:19:28Z","title":"Reducing Spatial Fitting Error in Distillation of Denoising Diffusion\n Models","summary":" Denoising Diffusion models have exhibited remarkable capabilities in image\ngeneration. However, generating high-quality samples requires a large number of\niterations. Knowledge distillation for diffusion models is an effective method\nto address this limitation with a shortened sampling process but causes\ndegraded generative quality. Based on our analysis with bias-variance\ndecomposition and experimental observations, we attribute the degradation to\nthe spatial fitting error occurring in the training of both the teacher and\nstudent model. Accordingly, we propose $\\textbf{S}$patial\n$\\textbf{F}$itting-$\\textbf{E}$rror $\\textbf{R}$eduction\n$\\textbf{D}$istillation model ($\\textbf{SFERD}$). SFERD utilizes attention\nguidance from the teacher model and a designed semantic gradient predictor to\nreduce the student's fitting error. Empirically, our proposed model facilitates\nhigh-quality sample generation in a few function evaluations. We achieve an FID\nof 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\\times$64 with only one step,\noutperforming existing diffusion methods. Our study provides a new perspective\non diffusion distillation by highlighting the intrinsic denoising ability of\nmodels. Project link: \\url{https://github.com/Sainzerjj/SFERD}.\n","authors":["Shengzhe Zhou","Zejian Lee","Shengyuan Zhang","Lefan Hou","Changyuan Yang","Guang Yang","Zhiyuan Yang","Lingyun Sun"],"pdf_url":"https://arxiv.org/pdf/2311.03830v2.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2303.06088v5","updated":"2023-12-21T15:16:32Z","published":"2023-03-10T17:09:04Z","title":"Towards domain-invariant Self-Supervised Learning with Batch Styles\n Standardization","summary":" In Self-Supervised Learning (SSL), models are typically pretrained,\nfine-tuned, and evaluated on the same domains. However, they tend to perform\npoorly when evaluated on unseen domains, a challenge that Unsupervised Domain\nGeneralization (UDG) seeks to address. Current UDG methods rely on domain\nlabels, which are often challenging to collect, and domain-specific\narchitectures that lack scalability when confronted with numerous domains,\nmaking the current methodology impractical and rigid. Inspired by\ncontrastive-based UDG methods that mitigate spurious correlations by\nrestricting comparisons to examples from the same domain, we hypothesize that\neliminating style variability within a batch could provide a more convenient\nand flexible way to reduce spurious correlations without requiring domain\nlabels. To verify this hypothesis, we introduce Batch Styles Standardization\n(BSS), a relatively simple yet powerful Fourier-based method to standardize the\nstyle of images in a batch specifically designed for integration with SSL\nmethods to tackle UDG. Combining BSS with existing SSL methods offers serious\nadvantages over prior UDG methods: (1) It eliminates the need for domain labels\nor domain-specific network components to enhance domain-invariance in SSL\nrepresentations, and (2) offers flexibility as BSS can be seamlessly integrated\nwith diverse contrastive-based but also non-contrastive-based SSL methods.\nExperiments on several UDG datasets demonstrate that it significantly improves\ndownstream task performances on unseen domains, often outperforming or rivaling\nwith UDG methods. Finally, this work clarifies the underlying mechanisms\ncontributing to BSS's effectiveness in improving domain-invariance in SSL\nrepresentations and performances on unseen domain.\n","authors":["Marin Scalbert","Maria Vakalopoulou","Florent Couzinié-Devy"],"pdf_url":"https://arxiv.org/pdf/2303.06088v5.pdf","comment":"Under review as conference paper"},{"id":"http://arxiv.org/abs/2310.19583v3","updated":"2023-12-21T15:14:22Z","published":"2023-10-30T14:41:53Z","title":"GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View\n Stereo","summary":" Traditional multi-view stereo (MVS) methods rely heavily on photometric and\ngeometric consistency constraints, but newer machine learning-based MVS methods\ncheck geometric consistency across multiple source views only as a\npost-processing step. In this paper, we present a novel approach that\nexplicitly encourages geometric consistency of reference view depth maps across\nmultiple source views at different scales during learning (see Fig. 1). We find\nthat adding this geometric consistency loss significantly accelerates learning\nby explicitly penalizing geometrically inconsistent pixels, reducing the\ntraining iteration requirements to nearly half that of other MVS methods. Our\nextensive experiments show that our approach achieves a new state-of-the-art on\nthe DTU and BlendedMVS datasets, and competitive results on the Tanks and\nTemples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt\nto enforce multi-view, multi-scale geometric consistency during learning.\n","authors":["Vibhas K. Vats","Sripad Joshi","David J. Crandall","Md. Alimoor Reza","Soon-heung Jung"],"pdf_url":"https://arxiv.org/pdf/2310.19583v3.pdf","comment":"Accepted in WACV 2024 Link:\n https://openaccess.thecvf.com/content/WACV2024/html/Vats_GC-MVSNet_Multi-View_Multi-Scale_Geometrically-Consistent_Multi-View_Stereo_WACV_2024_paper.html"},{"id":"http://arxiv.org/abs/2312.13913v1","updated":"2023-12-21T15:01:47Z","published":"2023-12-21T15:01:47Z","title":"Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models","summary":" This paper presents Paint3D, a novel coarse-to-fine generative framework that\nis capable of producing high-resolution, lighting-less, and diverse 2K UV\ntexture maps for untextured 3D meshes conditioned on text or image inputs. The\nkey challenge addressed is generating high-quality textures without embedded\nillumination information, which allows the textures to be re-lighted or\nre-edited within modern graphics pipelines. To achieve this, our method first\nleverages a pre-trained depth-aware 2D diffusion model to generate\nview-conditional images and perform multi-view texture fusion, producing an\ninitial coarse texture map. However, as 2D models cannot fully represent 3D\nshapes and disable lighting effects, the coarse texture map exhibits incomplete\nareas and illumination artifacts. To resolve this, we train separate UV\nInpainting and UVHD diffusion models specialized for the shape-aware refinement\nof incomplete areas and the removal of illumination artifacts. Through this\ncoarse-to-fine process, Paint3D can produce high-quality 2K UV textures that\nmaintain semantic consistency while being lighting-less, significantly\nadvancing the state-of-the-art in texturing 3D objects.\n","authors":["Xianfang Zeng"],"pdf_url":"https://arxiv.org/pdf/2312.13913v1.pdf","comment":"Project Website: https://github.com/OpenTexture/Paint3D"},{"id":"http://arxiv.org/abs/2310.16898v3","updated":"2023-12-21T14:56:46Z","published":"2023-10-25T18:00:26Z","title":"MCUFormer: Deploying Vision Transformers on Microcontrollers with\n Limited Memory","summary":" Due to the high price and heavy energy consumption of GPUs, deploying deep\nmodels on IoT devices such as microcontrollers makes significant contributions\nfor ecological AI. Conventional methods successfully enable convolutional\nneural network inference of high resolution images on microcontrollers, while\nthe framework for vision transformers that achieve the state-of-the-art\nperformance in many vision applications still remains unexplored. In this\npaper, we propose a hardware-algorithm co-optimizations method called MCUFormer\nto deploy vision transformers on microcontrollers with extremely limited\nmemory, where we jointly design transformer architecture and construct the\ninference operator library to fit the memory resource constraint. More\nspecifically, we generalize the one-shot network architecture search (NAS) to\ndiscover the optimal architecture with highest task performance given the\nmemory budget from the microcontrollers, where we enlarge the existing search\nspace of vision transformers by considering the low-rank decomposition\ndimensions and patch resolution for memory reduction. For the construction of\nthe inference operator library of vision transformers, we schedule the memory\nbuffer during inference through operator integration, patch embedding\ndecomposition, and token overwriting, allowing the memory buffer to be fully\nutilized to adapt to the forward pass of the vision transformer. Experimental\nresults demonstrate that our MCUFormer achieves 73.62\\% top-1 accuracy on\nImageNet for image classification with 320KB memory on STM32F746\nmicrocontroller. Code is available at https://github.com/liangyn22/MCUFormer.\n","authors":["Yinan Liang","Ziwei Wang","Xiuwei Xu","Yansong Tang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2310.16898v3.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13906v1","updated":"2023-12-21T14:51:23Z","published":"2023-12-21T14:51:23Z","title":"EfficientPPS: Part-aware Panoptic Segmentation of Transparent Objects\n for Robotic Manipulation","summary":" The use of autonomous robots for assistance tasks in hospitals has the\npotential to free up qualified staff and im-prove patient care. However, the\nubiquity of deformable and transparent objects in hospital settings poses\nsignif-icant challenges to vision-based perception systems. We present\nEfficientPPS, a neural architecture for part-aware panoptic segmentation that\nprovides robots with semantically rich visual information for grasping and\nma-nipulation tasks. We also present an unsupervised data collection and\nlabelling method to reduce the need for human involvement in the training\nprocess. EfficientPPS is evaluated on a dataset containing real-world hospital\nobjects and demonstrated to be robust and efficient in grasping transparent\ntransfusion bags with a collaborative robot arm.\n","authors":["Benjamin Alt","Minh Dang Nguyen","Andreas Hermann","Darko Katic","Rainer Jäkel","Rüdiger Dillmann","Eric Sax"],"pdf_url":"https://arxiv.org/pdf/2312.13906v1.pdf","comment":"8 pages, 8 figures, presented at the 56th International Symposium on\n Robotics (ISR Europe)"},{"id":"http://arxiv.org/abs/2312.13271v2","updated":"2023-12-21T14:20:54Z","published":"2023-12-20T18:51:02Z","title":"Repaint123: Fast and High-quality One Image to 3D Generation with\n Progressive Controllable 2D Repainting","summary":" Recent one image to 3D generation methods commonly adopt Score Distillation\nSampling (SDS). Despite the impressive results, there are multiple deficiencies\nincluding multi-view inconsistency, over-saturated and over-smoothed textures,\nas well as the slow generation speed. To address these deficiencies, we present\nRepaint123 to alleviate multi-view bias as well as texture degradation and\nspeed up the generation process. The core idea is to combine the powerful image\ngeneration capability of the 2D diffusion model and the texture alignment\nability of the repainting strategy for generating high-quality multi-view\nimages with consistency. We further propose visibility-aware adaptive\nrepainting strength for overlap regions to enhance the generated image quality\nin the repainting process. The generated high-quality and multi-view consistent\nimages enable the use of simple Mean Square Error (MSE) loss for fast 3D\ncontent generation. We conduct extensive experiments and show that our method\nhas a superior ability to generate high-quality 3D content with multi-view\nconsistency and fine textures in 2 minutes from scratch. Our webpage is\navailable at https://junwuzhang19.github.io/repaint123/.\n","authors":["Junwu Zhang","Zhenyu Tang","Yatian Pang","Xinhua Cheng","Peng Jin","Yida Wei","Munan Ning","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.13271v2.pdf","comment":"Project page: https://junwuzhang19.github.io/repaint123/"},{"id":"http://arxiv.org/abs/2308.06668v3","updated":"2023-12-21T14:18:54Z","published":"2023-08-13T02:59:36Z","title":"Foundation Models in Smart Agriculture: Basics, Opportunities, and\n Challenges","summary":" The past decade has witnessed the rapid development of ML and DL\nmethodologies in agricultural systems, showcased by great successes in variety\nof agricultural applications. However, these conventional ML/DL models have\ncertain limitations: They heavily rely on large, costly-to-acquire labeled\ndatasets for training, require specialized expertise for development and\nmaintenance, and are mostly tailored for specific tasks, thus lacking\ngeneralizability. Recently, foundation models have demonstrated remarkable\nsuccesses in language and vision tasks across various domains. These models are\ntrained on a vast amount of data from multiple domains and modalities. Once\ntrained, they can accomplish versatile tasks with just minor fine-tuning and\nminimal task-specific labeled data. Despite their proven effectiveness and huge\npotential, there has been little exploration of applying FMs to agriculture\nfields. Therefore, this study aims to explore the potential of FMs in the field\nof smart agriculture. In particular, we present conceptual tools and technical\nbackground to facilitate the understanding of the problem space and uncover new\nresearch directions in this field. To this end, we first review recent FMs in\nthe general computer science domain and categorize them into four categories:\nlanguage FMs, vision FMs, multimodal FMs, and reinforcement learning FMs.\nSubsequently, we outline the process of developing agriculture FMs and discuss\ntheir potential applications in smart agriculture. We also discuss the unique\nchallenges associated with developing AFMs, including model training,\nvalidation, and deployment. Through this study, we contribute to the\nadvancement of AI in agriculture by introducing AFMs as a promising paradigm\nthat can significantly mitigate the reliance on extensive labeled datasets and\nenhance the efficiency, effectiveness, and generalization of agricultural AI\nsystems.\n","authors":["Jiajia Li","Mingle Xu","Lirong Xiang","Dong Chen","Weichao Zhuang","Xunyuan Yin","Zhaojian Li"],"pdf_url":"https://arxiv.org/pdf/2308.06668v3.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.13848v1","updated":"2023-12-21T13:45:02Z","published":"2023-12-21T13:45:02Z","title":"Reducing Hallucinations: Enhancing VQA for Flood Disaster Damage\n Assessment with Visual Contexts","summary":" The zero-shot performance of visual question answering (VQA) models relies\nheavily on prompts. For example, a zero-shot VQA for disaster scenarios could\nleverage well-designed Chain of Thought (CoT) prompts to stimulate the model's\npotential. However, using CoT prompts has some problems, such as causing an\nincorrect answer in the end due to the hallucination in the thought process. In\nthis paper, we propose a zero-shot VQA named Flood Disaster VQA with Two-Stage\nPrompt (VQA-TSP). The model generates the thought process in the first stage\nand then uses the thought process to generate the final answer in the second\nstage. In particular, visual context is added in the second stage to relieve\nthe hallucination problem that exists in the thought process. Experimental\nresults show that our method exceeds the performance of state-of-the-art\nzero-shot VQA models for flood disaster scenarios in total. Our study provides\na research basis for improving the performance of CoT-based zero-shot VQA.\n","authors":["Yimin Sun","Chao Wang","Yan Peng"],"pdf_url":"https://arxiv.org/pdf/2312.13848v1.pdf","comment":"already be accepted by 2024 3rd International Conference on Computer,\n Artificial Intelligence and Control Engineering (CAICE 2024)"},{"id":"http://arxiv.org/abs/2312.13845v1","updated":"2023-12-21T13:42:08Z","published":"2023-12-21T13:42:08Z","title":"Image Clustering using Restricted Boltzman Machine","summary":" In various verification systems, Restricted Boltzmann Machines (RBMs) have\ndemonstrated their efficacy in both front-end and back-end processes. In this\nwork, we propose the use of RBMs to the image clustering tasks. RBMs are\ntrained to convert images into image embeddings. We employ the conventional\nbottom-up Agglomerative Hierarchical Clustering (AHC) technique. To address the\nchallenge of limited test face image data, we introduce Agglomerative\nHierarchical Clustering based Method for Image Clustering using Restricted\nBoltzmann Machine (AHC-RBM) with two major steps. Initially, a universal RBM\nmodel is trained using all available training dataset. Subsequently, we train\nan adapted RBM model using the data from each test image. Finally, RBM vectors\nwhich is the embedding vector is generated by concatenating the\nvisible-to-hidden weight matrices of these adapted models, and the bias\nvectors. These vectors effectively preserve class-specific information and are\nutilized in image clustering tasks. Our experimental results, conducted on two\nbenchmark image datasets (MS-Celeb-1M and DeepFashion), demonstrate that our\nproposed approach surpasses well-known clustering algorithms such as k-means,\nspectral clustering, and approximate Rank-order.\n","authors":["Abraham Woubie","Enoch Solomon","Eyael Solomon Emiru"],"pdf_url":"https://arxiv.org/pdf/2312.13845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13841v1","updated":"2023-12-21T13:40:03Z","published":"2023-12-21T13:40:03Z","title":"Towards Efficient Time Stepping for Numerical Shape Correspondence","summary":" The computation of correspondences between shapes is a principal task in\nshape analysis. To this end, methods based on partial differential equations\n(PDEs) have been established, encompassing e.g. the classic heat kernel\nsignature as well as numerical solution schemes for geometric PDEs. In this\nwork we focus on the latter approach.\n We consider here several time stepping schemes. The goal of this\ninvestigation is to assess, if one may identify a useful property of methods\nfor time integration for the shape analysis context. Thereby we investigate the\ndependence on time step size, since the class of implicit schemes that are\nuseful candidates in this context should ideally yield an invariant behaviour\nwith respect to this parameter.\n To this end we study integration of heat and wave equation on a manifold. In\norder to facilitate this study, we propose an efficient, unified model order\nreduction framework for these models. We show that specific $l_0$ stable\nschemes are favourable for numerical shape analysis. We give an experimental\nevaluation of the methods at hand of classical TOSCA data sets.\n","authors":["Alexander Köhler","Michael Breuß"],"pdf_url":"https://arxiv.org/pdf/2312.13841v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.13839v1","updated":"2023-12-21T13:39:18Z","published":"2023-12-21T13:39:18Z","title":"Q-SENN: Quantized Self-Explaining Neural Networks","summary":" Explanations in Computer Vision are often desired, but most Deep Neural\nNetworks can only provide saliency maps with questionable faithfulness.\nSelf-Explaining Neural Networks (SENN) extract interpretable concepts with\nfidelity, diversity, and grounding to combine them linearly for\ndecision-making. While they can explain what was recognized, initial\nrealizations lack accuracy and general applicability. We propose the\nQuantized-Self-Explaining Neural Network Q-SENN. Q-SENN satisfies or exceeds\nthe desiderata of SENN while being applicable to more complex datasets and\nmaintaining most or all of the accuracy of an uninterpretable baseline model,\nout-performing previous work in all considered metrics. Q-SENN describes the\nrelationship between every class and feature as either positive, negative or\nneutral instead of an arbitrary number of possible relations, enforcing more\nbinary human-friendly features. Since every class is assigned just 5\ninterpretable features on average, Q-SENN shows convincing local and global\ninterpretability. Additionally, we propose a feature alignment method, capable\nof aligning learned features with human language-based concepts without\nadditional supervision. Thus, what is learned can be more easily verbalized.\nThe code is published: https://github.com/ThomasNorr/Q-SENN\n","authors":["Thomas Norrenbrock","Marco Rudolph","Bodo Rosenhahn"],"pdf_url":"https://arxiv.org/pdf/2312.13839v1.pdf","comment":"Accepted to AAAI 2024, SRRAI"},{"id":"http://arxiv.org/abs/2301.01841v3","updated":"2023-12-21T13:34:48Z","published":"2023-01-04T22:20:16Z","title":"Classification of Single Tree Decay Stages from Combined Airborne LiDAR\n Data and CIR Imagery","summary":" Understanding forest health is of great importance for the conservation of\nthe integrity of forest ecosystems. In this regard, evaluating the amount and\nquality of dead wood is of utmost interest as they are favorable indicators of\nbiodiversity. Apparently, remote sensing-based machine learning techniques have\nproven to be more efficient and sustainable with unprecedented accuracy in\nforest inventory. This study, for the first time, automatically categorizing\nindividual coniferous trees (Norway spruce) into five decay stages (live,\ndeclining, dead, loose bark, and clean) from combined airborne laser scanning\n(ALS) point clouds and color infrared (CIR) images using three different\nMachine Learning methods - 3D point cloud-based deep learning (KPConv),\nConvolutional Neural Network (CNN), and Random Forest (RF). First, CIR\ncolorized point clouds are created by fusing the ALS point clouds and color\ninfrared images. Then, individual tree segmentation is conducted, after which\nthe results are further projected onto four orthogonal planes. Finally, the\nclassification is conducted on the two datasets (3D multispectral point clouds\nand 2D projected images) based on the three Machine Learning algorithms. All\nmodels achieved promising results, reaching overall accuracy (OA) of up to\n88.8%, 88.4% and 85.9% for KPConv, CNN and RF, respectively. The experimental\nresults reveal that color information, 3D coordinates, and intensity of point\nclouds have significant impact on the promising classification performance. The\nperformance of our models, therefore, shows the significance of machine/deep\nlearning for individual tree decay stages classification and landscape-wide\nassessment of the dead wood amount and quality by using modern airborne remote\nsensing techniques. The proposed method can contribute as an important and\nreliable tool for monitoring biodiversity in forest ecosystems.\n","authors":["Tsz Chung Wong","Abubakar Sani-Mohammed","Jinhong Wang","Puzuo Wang","Wei Yao","Marco Heurich"],"pdf_url":"https://arxiv.org/pdf/2301.01841v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13832v1","updated":"2023-12-21T13:32:38Z","published":"2023-12-21T13:32:38Z","title":"SyncDreamer for 3D Reconstruction of Endangered Animal Species with NeRF\n and NeuS","summary":" The main aim of this study is to demonstrate how innovative view synthesis\nand 3D reconstruction techniques can be used to create models of endangered\nspecies using monocular RGB images. To achieve this, we employed SyncDreamer to\nproduce unique perspectives and NeuS and NeRF to reconstruct 3D\nrepresentations. We chose four different animals, including the oriental stork,\nfrog, dragonfly, and tiger, as our subjects for this study. Our results show\nthat the combination of SyncDreamer, NeRF, and NeuS techniques can successfully\ncreate 3D models of endangered animals. However, we also observed that NeuS\nproduced blurry images, while NeRF generated sharper but noisier images. This\nstudy highlights the potential of modeling endangered animals and offers a new\ndirection for future research in this field. By showcasing the effectiveness of\nthese advanced techniques, we hope to encourage further exploration and\ndevelopment of techniques for preserving and studying endangered species.\n","authors":["Ahmet Haydar Ornek","Deniz Sen","Esmanur Civil"],"pdf_url":"https://arxiv.org/pdf/2312.13832v1.pdf","comment":"8 figures"},{"id":"http://arxiv.org/abs/2312.11562v3","updated":"2023-12-21T13:21:59Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v3.pdf","comment":"20 Figures, 160 Pages, 750+ References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2312.13822v1","updated":"2023-12-21T13:12:37Z","published":"2023-12-21T13:12:37Z","title":"Universal Noise Annotation: Unveiling the Impact of Noisy annotation on\n Object Detection","summary":" For object detection task with noisy labels, it is important to consider not\nonly categorization noise, as in image classification, but also localization\nnoise, missing annotations, and bogus bounding boxes. However, previous studies\nhave only addressed certain types of noise (e.g., localization or\ncategorization). In this paper, we propose Universal-Noise Annotation (UNA), a\nmore practical setting that encompasses all types of noise that can occur in\nobject detection, and analyze how UNA affects the performance of the detector.\nWe analyzed the development direction of previous works of detection algorithms\nand examined the factors that impact the robustness of detection model learning\nmethod. We open-source the code for injecting UNA into the dataset and all the\ntraining log and weight are also shared.\n","authors":["Kwangrok Ryoo","Yeonsik Jo","Seungjun Lee","Mira Kim","Ahra Jo","Seung Hwan Kim","Seungryong Kim","Soonyoung Lee"],"pdf_url":"https://arxiv.org/pdf/2312.13822v1.pdf","comment":"appendix and code : https://github.com/Ryoo72/UNA"},{"id":"http://arxiv.org/abs/2312.13820v1","updated":"2023-12-21T13:11:57Z","published":"2023-12-21T13:11:57Z","title":"Super-resolution of THz time-domain images based on low-rank\n representation","summary":" Terahertz time-domain spectroscopy (THz-TDS) employs sub-picosecond pulses to\nprobe dielectric properties of materials giving as a result a 3-dimensional\nhyperspectral data cube. The spatial resolution of THz images is primarily\nlimited by two sources: a non-zero THz beam waist and the acquisition step\nsize. Acquisition with a small step size allows for the visualisation of\nsmaller details in images at the expense of acquisition time, but the\nfrequency-dependent point-spread function remains the biggest bottleneck for\nTHz imaging. This work presents a super-resolution approach to restore THz\ntime-domain images acquired with medium-to-big step sizes. The results show the\noptimized and robust performance for different frequency bands (from 0.5 to 3.5\nTHz) obtaining higher resolution and additionally removing effects of blur at\nlower frequencies and noise at higher frequencies.\n","authors":["Marina Ljubenovic","Alessia Artesani","Stefano Bonetti","Arianna Traviglia"],"pdf_url":"https://arxiv.org/pdf/2312.13820v1.pdf","comment":"This work was presented at the Sixth International Workshop on Mobile\n Terahertz Systems (IWMTS)"},{"id":"http://arxiv.org/abs/2305.15194v2","updated":"2023-12-21T12:55:57Z","published":"2023-05-24T14:31:20Z","title":"DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion\n Models","summary":" In this study, we aim to extend the capabilities of diffusion-based\ntext-to-image (T2I) generation models by incorporating diverse modalities\nbeyond textual description, such as sketch, box, color palette, and style\nembedding, within a single model. We thus design a multimodal T2I diffusion\nmodel, coined as DiffBlender, by separating the channels of conditions into\nthree types, i.e., image forms, spatial tokens, and non-spatial tokens. The\nunique architecture of DiffBlender facilitates adding new input modalities,\npioneering a scalable framework for conditional image generation. Notably, we\nachieve this without altering the parameters of the existing generative model,\nStable Diffusion, only with updating partial components. Our study establishes\nnew benchmarks in multimodal generation through quantitative and qualitative\ncomparisons with existing conditional generation methods. We demonstrate that\nDiffBlender faithfully blends all the provided information and showcase its\nvarious applications in the detailed image synthesis.\n","authors":["Sungnyun Kim","Junsoo Lee","Kibeom Hong","Daesik Kim","Namhyuk Ahn"],"pdf_url":"https://arxiv.org/pdf/2305.15194v2.pdf","comment":"Project page: https://sungnyun.github.io/diffblender/"},{"id":"http://arxiv.org/abs/2305.18295v4","updated":"2023-12-21T12:34:22Z","published":"2023-05-29T17:59:41Z","title":"RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths","summary":" Text-to-image generation has recently witnessed remarkable achievements. We\nintroduce a text-conditional image diffusion model, termed RAPHAEL, to generate\nhighly artistic images, which accurately portray the text prompts, encompassing\nmultiple nouns, adjectives, and verbs. This is achieved by stacking tens of\nmixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling\nbillions of diffusion paths (routes) from the network input to the output. Each\npath intuitively functions as a \"painter\" for depicting a particular textual\nconcept onto a specified image region at a diffusion timestep. Comprehensive\nexperiments reveal that RAPHAEL outperforms recent cutting-edge models, such as\nStable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both\nimage quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior\nperformance in switching images across diverse styles, such as Japanese comics,\nrealism, cyberpunk, and ink illustration. Secondly, a single model with three\nbillion parameters, trained on 1,000 A100 GPUs for two months, achieves a\nstate-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore,\nRAPHAEL significantly surpasses its counterparts in human evaluation on the\nViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the\nfrontiers of image generation research in both academia and industry, paving\nthe way for future breakthroughs in this rapidly evolving field. More details\ncan be found on a webpage: https://raphael-painter.github.io/.\n","authors":["Zeyue Xue","Guanglu Song","Qiushan Guo","Boxiao Liu","Zhuofan Zong","Yu Liu","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2305.18295v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13792v1","updated":"2023-12-21T12:32:34Z","published":"2023-12-21T12:32:34Z","title":"An Approach to Colour Morphological Supremum Formation using the\n LogSumExp Approximation","summary":" Mathematical morphology is a part of image processing that has proven to be\nfruitful for numerous applications. Two main operations in mathematical\nmorphology are dilation and erosion. These are based on the construction of a\nsupremum or infimum with respect to an order over the tonal range in a certain\nsection of the image. The tonal ordering can easily be realised in grey-scale\nmorphology, and some morphological methods have been proposed for colour\nmorphology. However, all of these have certain limitations. In this paper we\npresent a novel approach to colour morphology extending upon previous work in\nthe field based on the Loewner order. We propose to consider an approximation\nof the supremum by means of a log-sum exponentiation introduced by Maslov. We\napply this to the embedding of an RGB image in a field of symmetric $2\\times2$\nmatrices. In this way we obtain nearly isotropic matrices representing colours\nand the structural advantage of transitivity. In numerical experiments we\nhighlight some remarkable properties of the proposed approach.\n","authors":["Marvin Kahra","Michael Breuß","Andreas Kleefeld","Martin Welk"],"pdf_url":"https://arxiv.org/pdf/2312.13792v1.pdf","comment":"12 pages, 28 figures, submitted to IAPR Third International\n Conference on Discrete Geometry and Mathematical Morphology"},{"id":"http://arxiv.org/abs/2312.13789v1","updated":"2023-12-21T12:26:11Z","published":"2023-12-21T12:26:11Z","title":"TinySAM: Pushing the Envelope for Efficient Segment Anything Model","summary":" Recently segment anything model (SAM) has shown powerful segmentation\ncapability and has drawn great attention in computer vision fields. Massive\nfollowing works have developed various applications based on the pretrained SAM\nand achieved impressive performance on downstream vision tasks. However, SAM\nconsists of heavy architectures and requires massive computational capacity,\nwhich hinders the further application of SAM on computation constrained edge\ndevices. To this end, in this paper we propose a framework to obtain a tiny\nsegment anything model (TinySAM) while maintaining the strong zero-shot\nperformance. We first propose a full-stage knowledge distillation method with\nonline hard prompt sampling strategy to distill a lightweight student model. We\nalso adapt the post-training quantization to the promptable segmentation task\nand further reduce the computational cost. Moreover, a hierarchical segmenting\neverything strategy is proposed to accelerate the everything inference by\n$2\\times$ with almost no performance degradation. With all these proposed\nmethods, our TinySAM leads to orders of magnitude computational reduction and\npushes the envelope for efficient segment anything task. Extensive experiments\non various zero-shot transfer tasks demonstrate the significantly advantageous\nperformance of our TinySAM against counterpart methods. Pre-trained models and\ncodes will be available at https://github.com/xinghaochen/TinySAM and\nhttps://gitee.com/mindspore/models/tree/master/research/cv/TinySAM.\n","authors":["Han Shu","Wenshuo Li","Yehui Tang","Yiman Zhang","Yihao Chen","Houqiang Li","Yunhe Wang","Xinghao Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13783v1","updated":"2023-12-21T12:14:31Z","published":"2023-12-21T12:14:31Z","title":"Few Shot Part Segmentation Reveals Compositional Logic for Industrial\n Anomaly Detection","summary":" Logical anomalies (LA) refer to data violating underlying logical constraints\ne.g., the quantity, arrangement, or composition of components within an image.\nDetecting accurately such anomalies requires models to reason about various\ncomponent types through segmentation. However, curation of pixel-level\nannotations for semantic segmentation is both time-consuming and expensive.\nAlthough there are some prior few-shot or unsupervised co-part segmentation\nalgorithms, they often fail on images with industrial object. These images have\ncomponents with similar textures and shapes, and a precise differentiation\nproves challenging. In this study, we introduce a novel component segmentation\nmodel for LA detection that leverages a few labeled samples and unlabeled\nimages sharing logical constraints. To ensure consistent segmentation across\nunlabeled images, we employ a histogram matching loss in conjunction with an\nentropy loss. As segmentation predictions play a crucial role, we propose to\nenhance both local and global sample validity detection by capturing key\naspects from visual semantics via three memory banks: class histograms,\ncomponent composition embeddings and patch-level representations. For effective\nLA detection, we propose an adaptive scaling strategy to standardize anomaly\nscores from different memory banks in inference. Extensive experiments on the\npublic benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA\ndetection vs. 89.6% from competing methods.\n","authors":["Soopil Kim","Sion An","Philip Chikontwe","Myeongkyun Kang","Ehsan Adeli","Kilian M. Pohl","Sanghyun Park"],"pdf_url":"https://arxiv.org/pdf/2312.13783v1.pdf","comment":"Accepted at AAAI2024"},{"id":"http://arxiv.org/abs/2312.13778v1","updated":"2023-12-21T12:08:27Z","published":"2023-12-21T12:08:27Z","title":"Progressive Evolution from Single-Point to Polygon for Scene Text","summary":" The advancement of text shape representations towards compactness has\nenhanced text detection and spotting performance, but at a high annotation\ncost. Current models use single-point annotations to reduce costs, yet they\nlack sufficient localization information for downstream applications. To\novercome this limitation, we introduce Point2Polygon, which can efficiently\ntransform single-points into compact polygons. Our method uses a coarse-to-fine\nprocess, starting with creating and selecting anchor points based on\nrecognition confidence, then vertically and horizontally refining the polygon\nusing recognition information to optimize its shape. We demonstrate the\naccuracy of the generated polygons through extensive experiments: 1) By\ncreating polygons from ground truth points, we achieved an accuracy of 82.0% on\nICDAR 2015; 2) In training detectors with polygons generated by our method, we\nattained 86% of the accuracy relative to training with ground truth (GT); 3)\nAdditionally, the proposed Point2Polygon can be seamlessly integrated to\nempower single-point spotters to generate polygons. This integration led to an\nimpressive 82.5% accuracy for the generated polygons. It is worth mentioning\nthat our method relies solely on synthetic recognition information, eliminating\nthe need for any manual annotation beyond single points.\n","authors":["Linger Deng","Mingxin Huang","Xudong Xie","Yuliang Liu","Lianwen Jin","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2312.13778v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13776v1","updated":"2023-12-21T12:05:01Z","published":"2023-12-21T12:05:01Z","title":"Pose-based Tremor Type and Level Analysis for Parkinson's Disease from\n Video","summary":" Purpose:Current methods for diagnosis of PD rely on clinical examination. The\naccuracy of diagnosis ranges between 73% and 84%, and is influenced by the\nexperience of the clinical assessor. Hence, an automatic, effective and\ninterpretable supporting system for PD symptom identification would support\nclinicians in making more robust PD diagnostic decisions. Methods: We propose\nto analyze Parkinson's tremor (PT) to support the analysis of PD, since PT is\none of the most typical symptoms of PD with broad generalizability. To realize\nthe idea, we present SPA-PTA, a deep learning-based PT classification and\nseverity estimation system that takes consumer-grade videos of front-facing\nhumans as input. The core of the system is a novel attention module with a\nlightweight pyramidal channel-squeezing-fusion architecture that effectively\nextracts relevant PT information and filters noise. It enhances modeling\nperformance while improving system interpretability. Results:We validate our\nsystem via individual-based leave-one-out cross-validation on two tasks: the PT\nclassification task and the tremor severity rating estimation task. Our system\npresents a 91.3% accuracy and 80.0% F1-score in classifying PT with non-PT\nclass, while providing a 76.4% accuracy and 76.7% F1-score in more complex\nmulticlass tremor rating classification task. Conclusion: Our system offers a\ncost-effective PT classification and tremor severity estimation results as\nwarning signs of PD for undiagnosed patients with PT symptoms. In addition, it\nprovides a potential solution for supporting PD diagnosis in regions with\nlimited clinical resources.\n","authors":["Haozheng Zhang","Edmond S. L. Ho","Xiatian Zhang","Silvia Del Din","Hubert P. H. Shum"],"pdf_url":"https://arxiv.org/pdf/2312.13776v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.05807v2","updated":"2023-12-21T11:59:11Z","published":"2023-05-09T23:40:23Z","title":"Even Small Correlation and Diversity Shifts Pose Dataset-Bias Issues","summary":" Distribution shifts are common in real-world datasets and can affect the\nperformance and reliability of deep learning models. In this paper, we study\ntwo types of distribution shifts: diversity shifts, which occur when test\nsamples exhibit patterns unseen during training, and correlation shifts, which\noccur when test data present a different correlation between seen invariant and\nspurious features. We propose an integrated protocol to analyze both types of\nshifts using datasets where they co-exist in a controllable manner. Finally, we\napply our approach to a real-world classification problem of skin cancer\nanalysis, using out-of-distribution datasets and specialized bias annotations.\nOur protocol reveals three findings: 1) Models learn and propagate correlation\nshifts even with low-bias training; this poses a risk of accumulating and\ncombining unaccountable weak biases; 2) Models learn robust features in high-\nand low-bias scenarios but use spurious ones if test samples have them; this\nsuggests that spurious correlations do not impair the learning of robust\nfeatures; 3) Diversity shift can reduce the reliance on spurious correlations;\nthis is counter intuitive since we expect biased models to depend more on\nbiases when invariant features are missing. Our work has implications for\ndistribution shift research and practice, providing new insights into how\nmodels learn and rely on spurious correlations under different types of shifts.\n","authors":["Alceu Bissoto","Catarina Barata","Eduardo Valle","Sandra Avila"],"pdf_url":"https://arxiv.org/pdf/2305.05807v2.pdf","comment":"Paper under consideration at Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2308.08746v2","updated":"2023-12-21T11:56:08Z","published":"2023-08-17T02:51:01Z","title":"SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation","summary":" The Segment Anything Model (SAM) is a powerful foundation model that has\nrevolutionised image segmentation. To apply SAM to surgical instrument\nsegmentation, a common approach is to locate precise points or boxes of\ninstruments and then use them as prompts for SAM in a zero-shot manner.\nHowever, we observe two problems with this naive pipeline: (1) the domain gap\nbetween natural objects and surgical instruments leads to inferior\ngeneralisation of SAM; and (2) SAM relies on precise point or box locations for\naccurate segmentation, requiring either extensive manual guidance or a\nwell-performing specialist detector for prompt preparation, which leads to a\ncomplex multi-stage pipeline. To address these problems, we introduce\nSurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to\neffectively integrate surgical-specific information with SAM's pre-trained\nknowledge for improved generalisation. Specifically, we propose a lightweight\nprototype-based class prompt encoder for tuning, which directly generates\nprompt embeddings from class prototypes and eliminates the use of explicit\nprompts for improved robustness and a simpler pipeline. In addition, to address\nthe low inter-class variance among surgical instrument categories, we propose\ncontrastive prototype learning, further enhancing the discrimination of the\nclass prototypes for more accurate class prompting. The results of extensive\nexperiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that\nSurgicalSAM achieves state-of-the-art performance while only requiring a small\nnumber of tunable parameters. The source code is available at\nhttps://github.com/wenxi-yue/SurgicalSAM.\n","authors":["Wenxi Yue","Jing Zhang","Kun Hu","Yong Xia","Jiebo Luo","Zhiyong Wang"],"pdf_url":"https://arxiv.org/pdf/2308.08746v2.pdf","comment":"AAAI2024. The source code is available at\n https://github.com/wenxi-yue/SurgicalSAM"},{"id":"http://arxiv.org/abs/2312.13771v1","updated":"2023-12-21T11:52:45Z","published":"2023-12-21T11:52:45Z","title":"AppAgent: Multimodal Agents as Smartphone Users","summary":" Recent advancements in large language models (LLMs) have led to the creation\nof intelligent agents capable of performing complex tasks. This paper\nintroduces a novel LLM-based multimodal agent framework designed to operate\nsmartphone applications. Our framework enables the agent to operate smartphone\napplications through a simplified action space, mimicking human-like\ninteractions such as tapping and swiping. This novel approach bypasses the need\nfor system back-end access, thereby broadening its applicability across diverse\napps. Central to our agent's functionality is its innovative learning method.\nThe agent learns to navigate and use new apps either through autonomous\nexploration or by observing human demonstrations. This process generates a\nknowledge base that the agent refers to for executing complex tasks across\ndifferent applications. To demonstrate the practicality of our agent, we\nconducted extensive testing over 50 tasks in 10 different applications,\nincluding social media, email, maps, shopping, and sophisticated image editing\ntools. The results affirm our agent's proficiency in handling a diverse array\nof high-level tasks.\n","authors":["Zhao Yang","Jiaxuan Liu","Yucheng Han","Xin Chen","Zebiao Huang","Bin Fu","Gang Yu"],"pdf_url":"https://arxiv.org/pdf/2312.13771v1.pdf","comment":"10 pages, 3 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.13770v1","updated":"2023-12-21T11:50:49Z","published":"2023-12-21T11:50:49Z","title":"3D Points Splatting for Real-Time Dynamic Hand Reconstruction","summary":" We present 3D Points Splatting Hand Reconstruction (3D-PSHR), a real-time and\nphoto-realistic hand reconstruction approach. We propose a self-adaptive\ncanonical points upsampling strategy to achieve high-resolution hand geometry\nrepresentation. This is followed by a self-adaptive deformation that deforms\nthe hand from the canonical space to the target pose, adapting to the dynamic\nchanging of canonical points which, in contrast to the common practice of\nsubdividing the MANO model, offers greater flexibility and results in improved\ngeometry fitting. To model texture, we disentangle the appearance color into\nthe intrinsic albedo and pose-aware shading, which are learned through a\nContext-Attention module. Moreover, our approach allows the geometric and the\nappearance models to be trained simultaneously in an end-to-end manner. We\ndemonstrate that our method is capable of producing animatable, photorealistic\nand relightable hand reconstructions using multiple datasets, including\nmonocular videos captured with handheld smartphones and large-scale multi-view\nvideos featuring various hand poses. We also demonstrate that our approach\nachieves real-time rendering speeds while simultaneously maintaining superior\nperformance compared to existing state-of-the-art methods.\n","authors":["Zheheng Jiang","Hossein Rahmani","Sue Black","Bryan M. Williams"],"pdf_url":"https://arxiv.org/pdf/2312.13770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.00400v2","updated":"2023-12-21T11:50:24Z","published":"2022-05-01T05:30:53Z","title":"Unleashing the Potential of Adjacent Snippets for Weakly-supervised\n Temporal Action Localization","summary":" Weakly-supervised temporal action localization (WTAL) intends to detect\naction instances with only weak supervision, \\eg, video-level labels. The\ncurrent~\\textit{de facto} pipeline locates action instances by thresholding and\ngrouping continuous high-score regions on temporal class activation sequences.\nIn this route, the capacity of the model to recognize the relationships between\nadjacent snippets is of vital importance which determines the quality of the\naction boundaries. However, it is error-prone since the variations between\nadjacent snippets are typically subtle, and unfortunately this is overlooked in\nthe literature. To tackle the issue, we propose a novel WTAL approach named\nConvex Combination Consistency between Neighbors (C$^3$BN). C$^3$BN consists of\ntwo key ingredients: a micro data augmentation strategy that increases the\ndiversity in-between adjacent snippets by convex combination of adjacent\nsnippets, and a macro-micro consistency regularization that enforces the model\nto be invariant to the transformations~\\textit{w.r.t.} video semantics, snippet\npredictions, and snippet representations. Consequently, fine-grained patterns\nin-between adjacent snippets are enforced to be explored, thereby resulting in\na more robust action boundary localization. Experimental results demonstrate\nthe effectiveness of C$^3$BN on top of various baselines for WTAL with\nvideo-level and point-level supervisions. Code is at\nhttps://github.com/Qinying-Liu/C3BN.\n","authors":["Qinying Liu","Zilei Wang","Ruoxi Chen","Zhilin Li"],"pdf_url":"https://arxiv.org/pdf/2205.00400v2.pdf","comment":"ICME2023"},{"id":"http://arxiv.org/abs/2312.13764v1","updated":"2023-12-21T11:43:41Z","published":"2023-12-21T11:43:41Z","title":"A Semantic Space is Worth 256 Language Descriptions: Make Stronger\n Segmentation Models with Descriptive Properties","summary":" This paper introduces ProLab, a novel approach using property-level label\nspace for creating strong interpretable segmentation models. Instead of relying\nsolely on category-specific annotations, ProLab uses descriptive properties\ngrounded in common sense knowledge for supervising segmentation models. It is\nbased on two core designs. First, we employ Large Language Models (LLMs) and\ncarefully crafted prompts to generate descriptions of all involved categories\nthat carry meaningful common sense knowledge and follow a structured format.\nSecond, we introduce a description embedding model preserving semantic\ncorrelation across descriptions and then cluster them into a set of descriptive\nproperties (e.g., 256) using K-Means. These properties are based on\ninterpretable common sense knowledge consistent with theories of human\nrecognition. We empirically show that our approach makes segmentation models\nperform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal\nContext, Cityscapes, and BDD). Our method also shows better scalability with\nextended training steps than category-level supervision. Our interpretable\nsegmentation framework also emerges with the generalization ability to segment\nout-of-domain or unknown categories using only in-domain descriptive\nproperties. Code is available at https://github.com/lambert-x/ProLab.\n","authors":["Junfei Xiao","Ziqi Zhou","Wenxuan Li","Shiyi Lan","Jieru Mei","Zhiding Yu","Alan Yuille","Yuyin Zhou","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13764v1.pdf","comment":"Preprint. Code is available at https://github.com/lambert-x/ProLab"},{"id":"http://arxiv.org/abs/2312.13763v1","updated":"2023-12-21T11:41:02Z","published":"2023-12-21T11:41:02Z","title":"Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed\n Diffusion Models","summary":" Text-guided diffusion models have revolutionized image and video generation\nand have also been successfully used for optimization-based 3D object\nsynthesis. Here, we instead focus on the underexplored text-to-4D setting and\nsynthesize dynamic, animated 3D objects using score distillation methods with\nan additional temporal dimension. Compared to previous work, we pursue a novel\ncompositional generation-based approach, and combine text-to-image,\ntext-to-video, and 3D-aware multiview diffusion models to provide feedback\nduring 4D object optimization, thereby simultaneously enforcing temporal\nconsistency, high-quality visual appearance and realistic geometry. Our method,\ncalled Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with\ndeformation fields as 4D representation. Crucial to AYG is a novel method to\nregularize the distribution of the moving 3D Gaussians and thereby stabilize\nthe optimization and induce motion. We also propose a motion amplification\nmechanism as well as a new autoregressive synthesis scheme to generate and\ncombine multiple 4D sequences for longer generation. These techniques allow us\nto synthesize vivid dynamic scenes, outperform previous work qualitatively and\nquantitatively and achieve state-of-the-art text-to-4D performance. Due to the\nGaussian 4D representation, different 4D animations can be seamlessly combined,\nas we demonstrate. AYG opens up promising avenues for animation, simulation and\ndigital content creation as well as synthetic data generation.\n","authors":["Huan Ling","Seung Wook Kim","Antonio Torralba","Sanja Fidler","Karsten Kreis"],"pdf_url":"https://arxiv.org/pdf/2312.13763v1.pdf","comment":"Project page:\n https://research.nvidia.com/labs/toronto-ai/AlignYourGaussians/"},{"id":"http://arxiv.org/abs/2305.04743v3","updated":"2023-12-21T11:40:35Z","published":"2023-05-01T02:58:48Z","title":"MARS: Mask Attention Refinement with Sequential Quadtree Nodes for Car\n Damage Instance Segmentation","summary":" Evaluating car damages from misfortune is critical to the car insurance\nindustry. However, the accuracy is still insufficient for real-world\napplications since the deep learning network is not designed for car damage\nimages as inputs, and its segmented masks are still very coarse. This paper\npresents MARS (Mask Attention Refinement with Sequential quadtree nodes) for\ncar damage instance segmentation. Our MARS represents self-attention mechanisms\nto draw global dependencies between the sequential quadtree nodes layer and\nquadtree transformer to recalibrate channel weights and predict highly accurate\ninstance masks. Our extensive experiments demonstrate that MARS outperforms\nstate-of-the-art (SOTA) instance segmentation methods on three popular\nbenchmarks such as Mask R-CNN [9], PointRend [13], and Mask Transfiner [12], by\na large margin of +1.3 maskAP-based R50-FPN backbone and +2.3 maskAP-based\nR101-FPN backbone on Thai car-damage dataset. Our demos are available at\nhttps://github.com/kaopanboonyuen/MARS.\n","authors":["Teerapong Panboonyuen","Naphat Nithisopa","Panin Pienroj","Laphonchai Jirachuphun","Chaiwasut Watthanasirikrit","Naruepon Pornwiriyakul"],"pdf_url":"https://arxiv.org/pdf/2305.04743v3.pdf","comment":"12 pages. arXiv admin note: substantial text overlap with\n arXiv:2111.13673 by other authors"},{"id":"http://arxiv.org/abs/2312.13752v1","updated":"2023-12-21T11:33:10Z","published":"2023-12-21T11:33:10Z","title":"Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the\n AIIB23 challenge","summary":" Airway-related quantitative imaging biomarkers are crucial for examination,\ndiagnosis, and prognosis in pulmonary diseases. However, the manual delineation\nof airway trees remains prohibitively time-consuming. While significant efforts\nhave been made towards enhancing airway modelling, current public-available\ndatasets concentrate on lung diseases with moderate morphological variations.\nThe intricate honeycombing patterns present in the lung tissues of fibrotic\nlung disease patients exacerbate the challenges, often leading to various\nprediction errors. To address this issue, the 'Airway-Informed Quantitative CT\nImaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was\norganized in conjunction with the official 2023 International Conference on\nMedical Image Computing and Computer Assisted Intervention (MICCAI). The airway\nstructures were meticulously annotated by three experienced radiologists.\nCompetitors were encouraged to develop automatic airway segmentation models\nwith high robustness and generalization abilities, followed by exploring the\nmost correlated QIB of mortality prediction. A training set of 120\nhigh-resolution computerised tomography (HRCT) scans were publicly released\nwith expert annotations and mortality status. The online validation set\nincorporated 52 HRCT scans from patients with fibrotic lung disease and the\noffline test set included 140 cases from fibrosis and COVID-19 patients. The\nresults have shown that the capacity of extracting airway trees from patients\nwith fibrotic lung disease could be enhanced by introducing voxel-wise weighted\ngeneral union loss and continuity loss. In addition to the competitive image\nbiomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5,\np<0.0001) was revealed for survival prognostication compared with existing\nclinical measurements, clinician assessment and AI-based biomarkers.\n","authors":["Yang Nan","Xiaodan Xing","Shiyi Wang","Zeyu Tang","Federico N Felder","Sheng Zhang","Roberta Eufrasia Ledda","Xiaoliu Ding","Ruiqi Yu","Weiping Liu","Feng Shi","Tianyang Sun","Zehong Cao","Minghui Zhang","Yun Gu","Hanxiao Zhang","Jian Gao","Wen Tang","Pengxin Yu","Han Kang","Junqiang Chen","Xing Lu","Boyu Zhang","Michail Mamalakis","Francesco Prinzi","Gianluca Carlini","Lisa Cuneo","Abhirup Banerjee","Zhaohu Xing","Lei Zhu","Zacharia Mesbah","Dhruv Jain","Tsiry Mayet","Hongyu Yuan","Qing Lyu","Athol Wells","Simon LF Walsh","Guang Yang"],"pdf_url":"https://arxiv.org/pdf/2312.13752v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2210.15136v2","updated":"2023-12-21T11:31:38Z","published":"2022-10-27T02:51:24Z","title":"3D Shape Knowledge Graph for Cross-domain 3D Shape Retrieval","summary":" The surge in 3D modeling has led to a pronounced research emphasis on the\nfield of 3D shape retrieval. Numerous contemporary approaches have been put\nforth to tackle this intricate challenge. Nevertheless, effectively addressing\nthe intricacies of cross-modal 3D shape retrieval remains a formidable\nundertaking, owing to inherent modality-based disparities. This study presents\nan innovative notion, termed \"geometric words\", which functions as elemental\nconstituents for representing entities through combinations. To establish the\nknowledge graph, we employ geometric words as nodes, connecting them via shape\ncategories and geometry attributes. Subsequently, we devise a unique graph\nembedding method for knowledge acquisition. Finally, an effective similarity\nmeasure is introduced for retrieval purposes. Importantly, each 3D or 2D entity\ncan anchor its geometric terms within the knowledge graph, thereby serving as a\nlink between cross-domain data. As a result, our approach facilitates multiple\ncross-domain 3D shape retrieval tasks. We evaluate the proposed method's\nperformance on the ModelNet40 and ShapeNetCore55 datasets, encompassing\nscenarios related to 3D shape retrieval and cross-domain retrieval.\nFurthermore, we employ the established cross-modal dataset (MI3DOR) to assess\ncross-modal 3D shape retrieval. The resulting experimental outcomes, in\nconjunction with comparisons against state-of-the-art techniques, clearly\nhighlight the superiority of our approach.\n","authors":["Rihao Chang","Yongtao Ma","Tong Hao","Weizhi Nie"],"pdf_url":"https://arxiv.org/pdf/2210.15136v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13746v1","updated":"2023-12-21T11:30:02Z","published":"2023-12-21T11:30:02Z","title":"Video Recognition in Portrait Mode","summary":" The creation of new datasets often presents new challenges for video\nrecognition and can inspire novel ideas while addressing these challenges.\nWhile existing datasets mainly comprise landscape mode videos, our paper seeks\nto introduce portrait mode videos to the research community and highlight the\nunique challenges associated with this video format. With the growing\npopularity of smartphones and social media applications, recognizing portrait\nmode videos is becoming increasingly important. To this end, we have developed\nthe first dataset dedicated to portrait mode video recognition, namely\nPortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a\ndata-driven manner, comprising 400 fine-grained categories, and rigorous\nquality assurance was implemented to ensure the accuracy of human annotations.\nIn addition to the new dataset, we conducted a comprehensive analysis of the\nimpact of video format (portrait mode versus landscape mode) on recognition\naccuracy and spatial bias due to the different formats. Furthermore, we\ndesigned extensive experiments to explore key aspects of portrait mode video\nrecognition, including the choice of data augmentation, evaluation procedure,\nthe importance of temporal information, and the role of audio modality.\nBuilding on the insights from our experimental results and the introduction of\nPortraitMode-400, our paper aims to inspire further research efforts in this\nemerging research area.\n","authors":["Mingfei Han","Linjie Yang","Xiaojie Jin","Jiashi Feng","Xiaojun Chang","Heng Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13746v1.pdf","comment":"See mingfei.info/PMV for data and code information"},{"id":"http://arxiv.org/abs/2308.01196v2","updated":"2023-12-21T11:27:00Z","published":"2023-07-27T22:57:55Z","title":"Sustainable Transparency in Recommender Systems: Bayesian Ranking of\n Images for Explainability","summary":" Recommender Systems have become crucial in the modern world, commonly guiding\nusers towards relevant content or products, and having a large influence over\nthe decisions of users and citizens. However, ensuring transparency and user\ntrust in these systems remains a challenge; personalized explanations have\nemerged as a solution, offering justifications for recommendations. Among the\nexisting approaches for generating personalized explanations, using existing\nvisual content created by users is a promising option to maximize transparency\nand user trust. State-of-the-art models that follow this approach, despite\nleveraging highly optimized architectures, employ surrogate learning tasks that\ndo not efficiently model the objective of ranking images as explanations for a\ngiven recommendation; this leads to a suboptimal training process with high\ncomputational costs that may not be reduced without affecting model\nperformance. This work presents BRIE, a novel model where we leverage Bayesian\nPairwise Ranking to enhance the training process, allowing us to consistently\noutperform state-of-the-art models in six real-world datasets while reducing\nits model size by up to 64 times and its CO${_2}$ emissions by up to 75% in\ntraining and inference.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Berta Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2308.01196v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.13495v2","updated":"2023-12-21T11:01:09Z","published":"2022-11-24T09:34:20Z","title":"Few-shot Object Detection with Refined Contrastive Learning","summary":" Due to the scarcity of sampling data in reality, few-shot object detection\n(FSOD) has drawn more and more attention because of its ability to quickly\ntrain new detection concepts with less data. However, there are still failure\nidentifications due to the difficulty in distinguishing confusable classes. We\nalso notice that the high standard deviation of average precision reveals the\ninconsistent detection performance. To this end, we propose a novel FSOD method\nwith Refined Contrastive Learning (FSRC). A pre-determination component is\nintroduced to find out the Resemblance Group from novel classes which contains\nconfusable classes. Afterwards, Refined Contrastive Learning (RCL) is pointedly\nperformed on this group of classes in order to increase the inter-class\ndistances among them. In the meantime, the detection results distribute more\nuniformly which further improve the performance. Experimental results based on\nPASCAL VOC and COCO datasets demonstrate our proposed method outperforms the\ncurrent state-of-the-art research.\n","authors":["Zeyu Shangguan","Lian Huai","Tong Liu","Xingqun Jiang"],"pdf_url":"https://arxiv.org/pdf/2211.13495v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13735v1","updated":"2023-12-21T10:59:17Z","published":"2023-12-21T10:59:17Z","title":"DECO: Query-Based End-to-End Object Detection with ConvNets","summary":" Detection Transformer (DETR) and its variants have shown great potential for\naccurate object detection in recent years. The mechanism of object query\nenables DETR family to directly obtain a fixed number of object predictions and\nstreamlines the detection pipeline. Meanwhile, recent studies also reveal that\nwith proper architecture design, convolution networks (ConvNets) also achieve\ncompetitive performance with transformers, \\eg, ConvNeXt. To this end, in this\npaper we explore whether we could build a query-based end-to-end object\ndetection framework with ConvNets instead of sophisticated transformer\narchitecture. The proposed framework, \\ie, Detection ConvNet (DECO), is\ncomposed of a backbone and convolutional encoder-decoder architecture. We\ncarefully design the DECO encoder and propose a novel mechanism for our DECO\ndecoder to perform interaction between object queries and image features via\nconvolutional layers. We compare the proposed DECO against prior detectors on\nthe challenging COCO benchmark. Despite its simplicity, our DECO achieves\ncompetitive performance in terms of detection accuracy and running speed.\nSpecifically, with the ResNet-50 and ConvNeXt-Tiny backbone, DECO obtains\n$38.6\\%$ and $40.8\\%$ AP on COCO \\textit{val} set with $35$ and $28$ FPS\nrespectively and outperforms the DETR model. Incorporated with advanced\nmulti-scale feature module, our DECO+ achieves $47.8\\%$ AP with $34$ FPS. We\nhope the proposed DECO brings another perspective for designing object\ndetection framework.\n","authors":["Xinghao Chen","Siwei Li","Yijing Yang","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13735v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13729v1","updated":"2023-12-21T10:52:59Z","published":"2023-12-21T10:52:59Z","title":"Gaussian Splitting Algorithm with Color and Opacity Depended on Viewing\n Direction","summary":" Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of\nneural networks to capture the intricacies of 3D objects. By encoding the shape\nand color information within neural network weights, NeRFs excel at producing\nstrikingly sharp novel views of 3D objects. Recently, numerous generalizations\nof NeRFs utilizing generative models have emerged, expanding its versatility.\nIn contrast, Gaussian Splatting (GS) offers a similar renders quality with\nfaster training and inference as it does not need neural networks to work. We\nencode information about the 3D objects in the set of Gaussian distributions\nthat can be rendered in 3D similarly to classical meshes. Unfortunately, GS are\ndifficult to condition since they usually require circa hundred thousand\nGaussian components. To mitigate the caveats of both models, we propose a\nhybrid model that uses GS representation of the 3D object's shape and\nNeRF-based encoding of color and opacity. Our model uses Gaussian distributions\nwith trainable positions (i.e. means of Gaussian), shape (i.e. covariance of\nGaussian), color and opacity, and neural network, which takes parameters of\nGaussian and viewing direction to produce changes in color and opacity.\nConsequently, our model better describes shadows, light reflections, and\ntransparency of 3D objects.\n","authors":["Dawid Malarz","Weronika Smolak","Jacek Tabor","Sławomir Tadeja","Przemysław Spurek"],"pdf_url":"https://arxiv.org/pdf/2312.13729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13714v1","updated":"2023-12-21T10:27:52Z","published":"2023-12-21T10:27:52Z","title":"Bootstrap Masked Visual Modeling via Hard Patches Mining","summary":" Masked visual modeling has attracted much attention due to its promising\npotential in learning generalizable representations. Typical approaches urge\nmodels to predict specific contents of masked tokens, which can be intuitively\nconsidered as teaching a student (the model) to solve given problems\n(predicting masked contents). Under such settings, the performance is highly\ncorrelated with mask strategies (the difficulty of provided problems). We argue\nthat it is equally important for the model to stand in the shoes of a teacher\nto produce challenging problems by itself. Intuitively, patches with high\nvalues of reconstruction loss can be regarded as hard samples, and masking\nthose hard patches naturally becomes a demanding reconstruction task. To\nempower the model as a teacher, we propose Hard Patches Mining (HPM),\npredicting patch-wise losses and subsequently determining where to mask.\nTechnically, we introduce an auxiliary loss predictor, which is trained with a\nrelative objective to prevent overfitting to exact loss values. Also, to\ngradually guide the training procedure, we propose an easy-to-hard mask\nstrategy. Empirically, HPM brings significant improvements under both image and\nvideo benchmarks. Interestingly, solely incorporating the extra loss prediction\nobjective leads to better representations, verifying the efficacy of\ndetermining where is hard to reconstruct. The code is available at\nhttps://github.com/Haochen-Wang409/HPM.\n","authors":["Haochen Wang","Junsong Fan","Yuxi Wang","Kaiyou Song","Tiancai Wang","Xiangyu Zhang","Zhaoxiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13714v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2304.05919"},{"id":"http://arxiv.org/abs/2307.15588v2","updated":"2023-12-21T09:47:19Z","published":"2023-07-28T14:43:27Z","title":"OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic\n Segmentation","summary":" Light field cameras, by harnessing the power of micro-lens array, are capable\nof capturing intricate angular and spatial details. This allows for acquiring\ncomplex light patterns and details from multiple angles, significantly\nenhancing the precision of image semantic segmentation, a critical aspect of\nscene interpretation in vision intelligence. However, the extensive angular\ninformation of light field cameras contains a large amount of redundant data,\nwhich is overwhelming for the limited hardware resources of intelligent\nvehicles. Besides, inappropriate compression leads to information corruption\nand data loss. To excavate representative information, we propose a new\nparadigm, Omni-Aperture Fusion model (OAFuser), which leverages dense context\nfrom the central view and discovers the angular information from sub-aperture\nimages to generate a semantically consistent result. To avoid feature loss\nduring network propagation and simultaneously streamline the redundant\ninformation from the light field camera, we present a simple yet very effective\nSub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular\nfeatures without any additional memory cost. Furthermore, to address the\nmismatched spatial information across viewpoints, we present a Center Angular\nRectification Module (CARM) to realize feature resorting and prevent feature\nocclusion caused by asymmetric information. Our proposed OAFuser achieves\nstate-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a\nnew record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain\nof +4.53%. The source code of OAFuser will be available at\nhttps://github.com/FeiBryantkit/OAFuser.\n","authors":["Fei Teng","Jiaming Zhang","Kunyu Peng","Yaonan Wang","Rainer Stiefelhagen","Kailun Yang"],"pdf_url":"https://arxiv.org/pdf/2307.15588v2.pdf","comment":"The source code of OAFuser will be made publicly available at\n https://github.com/FeiBryantkit/OAFuser"},{"id":"http://arxiv.org/abs/2312.09709v2","updated":"2023-12-21T09:40:00Z","published":"2023-12-15T11:32:11Z","title":"ParsNets: A Parsimonious Orthogonal and Low-Rank Linear Networks for\n Zero-Shot Learning","summary":" This paper provides a novel parsimonious yet efficient design for zero-shot\nlearning (ZSL), dubbed ParsNets, where we are interested in learning a\ncomposition of on-device friendly linear networks, each with orthogonality and\nlow-rankness properties, to achieve equivalent or even better performance\nagainst existing deep models. Concretely, we first refactor the core module of\nZSL, i.e., visual-semantics mapping function, into several base linear networks\nthat correspond to diverse components of the semantic space, where the complex\nnonlinearity can be collapsed into simple local linearities. Then, to\nfacilitate the generalization of local linearities, we construct a maximal\nmargin geometry on the learned features by enforcing low-rank constraints on\nintra-class samples and high-rank constraints on inter-class samples, resulting\nin orthogonal subspaces for different classes and each subspace lies on a\ncompact manifold. To enhance the model's adaptability and counterbalance\nover/under-fittings in ZSL, a set of sample-wise indicators is employed to\nselect a sparse subset from these base linear networks to form a composite\nsemantic predictor for each sample. Notably, maximal margin geometry can\nguarantee the diversity of features, and meanwhile, local linearities guarantee\nefficiency. Thus, our ParsNets can generalize better to unseen classes and can\nbe deployed flexibly on resource-constrained devices. Theoretical explanations\nand extensive experiments are conducted to verify the effectiveness of the\nproposed method.\n","authors":["Jingcai Guo","Qihua Zhou","Ruibing Li","Xiaocheng Lu","Ziming Liu","Junyang Chen","Xin Xie","Jie Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.09709v2.pdf","comment":"10 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.13691v1","updated":"2023-12-21T09:37:14Z","published":"2023-12-21T09:37:14Z","title":"DreamTuner: Single Image is Enough for Subject-Driven Generation","summary":" Diffusion-based models have demonstrated impressive capabilities for\ntext-to-image generation and are expected for personalized applications of\nsubject-driven generation, which require the generation of customized concepts\nwith one or a few reference images. However, existing methods based on\nfine-tuning fail to balance the trade-off between subject learning and the\nmaintenance of the generation capabilities of pretrained models. Moreover,\nother methods that utilize additional image encoders tend to lose important\ndetails of the subject due to encoding compression. To address these\nchallenges, we propose DreamTurner, a novel method that injects reference\ninformation from coarse to fine to achieve subject-driven image generation more\neffectively. DreamTurner introduces a subject-encoder for coarse subject\nidentity preservation, where the compressed general subject features are\nintroduced through an attention layer before visual-text cross-attention. We\nthen modify the self-attention layers within pretrained text-to-image models to\nself-subject-attention layers to refine the details of the target subject. The\ngenerated image queries detailed features from both the reference image and\nitself in self-subject-attention. It is worth emphasizing that\nself-subject-attention is an effective, elegant, and training-free method for\nmaintaining the detailed features of customized subjects and can serve as a\nplug-and-play solution during inference. Finally, with additional\nsubject-driven fine-tuning, DreamTurner achieves remarkable performance in\nsubject-driven image generation, which can be controlled by a text or other\nconditions such as pose. For further details, please visit the project page at\nhttps://dreamtuner-diffusion.github.io/.\n","authors":["Miao Hua","Jiawei Liu","Fei Ding","Wei Liu","Jie Wu","Qian He"],"pdf_url":"https://arxiv.org/pdf/2312.13691v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12635v2","updated":"2023-12-21T09:20:40Z","published":"2023-12-19T22:33:42Z","title":"RealCraft: Attention Control as A Solution for Zero-shot Long Video\n Editing","summary":" Although large-scale text-to-image generative models have shown promising\nperformance in synthesizing high-quality images, directly applying these models\nto image editing remains a significant challenge. This challenge is further\namplified in video editing due to the additional dimension of time. Especially\nfor editing real videos as it necessitates maintaining a stable semantic layout\nacross the frames while executing localized edits precisely without disrupting\nthe existing backgrounds. In this paper, we propose RealCraft, an\nattention-control-based method for zero-shot editing in real videos. By\nemploying the object-centric manipulation of cross-attention between prompts\nand frames and spatial-temporal attention within the frames, we achieve precise\nshape-wise editing along with enhanced consistency. Our model can be used\ndirectly with Stable Diffusion and operates without the need for additional\nlocalized information. We showcase our zero-shot attention-control-based method\nacross a range of videos, demonstrating localized, high-fidelity, shape-precise\nand time-consistent editing in videos of various lengths, up to 64 frames.\n","authors":["Shutong Jin","Ruiyu Wang","Florian T. Pokorny"],"pdf_url":"https://arxiv.org/pdf/2312.12635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07967v2","updated":"2023-12-21T09:08:34Z","published":"2023-11-14T07:46:03Z","title":"Comparison of two data fusion approaches for land use classification","summary":" Accurate land use maps, describing the territory from an anthropic\nutilisation point of view, are useful tools for land management and planning.\nTo produce them, the use of optical images alone remains limited. It is\ntherefore necessary to make use of several heterogeneous sources, each carrying\ncomplementary or contradictory information due to their imperfections or their\ndifferent specifications. This study compares two different approaches i.e. a\npre-classification and a post-classification fusion approach for combining\nseveral sources of spatial data in the context of land use classification. The\napproaches are applied on authoritative land use data located in the Gers\ndepartment in the southwest of France. Pre-classification fusion, while not\nexplicitly modeling imperfections, has the best final results, reaching an\noverall accuracy of 97% and a macro-mean F1 score of 88%.\n","authors":["Martin Cubaud","Arnaud Le Bris","Laurence Jolivet","Ana-Maria Olteanu-Raimond"],"pdf_url":"https://arxiv.org/pdf/2311.07967v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13663v1","updated":"2023-12-21T08:40:57Z","published":"2023-12-21T08:40:57Z","title":"Free-Editor: Zero-shot Text-driven 3D Scene Editing","summary":" Text-to-Image (T2I) diffusion models have gained popularity recently due to\ntheir multipurpose and easy-to-use nature, e.g. image and video generation as\nwell as editing. However, training a diffusion model specifically for 3D scene\nediting is not straightforward due to the lack of large-scale datasets. To\ndate, editing 3D scenes requires either re-training the model to adapt to\nvarious 3D edited scenes or design-specific methods for each special editing\ntype. Furthermore, state-of-the-art (SOTA) methods require multiple\nsynchronized edited images from the same scene to facilitate the scene editing.\nDue to the current limitations of T2I models, it is very challenging to apply\nconsistent editing effects to multiple images, i.e. multi-view inconsistency in\nediting. This in turn compromises the desired 3D scene editing performance if\nthese images are used. In our work, we propose a novel training-free 3D scene\nediting technique, Free-Editor, which allows users to edit 3D scenes without\nfurther re-training the model during test time. Our proposed method\nsuccessfully avoids the multi-view style inconsistency issue in SOTA methods\nwith the help of a \"single-view editing\" scheme. Specifically, we show that\nediting a particular 3D scene can be performed by only modifying a single view.\nTo this end, we introduce an Edit Transformer that enforces intra-view\nconsistency and inter-view style transfer by utilizing self- and\ncross-attention, respectively. Since it is no longer required to re-train the\nmodel and edit every view in a scene, the editing time, as well as memory\nresources, are reduced significantly, e.g., the runtime being $\\sim \\textbf{20}\n\\times$ faster than SOTA. We have conducted extensive experiments on a wide\nrange of benchmark datasets and achieve diverse editing capabilities with our\nproposed technique.\n","authors":["Nazmul Karim","Umar Khalid","Hasan Iqbal","Jing Hua","Chen Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13663v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01423v2","updated":"2023-12-21T08:39:17Z","published":"2023-06-02T10:29:33Z","title":"Improving Gradient-Trend Identification: Fast-Adaptive Moment Estimation\n with Finance-Inspired Triple Exponential Moving Average","summary":" The performance improvement of deep networks significantly depends on their\noptimizers. With existing optimizers, precise and efficient recognition of the\ngradients trend remains a challenge. Existing optimizers predominantly adopt\ntechniques based on the first-order exponential moving average (EMA), which\nresults in noticeable delays that impede the real-time tracking of gradients\ntrend and consequently yield sub-optimal performance. To overcome this\nlimitation, we introduce a novel optimizer called fast-adaptive moment\nestimation (FAME). Inspired by the triple exponential moving average (TEMA)\nused in the financial domain, FAME leverages the potency of higher-order TEMA\nto improve the precision of identifying gradient trends. TEMA plays a central\nrole in the learning process as it actively influences optimization dynamics;\nthis role differs from its conventional passive role as a technical indicator\nin financial contexts. Because of the introduction of TEMA into the\noptimization process, FAME can identify gradient trends with higher accuracy\nand fewer lag issues, thereby offering smoother and more consistent responses\nto gradient fluctuations compared to conventional first-order EMA. To study the\neffectiveness of our novel FAME optimizer, we conducted comprehensive\nexperiments encompassing six diverse computer-vision benchmarks and tasks,\nspanning detection, classification, and semantic comprehension. We integrated\nFAME into 15 learning architectures and compared its performance with those of\nsix popular optimizers. Results clearly showed that FAME is more robust and\naccurate and provides superior performance stability by minimizing noise (i.e.,\ntrend fluctuations). Notably, FAME achieves higher accuracy levels in\nremarkably fewer training epochs than its counterparts, clearly indicating its\nsignificance for optimizing deep networks in computer-vision tasks.\n","authors":["Roi Peleg","Teddy Lazebnik","Assaf Hoogi"],"pdf_url":"https://arxiv.org/pdf/2306.01423v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13655v1","updated":"2023-12-21T08:29:41Z","published":"2023-12-21T08:29:41Z","title":"Compositional Zero-Shot Learning for Attribute-Based Object Reference in\n Human-Robot Interaction","summary":" Language-enabled robots have been widely studied over the past years to\nenable natural human-robot interaction and teaming in various real-world\napplications. Language-enabled robots must be able to comprehend referring\nexpressions to identify a particular object from visual perception using a set\nof referring attributes extracted from natural language. However, visual\nobservations of an object may not be available when it is referred to, and the\nnumber of objects and attributes may also be unbounded in open worlds. To\naddress the challenges, we implement an attribute-based compositional zero-shot\nlearning method that uses a list of attributes to perform referring expression\ncomprehension in open worlds. We evaluate the approach on two datasets\nincluding the MIT-States and the Clothing 16K. The preliminary experimental\nresults show that our implemented approach allows a robot to correctly identify\nthe objects referred to by human commands.\n","authors":["Peng Gao","Ahmed Jaafar","Brian Reily","Christopher Reardon","Hao Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13655v1.pdf","comment":"Equal contribution from the first two authors"},{"id":"http://arxiv.org/abs/2312.13646v1","updated":"2023-12-21T08:16:26Z","published":"2023-12-21T08:16:26Z","title":"Weakly Supervised Semantic Segmentation for Driving Scenes","summary":" State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS)\nusing image-level labels exhibit severe performance degradation on driving\nscene datasets such as Cityscapes. To address this challenge, we develop a new\nWSSS framework tailored to driving scene datasets. Based on extensive analysis\nof dataset characteristics, we employ Contrastive Language-Image Pre-training\n(CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key\nchallenges: (1) pseudo-masks from CLIP lack in representing small object\nclasses, and (2) these masks contain notable noise. We propose solutions for\neach issue as follows. (1) We devise Global-Local View Training that seamlessly\nincorporates small-scale patches during model training, thereby enhancing the\nmodel's capability to handle small-sized yet critical objects in driving scenes\n(e.g., traffic light). (2) We introduce Consistency-Aware Region Balancing\n(CARB), a novel technique that discerns reliable and noisy regions through\nevaluating the consistency between CLIP masks and segmentation predictions. It\nprioritizes reliable pixels over noisy pixels via adaptive loss weighting.\nNotably, the proposed method achieves 51.8\\% mIoU on the Cityscapes test\ndataset, showcasing its potential as a strong WSSS baseline on driving scene\ndatasets. Experimental results on CamVid and WildDash2 demonstrate the\neffectiveness of our method across diverse datasets, even with small-scale\ndatasets or visually challenging conditions. The code is available at\nhttps://github.com/k0u-id/CARB.\n","authors":["Dongseob Kim","Seungho Lee","Junsuk Choe","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2312.13646v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13641v1","updated":"2023-12-21T08:08:02Z","published":"2023-12-21T08:08:02Z","title":"SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection","summary":" Current 3D object detection methods for indoor scenes mainly follow the\nvoting-and-grouping strategy to generate proposals. However, most methods\nutilize instance-agnostic groupings, such as ball query, leading to\ninconsistent semantic information and inaccurate regression of the proposals.\nTo this end, we propose a novel superpoint grouping network for indoor\nanchor-free one-stage 3D object detection. Specifically, we first adopt an\nunsupervised manner to partition raw point clouds into superpoints, areas with\nsemantic consistency and spatial similarity. Then, we design a geometry-aware\nvoting module that adapts to the centerness in anchor-free detection by\nconstraining the spatial relationship between superpoints and object centers.\nNext, we present a superpoint-based grouping module to explore the consistent\nrepresentation within proposals. This module includes a superpoint attention\nlayer to learn feature interaction between neighboring superpoints, and a\nsuperpoint-voxel fusion layer to propagate the superpoint-level information to\nthe voxel level. Finally, we employ effective multiple matching to capitalize\non the dynamic receptive fields of proposals based on superpoints during the\ntraining. Experimental results demonstrate our method achieves state-of-the-art\nperformance on ScanNet V2, SUN RGB-D, and S3DIS datasets in the indoor\none-stage 3D object detection. Source code is available at\nhttps://github.com/zyrant/SPGroup3D.\n","authors":["Yun Zhu","Le Hui","Yaqi Shen","Jin Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13641v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2304.08506v6","updated":"2023-12-21T07:49:49Z","published":"2023-04-17T16:02:06Z","title":"When SAM Meets Medical Images: An Investigation of Segment Anything\n Model (SAM) on Multi-phase Liver Tumor Segmentation","summary":" Learning to segmentation without large-scale samples is an inherent\ncapability of human. Recently, Segment Anything Model (SAM) performs the\nsignificant zero-shot image segmentation, attracting considerable attention\nfrom the computer vision community. Here, we investigate the capability of SAM\nfor medical image analysis, especially for multi-phase liver tumor segmentation\n(MPLiTS), in terms of prompts, data resolution, phases. Experimental results\ndemonstrate that there might be a large gap between SAM and expected\nperformance. Fortunately, the qualitative results show that SAM is a powerful\nannotation tool for the community of interactive medical image segmentation.\n","authors":["Chuanfei Hu","Tianyi Xia","Shenghong Ju","Xinde Li"],"pdf_url":"https://arxiv.org/pdf/2304.08506v6.pdf","comment":"Preliminary investigation"},{"id":"http://arxiv.org/abs/2312.13633v1","updated":"2023-12-21T07:49:27Z","published":"2023-12-21T07:49:27Z","title":"Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video\n Grounding","summary":" Temporal Video Grounding (TVG) aims to localize the temporal boundary of a\nspecific segment in an untrimmed video based on a given language query. Since\ndatasets in this domain are often gathered from limited video scenes, models\ntend to overfit to scene-specific factors, which leads to suboptimal\nperformance when encountering new scenes in real-world applications. In a new\nscene, the fine-grained annotations are often insufficient due to the expensive\nlabor cost, while the coarse-grained video-query pairs are easier to obtain.\nThus, to address this issue and enhance model performance on new scenes, we\nexplore the TVG task in an unsupervised domain adaptation (UDA) setting across\nscenes for the first time, where the video-query pairs in the source scene\n(domain) are labeled with temporal boundaries, while those in the target scene\nare not. Under the UDA setting, we introduce a novel Adversarial Multi-modal\nDomain Adaptation (AMDA) method to adaptively adjust the model's scene-related\nknowledge by incorporating insights from the target data. Specifically, we\ntackle the domain gap by utilizing domain discriminators, which help identify\nvaluable scene-related features effective across both domains. Concurrently, we\nmitigate the semantic gap between different modalities by aligning video-query\npairs with related semantics. Furthermore, we employ a mask-reconstruction\napproach to enhance the understanding of temporal semantics within a scene.\nExtensive experiments on Charades-STA, ActivityNet Captions, and YouCook2\ndemonstrate the effectiveness of our proposed method.\n","authors":["Haifeng Huang","Yang Zhao","Zehan Wang","Yan Xia","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.13633v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13632v1","updated":"2023-12-21T07:48:54Z","published":"2023-12-21T07:48:54Z","title":"ProvFL: Client-Driven Interpretability of Global Model Predictions in\n Federated Learning","summary":" Federated Learning (FL) trains a collaborative machine learning model by\naggregating multiple privately trained clients' models over several training\nrounds. Such a long, continuous action of model aggregations poses significant\nchallenges in reasoning about the origin and composition of such a global\nmodel. Regardless of the quality of the global model or if it has a fault,\nunderstanding the model's origin is equally important for debugging,\ninterpretability, and explainability in federated learning. FL application\ndevelopers often question: (1) what clients contributed towards a global model\nand (2) if a global model predicts a label, which clients are responsible for\nit?\n We introduce, neuron provenance, a fine-grained lineage capturing mechanism\nthat tracks the flow of information between the individual participating\nclients in FL and the final global model. We operationalize this concept in\nProvFL that functions on two key principles. First, recognizing that monitoring\nevery neuron of every client's model statically is ineffective and noisy due to\nthe uninterpretable nature of individual neurons, ProvFL dynamically isolates\ninfluential and sensitive neurons in the global model, significantly reducing\nthe search space. Second, as multiple clients' models are fused in each round\nto form a global model, tracking each client's contribution becomes\nchallenging. ProvFL leverages the invertible nature of fusion algorithms to\nprecisely isolate each client's contribution derived from selected neurons.\nWhen asked to localize the clients responsible for the given behavior (i.e.,\nprediction) of the global model, ProvFL successfully localizes them with an\naverage provenance accuracy of 97%. Additionally, ProvFL outperforms the\nstate-of-the-art FL fault localization approach by an average margin of 50%.\n","authors":["Waris Gill","Ali Anwar","Muhammad Ali Gulzar"],"pdf_url":"https://arxiv.org/pdf/2312.13632v1.pdf","comment":"22 pages. For access to the source code used in this study, please\n contact the authors directly"},{"id":"http://arxiv.org/abs/2312.13631v1","updated":"2023-12-21T07:48:38Z","published":"2023-12-21T07:48:38Z","title":"Diff-Oracle: Diffusion Model for Oracle Character Generation with\n Controllable Styles and Contents","summary":" Deciphering the oracle bone script plays a significant role in Chinese\narchaeology and philology. However, it is significantly challenging due to the\nscarcity of oracle character images. To overcome this issue, we propose\nDiff-Oracle, based on diffusion models (DMs), to generate sufficient\ncontrollable oracle characters. In contrast to most DMs that rely on text\nprompts, we incorporate a style encoder to control style information during the\ngeneration process. This encoder extracts style prompts from existing oracle\ncharacter images, where style details are converted from a CLIP model into a\ntext embedding format. Inspired by ControlNet, we introduce a content encoder\nto capture desired content information from content images, ensuring the\nfidelity of character glyphs. To train Diff-Oracle effectively, we propose to\nobtain pixel-level paired oracle character images (i.e., style and content\nimages) by a pre-trained image-to-image translation model. Extensive\nqualitative and quantitative experiments conducted on two benchmark datasets,\nOracle-241 and OBC306, demonstrate that our Diff-Oracle outperforms existing\ngenerative methods in terms of image generation, further enhancing recognition\naccuracy. Source codes will be available.\n","authors":["Jing Li","Qiu-Feng Wang","Kaizhu Huang","Rui Zhang","Siyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13630v1","updated":"2023-12-21T07:48:15Z","published":"2023-12-21T07:48:15Z","title":"MFABA: A More Faithful and Accelerated Boundary-based Attribution Method\n for Deep Neural Networks","summary":" To better understand the output of deep neural networks (DNN), attribution\nbased methods have been an important approach for model interpretability, which\nassign a score for each input dimension to indicate its importance towards the\nmodel outcome. Notably, the attribution methods use the axioms of sensitivity\nand implementation invariance to ensure the validity and reliability of\nattribution results. Yet, the existing attribution methods present challenges\nfor effective interpretation and efficient computation. In this work, we\nintroduce MFABA, an attribution algorithm that adheres to axioms, as a novel\nmethod for interpreting DNN. Additionally, we provide the theoretical proof and\nin-depth analysis for MFABA algorithm, and conduct a large scale experiment.\nThe results demonstrate its superiority by achieving over 101.5142 times faster\nspeed than the state-of-the-art attribution algorithms. The effectiveness of\nMFABA is thoroughly evaluated through the statistical analysis in comparison to\nother methods, and the full implementation package is open-source at:\nhttps://github.com/LMBTough/MFABA\n","authors":["Zhiyu Zhu","Huaming Chen","Jiayu Zhang","Xinyi Wang","Zhibo Jin","Minhui Xue","Dongxiao Zhu","Kim-Kwang Raymond Choo"],"pdf_url":"https://arxiv.org/pdf/2312.13630v1.pdf","comment":"Accepted by The 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI-24)"},{"id":"http://arxiv.org/abs/2312.11460v2","updated":"2023-12-21T07:46:20Z","published":"2023-12-18T18:59:06Z","title":"Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged\n Locomotion","summary":" Robust locomotion control depends on accurate state estimations. However, the\nsensors of most legged robots can only provide partial and noisy observations,\nmaking the estimation particularly challenging, especially for external states\nlike terrain frictions and elevation maps. Inspired by the classical Internal\nModel Control principle, we consider these external states as disturbances and\nintroduce Hybrid Internal Model (HIM) to estimate them according to the\nresponse of the robot. The response, which we refer to as the hybrid internal\nembedding, contains the robot's explicit velocity and implicit stability\nrepresentation, corresponding to two primary goals for locomotion tasks:\nexplicitly tracking velocity and implicitly maintaining stability. We use\ncontrastive learning to optimize the embedding to be close to the robot's\nsuccessor state, in which the response is naturally embedded. HIM has several\nappealing benefits: It only needs the robot's proprioceptions, i.e., those from\njoint encoders and IMU as observations. It innovatively maintains consistent\nobservations between simulation reference and reality that avoids information\nloss in mimicking learning. It exploits batch-level information that is more\nrobust to noises and keeps better sample efficiency. It only requires 1 hour of\ntraining on an RTX 4090 to enable a quadruped robot to traverse any terrain\nunder any disturbances. A wealth of real-world experiments demonstrates its\nagility, even in high-difficulty tasks and cases never occurred during the\ntraining process, revealing remarkable open-world generalizability.\n","authors":["Junfeng Long","Zirui Wang","Quanyi Li","Jiawei Gao","Liu Cao","Jiangmiao Pang"],"pdf_url":"https://arxiv.org/pdf/2312.11460v2.pdf","comment":"Use 1 hour to train a quadruped robot capable of traversing any\n terrain under any disturbances in the open world, Project Page:\n https://github.com/OpenRobotLab/HIMLoco"},{"id":"http://arxiv.org/abs/2305.12743v2","updated":"2023-12-21T07:35:33Z","published":"2023-05-22T06:11:01Z","title":"Semantic Invariant Multi-view Clustering with Fully Incomplete\n Information","summary":" Robust multi-view learning with incomplete information has received\nsignificant attention due to issues such as incomplete correspondences and\nincomplete instances that commonly affect real-world multi-view applications.\nExisting approaches heavily rely on paired samples to realign or impute\ndefective ones, but such preconditions cannot always be satisfied in practice\ndue to the complexity of data collection and transmission. To address this\nproblem, we present a novel framework called SeMantic Invariance LEarning\n(SMILE) for multi-view clustering with incomplete information that does not\nrequire any paired samples. To be specific, we discover the existence of\ninvariant semantic distribution across different views, which enables SMILE to\nalleviate the cross-view discrepancy to learn consensus semantics without\nrequiring any paired samples. The resulting consensus semantics remain\nunaffected by cross-view distribution shifts, making them useful for\nrealigning/imputing defective instances and forming clusters. We demonstrate\nthe effectiveness of SMILE through extensive comparison experiments with 13\nstate-of-the-art baselines on five benchmarks. Our approach improves the\nclustering accuracy of NoisyMNIST from 19.3\\%/23.2\\% to 82.7\\%/69.0\\% when the\ncorrespondences/instances are fully incomplete. The code could be accessed from\nhttps://pengxi.me.\n","authors":["Pengxin Zeng","Mouxing Yang","Yiding Lu","Changqing Zhang","Peng Hu","Xi Peng"],"pdf_url":"https://arxiv.org/pdf/2305.12743v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03815v3","updated":"2023-12-21T07:29:43Z","published":"2023-05-05T19:42:39Z","title":"Persistent Homology Meets Object Unity: Object Recognition in Clutter","summary":" Recognition of occluded objects in unseen and unstructured indoor\nenvironments is a challenging problem for mobile robots. To address this\nchallenge, we propose a new descriptor, TOPS, for point clouds generated from\ndepth images and an accompanying recognition framework, THOR, inspired by human\nreasoning. The descriptor employs a novel slicing-based approach to compute\ntopological features from filtrations of simplicial complexes using persistent\nhomology, and facilitates reasoning-based recognition using object unity. Apart\nfrom a benchmark dataset, we report performance on a new dataset, the UW Indoor\nScenes (UW-IS) Occluded dataset, curated using commodity hardware to reflect\nreal-world scenarios with different environmental conditions and degrees of\nobject occlusion. THOR outperforms state-of-the-art methods on both the\ndatasets and achieves substantially higher recognition accuracy for all the\nscenarios of the UW-IS Occluded dataset. Therefore, THOR, is a promising step\ntoward robust recognition in low-cost robots, meant for everyday use in indoor\nsettings.\n","authors":["Ekta U. Samani","Ashis G. Banerjee"],"pdf_url":"https://arxiv.org/pdf/2305.03815v3.pdf","comment":"This work has been accepted for publication in the IEEE Transactions\n on Robotics"},{"id":"http://arxiv.org/abs/2312.13620v1","updated":"2023-12-21T07:22:25Z","published":"2023-12-21T07:22:25Z","title":"A Comprehensive End-to-End Computer Vision Framework for Restoration and\n Recognition of Low-Quality Engineering Drawings","summary":" The digitization of engineering drawings is crucial for efficient reuse,\ndistribution, and archiving. Existing computer vision approaches for digitizing\nengineering drawings typically assume the input drawings have high quality.\nHowever, in reality, engineering drawings are often blurred and distorted due\nto improper scanning, storage, and transmission, which may jeopardize the\neffectiveness of existing approaches. This paper focuses on restoring and\nrecognizing low-quality engineering drawings, where an end-to-end framework is\nproposed to improve the quality of the drawings and identify the graphical\nsymbols on them. The framework uses K-means clustering to classify different\nengineering drawing patches into simple and complex texture patches based on\ntheir gray level co-occurrence matrix statistics. Computer vision operations\nand a modified Enhanced Super-Resolution Generative Adversarial Network\n(ESRGAN) model are then used to improve the quality of the two types of\npatches, respectively. A modified Faster Region-based Convolutional Neural\nNetwork (Faster R-CNN) model is used to recognize the quality-enhanced\ngraphical symbols. Additionally, a multi-stage task-driven collaborative\nlearning strategy is proposed to train the modified ESRGAN and Faster R-CNN\nmodels to improve the resolution of engineering drawings in the direction that\nfacilitates graphical symbol recognition, rather than human visual perception.\nA synthetic data generation method is also proposed to construct\nquality-degraded samples for training the framework. Experiments on real-world\nelectrical diagrams show that the proposed framework achieves an accuracy of\n98.98% and a recall of 99.33%, demonstrating its superiority over previous\napproaches. Moreover, the framework is integrated into a widely-used power\nsystem software application to showcase its practicality.\n","authors":["Lvyang Yang","Jiankang Zhang","Huaiqiang Li","Longfei Ren","Chen Yang","Jingyu Wang","Dongyuan Shi"],"pdf_url":"https://arxiv.org/pdf/2312.13620v1.pdf","comment":"20 pages, 13 figures, submitted to Engineering Applications of\n Artificial Intelligence"},{"id":"http://arxiv.org/abs/2307.16586v4","updated":"2023-12-21T07:03:08Z","published":"2023-07-31T11:40:53Z","title":"SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment\n Anything Model","summary":" Optical Flow Estimation aims to find the 2D dense motion field between two\nframes. Due to the limitation of model structures and training datasets,\nexisting methods often rely too much on local clues and ignore the integrity of\nobjects, resulting in fragmented motion estimation. Through theoretical\nanalysis, we find the pre-trained large vision models are helpful in optical\nflow estimation, and we notice that the recently famous Segment Anything Model\n(SAM) demonstrates a strong ability to segment complete objects, which is\nsuitable for solving the fragmentation problem. We thus propose a solution to\nembed the frozen SAM image encoder into FlowFormer to enhance object\nperception. To address the challenge of in-depth utilizing SAM in\nnon-segmentation tasks like optical flow estimation, we propose an Optical Flow\nTask-Specific Adaption scheme, including a Context Fusion Module to fuse the\nSAM encoder with the optical flow context encoder, and a Context Adaption\nModule to adapt the SAM features for optical flow task with Learned\nTask-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10\nclean/final EPE and 3.55/12.32 EPE/F1-all on Sintel and KITTI-15 training set,\nsurpassing Flowformer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model\nachieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks,\nranking #1 among all two-frame methods on Sintel clean pass.\n","authors":["Shili Zhou","Ruian He","Weimin Tan","Bo Yan"],"pdf_url":"https://arxiv.org/pdf/2307.16586v4.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2304.03693v3","updated":"2023-12-21T06:44:25Z","published":"2023-04-07T15:30:49Z","title":"Model-Agnostic Gender Debiased Image Captioning","summary":" Image captioning models are known to perpetuate and amplify harmful societal\nbias in the training set. In this work, we aim to mitigate such gender bias in\nimage captioning models. While prior work has addressed this problem by forcing\nmodels to focus on people to reduce gender misclassification, it conversely\ngenerates gender-stereotypical words at the expense of predicting the correct\ngender. From this observation, we hypothesize that there are two types of\ngender bias affecting image captioning models: 1) bias that exploits context to\npredict gender, and 2) bias in the probability of generating certain (often\nstereotypical) words because of gender. To mitigate both types of gender\nbiases, we propose a framework, called LIBRA, that learns from synthetically\nbiased samples to decrease both types of biases, correcting gender\nmisclassification and changing gender-stereotypical words to more neutral ones.\nCode is available at https://github.com/rebnej/LIBRA.\n","authors":["Yusuke Hirota","Yuta Nakashima","Noa Garcia"],"pdf_url":"https://arxiv.org/pdf/2304.03693v3.pdf","comment":"CVPR 2023"},{"id":"http://arxiv.org/abs/2312.13604v1","updated":"2023-12-21T06:44:18Z","published":"2023-12-21T06:44:18Z","title":"Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos","summary":" We introduce Ponymation, a new method for learning a generative model of\narticulated 3D animal motions from raw, unlabeled online videos. Unlike\nexisting approaches for motion synthesis, our model does not require any pose\nannotations or parametric shape models for training, and is learned purely from\na collection of raw video clips obtained from the Internet. We build upon a\nrecent work, MagicPony, which learns articulated 3D animal shapes purely from\nsingle image collections, and extend it on two fronts. First, instead of\ntraining on static images, we augment the framework with a video training\npipeline that incorporates temporal regularizations, achieving more accurate\nand temporally consistent reconstructions. Second, we learn a generative model\nof the underlying articulated 3D motion sequences via a spatio-temporal\ntransformer VAE, simply using 2D reconstruction losses without relying on any\nexplicit pose annotations. At inference time, given a single 2D image of a new\nanimal instance, our model reconstructs an articulated, textured 3D mesh, and\ngenerates plausible 3D animations by sampling from the learned motion latent\nspace.\n","authors":["Keqiang Sun","Dor Litvak","Yunzhi Zhang","Hongsheng Li","Jiajun Wu","Shangzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2312.13604v1.pdf","comment":"Project page: https://keqiangsun.github.io/projects/ponymation. The\n first two authors contributed equally to this work. The last two authors\n contributed equally"},{"id":"http://arxiv.org/abs/2312.11396v2","updated":"2023-12-21T06:39:15Z","published":"2023-12-18T17:55:44Z","title":"MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based\n Attention-Adjusted Guidance","summary":" Recent diffusion-based image editing approaches have exhibited impressive\nediting capabilities in images with simple compositions. However, localized\nediting in complex scenarios has not been well-studied in the literature,\ndespite its growing real-world demands. Existing mask-based inpainting methods\nfall short of retaining the underlying structure within the edit region.\nMeanwhile, mask-free attention-based methods often exhibit editing leakage and\nmisalignment in more complex compositions. In this work, we develop MAG-Edit, a\ntraining-free, inference-stage optimization method, which enables localized\nimage editing in complex scenarios. In particular, MAG-Edit optimizes the noise\nlatent feature in diffusion models by maximizing two mask-based cross-attention\nconstraints of the edit token, which in turn gradually enhances the local\nalignment with the desired prompt. Extensive quantitative and qualitative\nexperiments demonstrate the effectiveness of our method in achieving both text\nalignment and structure preservation for localized editing within complex\nscenarios.\n","authors":["Qi Mao","Lan Chen","Yuchao Gu","Zhen Fang","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2312.11396v2.pdf","comment":"for project page, see https://mag-edit.github.io/"},{"id":"http://arxiv.org/abs/2311.02358v4","updated":"2023-12-21T06:01:32Z","published":"2023-11-04T09:57:50Z","title":"Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution --\n a Non-Denoising Model","summary":" Large scale image super-resolution is a challenging computer vision task,\nsince vast information is missing in a highly degraded image, say for example\nforscale x16 super-resolution. Diffusion models are used successfully in recent\nyears in extreme super-resolution applications, in which Gaussian noise is used\nas a means to form a latent photo-realistic space, and acts as a link between\nthe space of latent vectors and the latent photo-realistic space. There are\nquite a few sophisticated mathematical derivations on mapping the statistics of\nGaussian noises making Diffusion Models successful. In this paper we propose a\nsimple approach which gets away from using Gaussian noise but adopts some basic\nstructures of diffusion models for efficient image super-resolution.\nEssentially, we propose a DNN to perform domain transfer between neighbor\ndomains, which can learn the differences in statistical properties to\nfacilitate gradual interpolation with results of reasonable quality. Further\nquality improvement is achieved by conditioning the domain transfer with\nreference to the input LR image. Experimental results show that our method\noutperforms not only state-of-the-art large scale super resolution models, but\nalso the current diffusion models for image super-resolution. The approach can\nreadily be extended to other image-to-image tasks, such as image enlightening,\ninpainting, denoising, etc.\n","authors":["Chun-Chuen Hui","Wan-Chi Siu","Ngai-Fong Law"],"pdf_url":"https://arxiv.org/pdf/2311.02358v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07871v3","updated":"2023-12-21T05:53:35Z","published":"2023-12-13T03:17:34Z","title":"MLNet: Mutual Learning Network with Neighborhood Invariance for\n Universal Domain Adaptation","summary":" Universal domain adaptation (UniDA) is a practical but challenging problem,\nin which information about the relation between the source and the target\ndomains is not given for knowledge transfer. Existing UniDA methods may suffer\nfrom the problems of overlooking intra-domain variations in the target domain\nand difficulty in separating between the similar known and unknown class. To\naddress these issues, we propose a novel Mutual Learning Network (MLNet) with\nneighborhood invariance for UniDA. In our method, confidence-guided invariant\nfeature learning with self-adaptive neighbor selection is designed to reduce\nthe intra-domain variations for more generalizable feature representation. By\nusing the cross-domain mixup scheme for better unknown-class identification,\nthe proposed method compensates for the misidentified known-class errors by\nmutual learning between the closed-set and open-set classifiers. Extensive\nexperiments on three publicly available benchmarks demonstrate that our method\nachieves the best results compared to the state-of-the-arts in most cases and\nsignificantly outperforms the baseline across all the four settings in UniDA.\nCode is available at https://github.com/YanzuoLu/MLNet.\n","authors":["Yanzuo Lu","Meng Shen","Andy J Ma","Xiaohua Xie","Jian-Huang Lai"],"pdf_url":"https://arxiv.org/pdf/2312.07871v3.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2309.12780v3","updated":"2023-12-21T05:52:52Z","published":"2023-09-22T10:43:55Z","title":"LMC: Large Model Collaboration with Cross-assessment for Training-Free\n Open-Set Object Recognition","summary":" Open-set object recognition aims to identify if an object is from a class\nthat has been encountered during training or not. To perform open-set object\nrecognition accurately, a key challenge is how to reduce the reliance on\nspurious-discriminative features. In this paper, motivated by that different\nlarge models pre-trained through different paradigms can possess very rich\nwhile distinct implicit knowledge, we propose a novel framework named Large\nModel Collaboration (LMC) to tackle the above challenge via collaborating\ndifferent off-the-shelf large models in a training-free manner. Moreover, we\nalso incorporate the proposed framework with several novel designs to\neffectively extract implicit knowledge from large models. Extensive experiments\ndemonstrate the efficacy of our proposed framework. Code is available\nhttps://github.com/Harryqu123/LMC\n","authors":["Haoxuan Qu","Xiaofei Hui","Yujun Cai","Jun Liu"],"pdf_url":"https://arxiv.org/pdf/2309.12780v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13594v1","updated":"2023-12-21T05:51:55Z","published":"2023-12-21T05:51:55Z","title":"Towards More Faithful Natural Language Explanation Using Multi-Level\n Contrastive Learning in VQA","summary":" Natural language explanation in visual question answer (VQA-NLE) aims to\nexplain the decision-making process of models by generating natural language\nsentences to increase users' trust in the black-box systems. Existing post-hoc\nmethods have achieved significant progress in obtaining a plausible\nexplanation. However, such post-hoc explanations are not always aligned with\nhuman logical inference, suffering from the issues on: 1) Deductive\nunsatisfiability, the generated explanations do not logically lead to the\nanswer; 2) Factual inconsistency, the model falsifies its counterfactual\nexplanation for answers without considering the facts in images; and 3)\nSemantic perturbation insensitivity, the model can not recognize the semantic\nchanges caused by small perturbations. These problems reduce the faithfulness\nof explanations generated by models. To address the above issues, we propose a\nnovel self-supervised \\textbf{M}ulti-level \\textbf{C}ontrastive\n\\textbf{L}earning based natural language \\textbf{E}xplanation model (MCLE) for\nVQA with semantic-level, image-level, and instance-level factual and\ncounterfactual samples. MCLE extracts discriminative features and aligns the\nfeature spaces from explanations with visual question and answer to generate\nmore consistent explanations. We conduct extensive experiments, ablation\nanalysis, and case study to demonstrate the effectiveness of our method on two\nVQA-NLE benchmarks.\n","authors":["Chengen Lai","Shengli Song","Shiqi Meng","Jingyang Li","Sitong Yan","Guangneng Hu"],"pdf_url":"https://arxiv.org/pdf/2312.13594v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2211.07864v3","updated":"2023-12-21T05:45:52Z","published":"2022-11-15T03:10:05Z","title":"Federated Adaptive Prompt Tuning for Multi-domain Collaborative Learning","summary":" Federated learning (FL) enables multiple clients to collaboratively train a\nglobal model without disclosing their data. Previous researches often require\ntraining the complete model parameters. However, the emergence of powerful\npre-trained models makes it possible to achieve higher performance with fewer\nlearnable parameters in FL. In this paper, we propose a federated adaptive\nprompt tuning algorithm, FedAPT, for multi-domain collaborative image\nclassification with powerful foundation models, like CLIP. Compared with direct\nfederated prompt tuning, our core idea is to adaptively unlock specific domain\nknowledge for each test sample in order to provide them with personalized\nprompts. To implement this idea, we design an adaptive prompt tuning module,\nwhich consists of a meta prompt, an adaptive network, and some keys. The server\nrandomly generates a set of keys and assigns a unique key to each client. Then\nall clients cooperatively train the global adaptive network and meta prompt\nwith the local datasets and the frozen keys. Ultimately, the global aggregation\nmodel can assign a personalized prompt to CLIP based on the domain features of\neach test sample. We perform extensive experiments on two multi-domain image\nclassification datasets across two different settings -- supervised and\nunsupervised. The results show that FedAPT can achieve better performance with\nless than 10\\% of the number of parameters of the fully trained model, and the\nglobal model can perform well in diverse client domains simultaneously.\n","authors":["Shangchao Su","Mingzhao Yang","Bin Li","Xiangyang Xue"],"pdf_url":"https://arxiv.org/pdf/2211.07864v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10208v2","updated":"2023-12-21T05:44:09Z","published":"2023-12-15T21:06:22Z","title":"Video-based Surgical Skill Assessment using Tree-based Gaussian Process\n Classifier","summary":" This paper aims to present a novel pipeline for automated surgical skill\nassessment using video data and to showcase the effectiveness of the proposed\napproach in evaluating surgeon proficiency, its potential for targeted training\ninterventions, and quality assurance in surgical departments. The pipeline\nincorporates a representation flow convolutional neural network and a novel\ntree-based Gaussian process classifier, which is robust to noise, while being\ncomputationally efficient. Additionally, new kernels are introduced to enhance\naccuracy. The performance of the pipeline is evaluated using the JIGSAWS\ndataset. Comparative analysis with existing literature reveals significant\nimprovement in accuracy and betterment in computation cost. The proposed\npipeline contributes to computational efficiency and accuracy improvement in\nsurgical skill assessment using video data. Results of our study based on\ncomments of our colleague surgeons show that the proposed method has the\npotential to facilitate skill improvement among surgery fellows and enhance\npatient safety through targeted training interventions and quality assurance in\nsurgical departments.\n","authors":["Arefeh Rezaei","Mohammad Javad Ahmadi","Amir Molaei","Hamid. D. Taghirad"],"pdf_url":"https://arxiv.org/pdf/2312.10208v2.pdf","comment":"11 pages, 2 figures, journal"},{"id":"http://arxiv.org/abs/2312.07488v2","updated":"2023-12-21T05:37:58Z","published":"2023-12-12T18:24:15Z","title":"LMDrive: Closed-Loop End-to-End Driving with Large Language Models","summary":" Despite significant recent progress in the field of autonomous driving,\nmodern methods still struggle and can incur serious accidents when encountering\nlong-tail unforeseen events and challenging urban scenarios. On the one hand,\nlarge language models (LLM) have shown impressive reasoning capabilities that\napproach \"Artificial General Intelligence\". On the other hand, previous\nautonomous driving methods tend to rely on limited-format inputs (e.g. sensor\ndata and navigation waypoints), restricting the vehicle's ability to understand\nlanguage information and interact with humans. To this end, this paper\nintroduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous\ndriving framework. LMDrive uniquely processes and integrates multi-modal sensor\ndata with natural language instructions, enabling interaction with humans and\nnavigation software in realistic instructional settings. To facilitate further\nresearch in language-based closed-loop autonomous driving, we also publicly\nrelease the corresponding dataset which includes approximately 64K\ninstruction-following data clips, and the LangAuto benchmark that tests the\nsystem's ability to handle complex instructions and challenging driving\nscenarios. Extensive closed-loop experiments are conducted to demonstrate\nLMDrive's effectiveness. To the best of our knowledge, we're the very first\nwork to leverage LLMs for closed-loop end-to-end autonomous driving. Codes,\nmodels, and datasets can be found at https://github.com/opendilab/LMDrive\n","authors":["Hao Shao","Yuxuan Hu","Letian Wang","Steven L. Waslander","Yu Liu","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2312.07488v2.pdf","comment":"project page: https://hao-shao.com/projects/lmdrive.html"},{"id":"http://arxiv.org/abs/2312.13139v2","updated":"2023-12-21T05:34:23Z","published":"2023-12-20T16:00:43Z","title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot\n Manipulation","summary":" Generative pre-trained models have demonstrated remarkable effectiveness in\nlanguage and vision domains by learning useful representations. In this paper,\nwe extend the scope of this effectiveness by showing that visual robot\nmanipulation can significantly benefit from large-scale video generative\npre-training. We introduce GR-1, a straightforward GPT-style model designed for\nmulti-task language-conditioned visual robot manipulation. GR-1 takes as inputs\na language instruction, a sequence of observation images, and a sequence of\nrobot states. It predicts robot actions as well as future images in an\nend-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly\nfinetuned on robot data after pre-trained on a large-scale video dataset. We\nperform extensive experiments on the challenging CALVIN benchmark and a real\nrobot. On CALVIN benchmark, our method outperforms state-of-the-art baseline\nmethods and improves the success rate from 88.9% to 94.9%. In the setting of\nzero-shot unseen scene generalization, GR-1 improves the success rate from\n53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline\nmethods and shows strong potentials in generalization to unseen scenes and\nobjects. We provide inaugural evidence that a unified GPT-style transformer,\naugmented with large-scale video generative pre-training, exhibits remarkable\ngeneralization to multi-task visual robot manipulation. Project page:\nhttps://GR1-Manipulation.github.io\n","authors":["Hongtao Wu","Ya Jing","Chilam Cheang","Guangzeng Chen","Jiafeng Xu","Xinghang Li","Minghuan Liu","Hang Li","Tao Kong"],"pdf_url":"https://arxiv.org/pdf/2312.13139v2.pdf","comment":"Project page: https://GR1-Manipulation.github.io"},{"id":"http://arxiv.org/abs/2202.02980v5","updated":"2023-12-21T05:14:59Z","published":"2022-02-07T07:12:24Z","title":"3D Object Detection from Images for Autonomous Driving: A Survey","summary":" 3D object detection from images, one of the fundamental and challenging\nproblems in autonomous driving, has received increasing attention from both\nindustry and academia in recent years. Benefiting from the rapid development of\ndeep learning technologies, image-based 3D detection has achieved remarkable\nprogress. Particularly, more than 200 works have studied this problem from 2015\nto 2021, encompassing a broad spectrum of theories, algorithms, and\napplications. However, to date no recent survey exists to collect and organize\nthis knowledge. In this paper, we fill this gap in the literature and provide\nthe first comprehensive survey of this novel and continuously growing research\nfield, summarizing the most commonly used pipelines for image-based 3D\ndetection and deeply analyzing each of their components. Additionally, we also\npropose two new taxonomies to organize the state-of-the-art methods into\ndifferent categories, with the intent of providing a more systematic review of\nexisting methods and facilitating fair comparisons with future works. In\nretrospect of what has been achieved so far, we also analyze the current\nchallenges in the field and discuss future directions for image-based 3D\ndetection research.\n","authors":["Xinzhu Ma","Wanli Ouyang","Andrea Simonelli","Elisa Ricci"],"pdf_url":"https://arxiv.org/pdf/2202.02980v5.pdf","comment":"Accepted by T-PAMI"},{"id":"http://arxiv.org/abs/2312.13578v1","updated":"2023-12-21T05:03:18Z","published":"2023-12-21T05:03:18Z","title":"DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for\n Single Image Talking Face Generation","summary":" The generation of emotional talking faces from a single portrait image\nremains a significant challenge. The simultaneous achievement of expressive\nemotional talking and accurate lip-sync is particularly difficult, as\nexpressiveness is often compromised for the accuracy of lip-sync. As widely\nadopted by many prior works, the LSTM network often fails to capture the\nsubtleties and variations of emotional expressions. To address these\nchallenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven\nframework, tailored for generating diverse expressions and accurate lip-sync\nconcurrently. In the first stage, we propose EmoDiff, a novel diffusion module\nthat generates diverse highly dynamic emotional expressions and head poses in\naccordance with the audio and the referenced emotion style. Given the strong\ncorrelation between lip motion and audio, we then refine the dynamics with\nenhanced lip-sync accuracy using audio features and emotion style. To this end,\nwe deploy a video-to-video rendering module to transfer the expressions and lip\nmotions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively\nand qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of\nexpressiveness, lip-sync accuracy and perceptual quality.\n","authors":["Chenxu Zhang","Chao Wang","Jianfeng Zhang","Hongyi Xu","Guoxian Song","You Xie","Linjie Luo","Yapeng Tian","Xiaohu Guo","Jiashi Feng"],"pdf_url":"https://arxiv.org/pdf/2312.13578v1.pdf","comment":"Project Page at https://magic-research.github.io/dream-talk/"},{"id":"http://arxiv.org/abs/2312.13575v1","updated":"2023-12-21T04:48:34Z","published":"2023-12-21T04:48:34Z","title":"ARBiBench: Benchmarking Adversarial Robustness of Binarized Neural\n Networks","summary":" Network binarization exhibits great potential for deployment on\nresource-constrained devices due to its low computational cost. Despite the\ncritical importance, the security of binarized neural networks (BNNs) is rarely\ninvestigated. In this paper, we present ARBiBench, a comprehensive benchmark to\nevaluate the robustness of BNNs against adversarial perturbations on CIFAR-10\nand ImageNet. We first evaluate the robustness of seven influential BNNs on\nvarious white-box and black-box attacks. The results reveal that 1) The\nadversarial robustness of BNNs exhibits a completely opposite performance on\nthe two datasets under white-box attacks. 2) BNNs consistently exhibit better\nadversarial robustness under black-box attacks. 3) Different BNNs exhibit\ncertain similarities in their robustness performance. Then, we conduct\nexperiments to analyze the adversarial robustness of BNNs based on these\ninsights. Our research contributes to inspiring future research on enhancing\nthe robustness of BNNs and advancing their application in real-world scenarios.\n","authors":["Peng Zhao","Jiehua Zhang","Bowen Peng","Longguang Wang","YingMei Wei","Yu Liu","Li Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.14742v2","updated":"2023-12-21T04:06:43Z","published":"2022-11-27T06:18:40Z","title":"Dynamic Feature Pruning and Consolidation for Occluded Person\n Re-Identification","summary":" Occluded person re-identification (ReID) is a challenging problem due to\ncontamination from occluders. Existing approaches address the issue with prior\nknowledge cues, such as human body key points and semantic segmentations, which\neasily fail in the presence of heavy occlusion and other humans as occluders.\nIn this paper, we propose a feature pruning and consolidation (FPC) framework\nto circumvent explicit human structure parsing. The framework mainly consists\nof a sparse encoder, a multi-view feature mathcing module, and a feature\nconsolidation decoder. Specifically, the sparse encoder drops less important\nimage tokens, mostly related to background noise and occluders, solely based on\ncorrelation within the class token attention. Subsequently, the matching stage\nrelies on the preserved tokens produced by the sparse encoder to identify\nk-nearest neighbors in the gallery by measuring the image and patch-level\ncombined similarity. Finally, we use the feature consolidation module to\ncompensate pruned features using identified neighbors for recovering essential\ninformation while disregarding disturbance from noise and occlusion.\nExperimental results demonstrate the effectiveness of our proposed framework on\noccluded, partial, and holistic Re-ID datasets. In particular, our method\noutperforms state-of-the-art results by at least 8.6\\% mAP and 6.0\\% Rank-1\naccuracy on the challenging Occluded-Duke dataset.\n","authors":["YuTeng Ye","Hang Zhou","Jiale Cai","Chenxing Gao","Youjia Zhang","Junle Wang","Qiang Hu","Junqing Yu","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2211.14742v2.pdf","comment":"Accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2308.10045v2","updated":"2023-12-21T04:01:11Z","published":"2023-08-19T15:08:10Z","title":"An Empirical Study of CLIP for Text-based Person Search","summary":" Text-based Person Search (TBPS) aims to retrieve the person images using\nnatural language descriptions. Recently, Contrastive Language Image Pretraining\n(CLIP), a universal large cross-modal vision-language pre-training model, has\nremarkably performed over various cross-modal downstream tasks due to its\npowerful cross-modal semantic learning capacity. TPBS, as a fine-grained\ncross-modal retrieval task, is also facing the rise of research on the\nCLIP-based TBPS. In order to explore the potential of the visual-language\npre-training model for downstream TBPS tasks, this paper makes the first\nattempt to conduct a comprehensive empirical study of CLIP for TBPS and thus\ncontribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the\nTBPS community. We revisit critical design considerations under CLIP, including\ndata augmentation and loss function. The model, with the aforementioned designs\nand practical training tricks, can attain satisfactory performance without any\nsophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in\nmodel generalization and model compression, demonstrating the effectiveness of\nTBPS-CLIP from various aspects. This work is expected to provide empirical\ninsights and highlight future CLIP-based TBPS research.\n","authors":["Min Cao","Yang Bai","Ziyin Zeng","Mang Ye","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.10045v2.pdf","comment":"Accepted by AAAI 2024. Code is available at\n https://github.com/Flame-Chasers/TBPS-CLIP"},{"id":"http://arxiv.org/abs/2309.08154v2","updated":"2023-12-21T03:53:38Z","published":"2023-09-15T04:39:11Z","title":"Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking","summary":" The core of cross-modal matching is to accurately measure the similarity\nbetween different modalities in a unified representation space. However,\ncompared to textual descriptions of a certain perspective, the visual modality\nhas more semantic variations. So, images are usually associated with multiple\ntextual captions in databases. Although popular symmetric embedding methods\nhave explored numerous modal interaction approaches, they often learn toward\nincreasing the average expression probability of multiple semantic variations\nwithin image embeddings. Consequently, information entropy in embeddings is\nincreased, resulting in redundancy and decreased accuracy. In this work, we\npropose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the\ninformation entropy. Specifically, we obtain a set of heterogeneous visual\nsub-embeddings through dynamic orthogonal constraint loss. To encourage the\ngenerated candidate embeddings to capture various semantic variations, we\nconstruct a mixed distribution and employ a variance-aware weighting loss to\nassign different weights to the optimization process. In addition, we develop a\nFast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and\nenhance the performance. We compare the performance with existing set-based\nmethod using four image feature encoders and two text feature encoders on three\nbenchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role\nof different components by ablation studies and perform a sensitivity analysis\nof the hyperparameters. The qualitative analysis of visualized bidirectional\nretrieval and attention maps further demonstrates the ability of our method to\nencode semantic variations.\n","authors":["Wenzhang Wei","Zhipeng Gui","Changguang Wu","Anqi Zhao","Dehua Peng","Huayi Wu"],"pdf_url":"https://arxiv.org/pdf/2309.08154v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13558v1","updated":"2023-12-21T03:51:08Z","published":"2023-12-21T03:51:08Z","title":"The Truth is in There: Improving Reasoning in Language Models with\n Layer-Selective Rank Reduction","summary":" Transformer-based Large Language Models (LLMs) have become a fixture in\nmodern machine learning. Correspondingly, significant resources are allocated\ntowards research that aims to further advance this technology, typically\nresulting in models of increasing size that are trained on increasing amounts\nof data. This work, however, demonstrates the surprising result that it is\noften possible to significantly improve the performance of LLMs by selectively\nremoving higher-order components of their weight matrices. This simple\nintervention, which we call LAyer-SElective Rank reduction (LASER), can be done\non a model after training has completed, and requires no additional parameters\nor data. We show extensive experiments demonstrating the generality of this\nfinding across language models and datasets, and provide in-depth analyses\noffering insights into both when LASER is effective and the mechanism by which\nit operates.\n","authors":["Pratyusha Sharma","Jordan T. Ash","Dipendra Misra"],"pdf_url":"https://arxiv.org/pdf/2312.13558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13555v1","updated":"2023-12-21T03:46:29Z","published":"2023-12-21T03:46:29Z","title":"CR-SAM: Curvature Regularized Sharpness-Aware Minimization","summary":" The capacity to generalize to future unseen data stands as one of the utmost\ncrucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM)\naims to enhance the generalizability by minimizing worst-case loss using\none-step gradient ascent as an approximation. However, as training progresses,\nthe non-linearity of the loss landscape increases, rendering one-step gradient\nascent less effective. On the other hand, multi-step gradient ascent will incur\nhigher training cost. In this paper, we introduce a normalized Hessian trace to\naccurately measure the curvature of loss landscape on {\\em both} training and\ntest sets. In particular, to counter excessive non-linearity of loss landscape,\nwe propose Curvature Regularized SAM (CR-SAM), integrating the normalized\nHessian trace as a SAM regularizer. Additionally, we present an efficient way\nto compute the trace via finite differences with parallelism. Our theoretical\nanalysis based on PAC-Bayes bounds establishes the regularizer's efficacy in\nreducing generalization error. Empirical evaluation on CIFAR and ImageNet\ndatasets shows that CR-SAM consistently enhances classification performance for\nResNet and Vision Transformer (ViT) models across various datasets. Our code is\navailable at https://github.com/TrustAIoT/CR-SAM.\n","authors":["Tao Wu","Tie Luo","Donald C. Wunsch"],"pdf_url":"https://arxiv.org/pdf/2312.13555v1.pdf","comment":"AAAI 2024, main track"},{"id":"http://arxiv.org/abs/2310.04247v2","updated":"2023-12-21T03:08:39Z","published":"2023-10-06T13:41:39Z","title":"Semantic segmentation of longitudinal thermal images for identification\n of hot and cool spots in urban areas","summary":" This work presents the analysis of semantically segmented, longitudinally,\nand spatially rich thermal images collected at the neighborhood scale to\nidentify hot and cool spots in urban areas. An infrared observatory was\noperated over a few months to collect thermal images of different types of\nbuildings on the educational campus of the National University of Singapore. A\nsubset of the thermal image dataset was used to train state-of-the-art deep\nlearning models to segment various urban features such as buildings,\nvegetation, sky, and roads. It was observed that the U-Net segmentation model\nwith `resnet34' CNN backbone has the highest mIoU score of 0.99 on the test\ndataset, compared to other models such as DeepLabV3, DeeplabV3+, FPN, and\nPSPnet. The masks generated using the segmentation models were then used to\nextract the temperature from thermal images and correct for differences in the\nemissivity of various urban features. Further, various statistical measure of\nthe temperature extracted using the predicted segmentation masks is shown to\nclosely match the temperature extracted using the ground truth masks. Finally,\nthe masks were used to identify hot and cool spots in the urban feature at\nvarious instances of time. This forms one of the very few studies demonstrating\nthe automated analysis of thermal images, which can be of potential use to\nurban planners for devising mitigation strategies for reducing the urban heat\nisland (UHI) effect, improving building energy efficiency, and maximizing\noutdoor thermal comfort.\n","authors":["Vasantha Ramani","Pandarasamy Arjunan","Kameshwar Poolla","Clayton Miller"],"pdf_url":"https://arxiv.org/pdf/2310.04247v2.pdf","comment":"14 pages, 13 figures"},{"id":"http://arxiv.org/abs/2311.16512v4","updated":"2023-12-21T03:03:40Z","published":"2023-11-27T16:33:29Z","title":"CoSeR: Bridging Image and Language for Cognitive Super-Resolution","summary":" Existing super-resolution (SR) models primarily focus on restoring local\ntexture details, often neglecting the global semantic information within the\nscene. This oversight can lead to the omission of crucial semantic details or\nthe introduction of inaccurate textures during the recovery process. In our\nwork, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering\nSR models with the capacity to comprehend low-resolution images. We achieve\nthis by marrying image appearance and language understanding to generate a\ncognitive embedding, which not only activates prior information from large\ntext-to-image diffusion models but also facilitates the generation of\nhigh-quality reference images to optimize the SR process. To further improve\nimage fidelity, we propose a novel condition injection scheme called\n\"All-in-Attention\", consolidating all conditional information into a single\nmodule. Consequently, our method successfully restores semantically correct and\nphotorealistic details, demonstrating state-of-the-art performance across\nmultiple benchmarks. Code: https://github.com/VINHYU/CoSeR\n","authors":["Haoze Sun","Wenbo Li","Jianzhuang Liu","Haoyu Chen","Renjing Pei","Xueyi Zou","Youliang Yan","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2311.16512v4.pdf","comment":"Project page: https://coser-main.github.io ; GitHub repository:\n https://github.com/VINHYU/CoSeR"},{"id":"http://arxiv.org/abs/2312.13537v1","updated":"2023-12-21T02:39:53Z","published":"2023-12-21T02:39:53Z","title":"HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in\n Image Editing via Hypernetworks","summary":" Editing real images authentically while also achieving cross-domain editing\nremains a challenge. Recent studies have focused on converting real images into\nlatent codes and accomplishing image editing by manipulating these codes.\nHowever, merely manipulating the latent codes would constrain the edited images\nto the generator's image domain, hindering the attainment of diverse editing\ngoals. In response, we propose an innovative image editing method called\nHyperEditor, which utilizes weight factors generated by hypernetworks to\nreassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's\ncross-modal image-text semantic alignment, this innovative approach enables us\nto simultaneously accomplish authentic attribute editing and cross-domain style\ntransfer, a capability not realized in previous methods. Additionally, we\nascertain that modifying only the weights of specific layers in the generator\ncan yield an equivalent editing result. Therefore, we introduce an adaptive\nlayer selector, enabling our hypernetworks to autonomously identify the layers\nrequiring output weight factors, which can further improve our hypernetworks'\nefficiency. Extensive experiments on abundant challenging datasets demonstrate\nthe effectiveness of our method.\n","authors":["Hai Zhang","Chunwei Wu","Guitao Cao","Hailing Wang","Wenming Cao"],"pdf_url":"https://arxiv.org/pdf/2312.13537v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.12763v2","updated":"2023-12-21T02:39:11Z","published":"2023-12-20T04:49:45Z","title":"AMD:Anatomical Motion Diffusion with Interpretable Motion Decomposition\n and Fusion","summary":" Generating realistic human motion sequences from text descriptions is a\nchallenging task that requires capturing the rich expressiveness of both\nnatural language and human motion.Recent advances in diffusion models have\nenabled significant progress in human motion synthesis.However, existing\nmethods struggle to handle text inputs that describe complex or long motions.In\nthis paper, we propose the Adaptable Motion Diffusion (AMD) model, which\nleverages a Large Language Model (LLM) to parse the input text into a sequence\nof concise and interpretable anatomical scripts that correspond to the target\nmotion.This process exploits the LLM's ability to provide anatomical guidance\nfor complex motion synthesis.We then devise a two-branch fusion scheme that\nbalances the influence of the input text and the anatomical scripts on the\ninverse diffusion process, which adaptively ensures the semantic fidelity and\ndiversity of the synthesized motion.Our method can effectively handle texts\nwith complex or long motion descriptions, where existing methods often fail.\nExperiments on datasets with relatively more complex motions, such as CLCD1 and\nCLCD2, demonstrate that our AMD significantly outperforms existing\nstate-of-the-art models.\n","authors":["Beibei Jing","Youjia Zhang","Zikai Song","Junqing Yu","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2312.12763v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13534v1","updated":"2023-12-21T02:28:41Z","published":"2023-12-21T02:28:41Z","title":"SE(3)-Equivariant and Noise-Invariant 3D Motion Tracking in Medical\n Images","summary":" Rigid motion tracking is paramount in many medical imaging applications where\nmovements need to be detected, corrected, or accounted for. Modern strategies\nrely on convolutional neural networks (CNN) and pose this problem as rigid\nregistration. Yet, CNNs do not exploit natural symmetries in this task, as they\nare equivariant to translations (their outputs shift with their inputs) but not\nto rotations. Here we propose EquiTrack, the first method that uses recent\nsteerable SE(3)-equivariant CNNs (E-CNN) for motion tracking. While steerable\nE-CNNs can extract corresponding features across different poses, testing them\non noisy medical images reveals that they do not have enough learning capacity\nto learn noise invariance. Thus, we introduce a hybrid architecture that pairs\na denoiser with an E-CNN to decouple the processing of anatomically irrelevant\nintensity features from the extraction of equivariant spatial features. Rigid\ntransforms are then estimated in closed-form. EquiTrack outperforms\nstate-of-the-art learning and optimisation methods for motion tracking in adult\nbrain MRI and fetal MRI time series. Our code is available at\ngithub.com/BBillot/equitrack.\n","authors":["Benjamin Billot","Daniel Moyer","Neel Dey","Malte Hoffmann","Esra Abaci Turk","Borjan Gagoski","Ellen Grant","Polina Golland"],"pdf_url":"https://arxiv.org/pdf/2312.13534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15254v3","updated":"2023-12-21T02:07:58Z","published":"2023-07-28T01:40:04Z","title":"Multiple Instance Learning Framework with Masked Hard Instance Mining\n for Whole Slide Image Classification","summary":" The whole slide image (WSI) classification is often formulated as a multiple\ninstance learning (MIL) problem. Since the positive tissue is only a small\nfraction of the gigapixel WSI, existing MIL methods intuitively focus on\nidentifying salient instances via attention mechanisms. However, this leads to\na bias towards easy-to-classify instances while neglecting hard-to-classify\ninstances. Some literature has revealed that hard examples are beneficial for\nmodeling a discriminative boundary accurately. By applying such an idea at the\ninstance level, we elaborate a novel MIL framework with masked hard instance\nmining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a\nconsistency constraint to explore the potential hard instances. With several\ninstance masking strategies based on attention scores, MHIM-MIL employs a\nmomentum teacher to implicitly mine hard instances for training the student\nmodel, which can be any attention-based MIL model. This counter-intuitive\nstrategy essentially enables the student to learn a better discriminating\nboundary. Moreover, the student is used to update the teacher with an\nexponential moving average (EMA), which in turn identifies new hard instances\nfor subsequent training iterations and stabilizes the optimization.\nExperimental results on the CAMELYON-16 and TCGA Lung Cancer datasets\ndemonstrate that MHIM-MIL outperforms other latest methods in terms of\nperformance and training cost. The code is available at:\nhttps://github.com/DearCaat/MHIM-MIL.\n","authors":["Wenhao Tang","Sheng Huang","Xiaoxian Zhang","Fengtao Zhou","Yi Zhang","Bo Liu"],"pdf_url":"https://arxiv.org/pdf/2307.15254v3.pdf","comment":"Published on ICCV2023"},{"id":"http://arxiv.org/abs/2312.13528v1","updated":"2023-12-21T02:01:19Z","published":"2023-12-21T02:01:19Z","title":"DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular\n Video","summary":" Video view synthesis, allowing for the creation of visually appealing frames\nfrom arbitrary viewpoints and times, offers immersive viewing experiences.\nNeural radiance fields, particularly NeRF, initially developed for static\nscenes, have spurred the creation of various methods for video view synthesis.\nHowever, the challenge for video view synthesis arises from motion blur, a\nconsequence of object or camera movement during exposure, which hinders the\nprecise synthesis of sharp spatio-temporal views. In response, we propose a\nnovel dynamic deblurring NeRF framework for blurry monocular video, called\nDyBluRF, consisting of an Interleave Ray Refinement (IRR) stage and a Motion\nDecomposition-based Deblurring (MDD) stage. Our DyBluRF is the first that\naddresses and handles the novel view synthesis for blurry monocular video. The\nIRR stage jointly reconstructs dynamic 3D scenes and refines the inaccurate\ncamera pose information to combat imprecise pose information extracted from the\ngiven blurry frames. The MDD stage is a novel incremental latent sharp-rays\nprediction (ILSP) approach for the blurry monocular video frames by decomposing\nthe latent sharp rays into global camera motion and local object motion\ncomponents. Extensive experimental results demonstrate that our DyBluRF\noutperforms qualitatively and quantitatively the very recent state-of-the-art\nmethods. Our project page including source codes and pretrained model are\npublicly available at https://kaist-viclab.github.io/dyblurf-site/.\n","authors":["Minh-Quan Viet Bui","Jongmin Park","Jihyong Oh","Munchurl Kim"],"pdf_url":"https://arxiv.org/pdf/2312.13528v1.pdf","comment":"The first three authors contributed equally to this work. Please\n visit our project page at https://kaist-viclab.github.io/dyblurf-site/"},{"id":"http://arxiv.org/abs/2312.13514v1","updated":"2023-12-21T01:30:44Z","published":"2023-12-21T01:30:44Z","title":"Rethinking of Feature Interaction for Multi-task Learning on Dense\n Prediction","summary":" Existing works generally adopt the encoder-decoder structure for Multi-task\nDense Prediction, where the encoder extracts the task-generic features, and\nmultiple decoders generate task-specific features for predictions. We observe\nthat low-level representations with rich details and high-level representations\nwith abundant task information are not both involved in the multi-task\ninteraction process. Additionally, low-quality and low-efficiency issues also\nexist in current multi-task learning architectures. In this work, we propose to\nlearn a comprehensive intermediate feature globally from both task-generic and\ntask-specific features, we reveal an important fact that this intermediate\nfeature, namely the bridge feature, is a good solution to the above issues.\nBased on this, we propose a novel Bridge-Feature-Centirc Interaction (BRFI)\nmethod. A Bridge Feature Extractor (BFE) is designed for the generation of\nstrong bridge features and Task Pattern Propagation (TPP) is applied to ensure\nhigh-quality task interaction participants. Then a Task-Feature Refiner (TFR)\nis developed to refine final task predictions with the well-learned knowledge\nfrom the bridge features. Extensive experiments are conducted on NYUD-v2 and\nPASCAL Context benchmarks, and the superior performance shows the proposed\narchitecture is effective and powerful in promoting different dense prediction\ntasks simultaneously.\n","authors":["Jingdong Zhang","Jiayuan Fan","Peng Ye","Bo Zhang","Hancheng Ye","Baopu Li","Yancheng Cai","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13514v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13509v1","updated":"2023-12-21T01:09:52Z","published":"2023-12-21T01:09:52Z","title":"MR-STGN: Multi-Residual Spatio Temporal Graph Network Using Attention\n Fusion for Patient Action Assessment","summary":" Accurate assessment of patient actions plays a crucial role in healthcare as\nit contributes significantly to disease progression monitoring and treatment\neffectiveness. However, traditional approaches to assess patient actions often\nrely on manual observation and scoring, which are subjective and\ntime-consuming. In this paper, we propose an automated approach for patient\naction assessment using a Multi-Residual Spatio Temporal Graph Network\n(MR-STGN) that incorporates both angular and positional 3D skeletons. The\nMR-STGN is specifically designed to capture the spatio-temporal dynamics of\npatient actions. It achieves this by integrating information from multiple\nresidual layers, with each layer extracting features at distinct levels of\nabstraction. Furthermore, we integrate an attention fusion mechanism into the\nnetwork, which facilitates the adaptive weighting of various features. This\nempowers the model to concentrate on the most pertinent aspects of the\npatient's movements, offering precise instructions regarding specific body\nparts or movements that require attention. Ablation studies are conducted to\nanalyze the impact of individual components within the proposed model. We\nevaluate our model on the UI-PRMD dataset demonstrating its performance in\naccurately predicting real-time patient action scores, surpassing\nstate-of-the-art methods.\n","authors":["Youssef Mourchid","Rim Slama"],"pdf_url":"https://arxiv.org/pdf/2312.13509v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13506v1","updated":"2023-12-21T00:52:01Z","published":"2023-12-21T00:52:01Z","title":"SPDGAN: A Generative Adversarial Network based on SPD Manifold Learning\n for Automatic Image Colorization","summary":" This paper addresses the automatic colorization problem, which converts a\ngray-scale image to a colorized one. Recent deep-learning approaches can\ncolorize automatically grayscale images. However, when it comes to different\nscenes which contain distinct color styles, it is difficult to accurately\ncapture the color characteristics. In this work, we propose a fully automatic\ncolorization approach based on Symmetric Positive Definite (SPD) Manifold\nLearning with a generative adversarial network (SPDGAN) that improves the\nquality of the colorization results. Our SPDGAN model establishes an\nadversarial game between two discriminators and a generator. The latter is\nbased on ResNet architecture with few alterations. Its goal is to generate fake\ncolorized images without losing color information across layers through\nresidual connections. Then, we employ two discriminators from different\ndomains. The first one is devoted to the image pixel domain, while the second\none is to the Riemann manifold domain which helps to avoid color misalignment.\nExtensive experiments are conducted on the Places365 and COCO-stuff databases\nto test the effect of each component of our SPDGAN. In addition, quantitative\nand qualitative comparisons with state-of-the-art methods demonstrate the\neffectiveness of our model by achieving more realistic colorized images with\nless artifacts visually, and good results of PSNR, SSIM, and FID values.\n","authors":["Youssef Mourchid","Marc Donias","Yannick Berthoumieu","Mohamed Najim"],"pdf_url":"https://arxiv.org/pdf/2312.13506v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13503v1","updated":"2023-12-21T00:44:45Z","published":"2023-12-21T00:44:45Z","title":"InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large\n Multimodal and Language Models","summary":" In this paper, we build a visual dialogue dataset, named InfoVisDial, which\nprovides rich informative answers in each round even with external knowledge\nrelated to the visual content. Different from existing datasets where the\nanswer is compact and short, InfoVisDial contains long free-form answers with\nrich information in each round of dialogue. For effective data collection, the\nkey idea is to bridge the large-scale multimodal model (e.g., GIT) and the\nlanguage models (e.g., GPT-3). GIT can describe the image content even with\nscene text, while GPT-3 can generate informative dialogue based on the image\ndescription and appropriate prompting techniques. With such automatic pipeline,\nwe can readily generate informative visual dialogue data at scale. Then, we ask\nhuman annotators to rate the generated dialogues to filter the low-quality\nconversations.Human analyses show that InfoVisDial covers informative and\ndiverse dialogue topics: $54.4\\%$ of the dialogue rounds are related to image\nscene texts, and $36.7\\%$ require external knowledge. Each round's answer is\nalso long and open-ended: $87.3\\%$ of answers are unique with an average length\nof $8.9$, compared with $27.37\\%$ and $2.9$ in VisDial. Last, we propose a\nstrong baseline by adapting the GIT model for the visual dialogue task and\nfine-tune the model on InfoVisDial. Hopefully, our work can motivate more\neffort on this direction.\n","authors":["Bingbing Wen","Zhengyuan Yang","Jianfeng Wang","Zhe Gan","Bill Howe","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13503v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13500v1","updated":"2023-12-21T00:31:54Z","published":"2023-12-21T00:31:54Z","title":"Federated Continual Novel Class Learning","summary":" In a privacy-focused era, Federated Learning (FL) has emerged as a promising\nmachine learning technique. However, most existing FL studies assume that the\ndata distribution remains nearly fixed over time, while real-world scenarios\noften involve dynamic and continual changes. To equip FL systems with continual\nmodel evolution capabilities, we focus on an important problem called Federated\nContinual Novel Class Learning (FedCN) in this work. The biggest challenge in\nFedCN is to merge and align novel classes that are discovered and learned by\ndifferent clients without compromising privacy. To address this, we propose a\nGlobal Alignment Learning (GAL) framework that can accurately estimate the\nglobal novel class number and provide effective guidance for local training\nfrom a global perspective, all while maintaining privacy protection.\nSpecifically, GAL first locates high-density regions in the representation\nspace through a bi-level clustering mechanism to estimate the novel class\nnumber, with which the global prototypes corresponding to novel classes can be\nconstructed. Then, GAL uses a novel semantic weighted loss to capture all\npossible correlations between these prototypes and the training data for\nmitigating the impact of pseudo-label noise and data heterogeneity. Extensive\nexperiments on various datasets demonstrate GAL's superior performance over\nstate-of-the-art novel class discovery methods. In particular, GAL achieves\nsignificant improvements in novel-class performance, increasing the accuracy by\n5.1% to 10.6% in the case of one novel class learning stage and by 7.8% to\n17.9% in the case of two novel class learning stages, without sacrificing\nknown-class performance. Moreover, GAL is shown to be effective in equipping a\nvariety of different mainstream FL algorithms with novel class discovery and\nlearning capability, highlighting its potential for many real-world\napplications.\n","authors":["Lixu Wang","Chenxi Liu","Junfeng Guo","Jiahua Dong","Xiao Wang","Heng Huang","Qi Zhu"],"pdf_url":"https://arxiv.org/pdf/2312.13500v1.pdf","comment":"23 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.12337v2","updated":"2023-12-21T00:26:03Z","published":"2023-12-19T17:03:50Z","title":"pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable\n Generalizable 3D Reconstruction","summary":" We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D\nradiance fields parameterized by 3D Gaussian primitives from pairs of images.\nOur model features real-time and memory-efficient rendering for scalable\ntraining as well as fast 3D reconstruction at inference time. To overcome local\nminima inherent to sparse and locally supported representations, we predict a\ndense probability distribution over 3D and sample Gaussian means from that\nprobability distribution. We make this sampling operation differentiable via a\nreparameterization trick, allowing us to back-propagate gradients through the\nGaussian splatting representation. We benchmark our method on wide-baseline\nnovel view synthesis on the real-world RealEstate10k and ACID datasets, where\nwe outperform state-of-the-art light field transformers and accelerate\nrendering by 2.5 orders of magnitude while reconstructing an interpretable and\neditable 3D radiance field.\n","authors":["David Charatan","Sizhe Li","Andrea Tagliasacchi","Vincent Sitzmann"],"pdf_url":"https://arxiv.org/pdf/2312.12337v2.pdf","comment":"Project page: https://dcharatan.github.io/pixelsplat"},{"id":"http://arxiv.org/abs/2312.13494v1","updated":"2023-12-21T00:14:46Z","published":"2023-12-21T00:14:46Z","title":"Visual Tomography: Physically Faithful Volumetric Models of Partially\n Translucent Objects","summary":" When created faithfully from real-world data, Digital 3D representations of\nobjects can be useful for human or computer-assisted analysis. Such models can\nalso serve for generating training data for machine learning approaches in\nsettings where data is difficult to obtain or where too few training data\nexists, e.g. by providing novel views or images in varying conditions. While\nthe vast amount of visual 3D reconstruction approaches focus on non-physical\nmodels, textured object surfaces or shapes, in this contribution we propose a\nvolumetric reconstruction approach that obtains a physical model including the\ninterior of partially translucent objects such as plankton or insects. Our\ntechnique photographs the object under different poses in front of a bright\nwhite light source and computes absorption and scattering per voxel. It can be\ninterpreted as visual tomography that we solve by inverse raytracing. We\nadditionally suggest a method to convert non-physical NeRF media into a\nphysically-based volumetric grid for initialization and illustrate the\nusefulness of the approach using two real-world plankton validation sets, the\nlab-scanned models being finally also relighted and virtually submerged in a\nscenario with augmented medium and illumination conditions. Please visit the\nproject homepage at www.marine.informatik.uni-kiel.de/go/vito\n","authors":["David Nakath","Xiangyu Weng","Mengkun She","Kevin Köser"],"pdf_url":"https://arxiv.org/pdf/2312.13494v1.pdf","comment":"Accepted for publication at 3DV '24"},{"id":"http://arxiv.org/abs/2312.06914v3","updated":"2023-12-21T23:32:07Z","published":"2023-12-12T00:54:39Z","title":"Exploring Novel Object Recognition and Spontaneous Location Recognition\n Machine Learning Analysis Techniques in Alzheimer's Mice","summary":" Understanding object recognition patterns in mice is crucial for advancing\nbehavioral neuroscience and has significant implications for human health,\nparticularly in the realm of Alzheimer's research. This study is centered on\nthe development, application, and evaluation of a state-of-the-art\ncomputational pipeline designed to analyze such behaviors, specifically\nfocusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition\n(SLR) tasks. The pipeline integrates three advanced computational models:\nAny-Maze for initial data collection, DeepLabCut for detailed pose estimation,\nand Convolutional Neural Networks (CNNs) for nuanced behavioral classification.\nEmployed across four distinct mouse groups, this pipeline demonstrated high\nlevels of accuracy and robustness. Despite certain challenges like video\nquality limitations and the need for manual calculations, the results affirm\nthe pipeline's efficacy and potential for scalability. The study serves as a\nproof of concept for a multidimensional computational approach to behavioral\nneuroscience, emphasizing the pipeline's versatility and readiness for future,\nmore complex analyses.\n","authors":["Soham Bafana"],"pdf_url":"https://arxiv.org/pdf/2312.06914v3.pdf","comment":"Aspects of the paper contain errors, and data in the pipeline must be\n vetted one more time. More testing is necessary"},{"id":"http://arxiv.org/abs/2306.05745v2","updated":"2023-12-21T21:28:52Z","published":"2023-06-09T08:22:41Z","title":"Two Independent Teachers are Better Role Model","summary":" Recent deep learning models have attracted substantial attention in infant\nbrain analysis. These models have performed state-of-the-art performance, such\nas semi-supervised techniques (e.g., Temporal Ensembling, mean teacher).\nHowever, these models depend on an encoder-decoder structure with stacked local\noperators to gather long-range information, and the local operators limit the\nefficiency and effectiveness. Besides, the $MRI$ data contain different tissue\nproperties ($TPs$) such as $T1$ and $T2$. One major limitation of these models\nis that they use both data as inputs to the segment process, i.e., the models\nare trained on the dataset once, and it requires much computational and memory\nrequirements during inference. In this work, we address the above limitations\nby designing a new deep-learning model, called 3D-DenseUNet, which works as\nadaptable global aggregation blocks in down-sampling to solve the issue of\nspatial information loss. The self-attention module connects the down-sampling\nblocks to up-sampling blocks, and integrates the feature maps in three\ndimensions of spatial and channel, effectively improving the representation\npotential and discriminating ability of the model. Additionally, we propose a\nnew method called Two Independent Teachers ($2IT$), that summarizes the model\nweights instead of label predictions. Each teacher model is trained on\ndifferent types of brain data, $T1$ and $T2$, respectively. Then, a fuse model\nis added to improve test accuracy and enable training with fewer parameters and\nlabels compared to the Temporal Ensembling method without modifying the network\narchitecture. Empirical results demonstrate the effectiveness of the proposed\nmethod. The code is available at\nhttps://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.\n","authors":["Afifa Khaled","Ahmed A. Mubarak","Kun He"],"pdf_url":"https://arxiv.org/pdf/2306.05745v2.pdf","comment":"This manuscript contains 14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.14301v1","updated":"2023-12-21T21:18:53Z","published":"2023-12-21T21:18:53Z","title":"Autoencoder Based Face Verification System","summary":" The primary objective of this work is to present an alternative approach\naimed at reducing the dependency on labeled data. Our proposed method involves\nutilizing autoencoder pre-training within a face image recognition task with\ntwo step processes. Initially, an autoencoder is trained in an unsupervised\nmanner using a substantial amount of unlabeled training dataset. Subsequently,\na deep learning model is trained with initialized parameters from the\npre-trained autoencoder. This deep learning training process is conducted in a\nsupervised manner, employing relatively limited labeled training dataset.\nDuring evaluation phase, face image embeddings is generated as the output of\ndeep neural network layer. Our training is executed on the CelebA dataset,\nwhile evaluation is performed using benchmark face recognition datasets such as\nLabeled Faces in the Wild (LFW) and YouTube Faces (YTF). Experimental results\ndemonstrate that by initializing the deep neural network with pre-trained\nautoencoder parameters achieve comparable results to state-of-the-art methods.\n","authors":["Enoch Solomon","Abraham Woubie","Eyael Solomon Emiru"],"pdf_url":"https://arxiv.org/pdf/2312.14301v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.14291v2","updated":"2023-12-21T20:54:54Z","published":"2023-04-27T15:51:19Z","title":"EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation","summary":" With autonomous industries on the rise, domain adaptation of the visual\nperception stack is an important research direction due to the cost savings\npromise. Much prior art was dedicated to domain-adaptive semantic segmentation\nin the synthetic-to-real context. Despite being a crucial output of the\nperception stack, panoptic segmentation has been largely overlooked by the\ndomain adaptation community. Therefore, we revisit well-performing domain\nadaptation strategies from other fields, adapt them to panoptic segmentation,\nand show that they can effectively enhance panoptic domain adaptation. Further,\nwe study the panoptic network design and propose a novel architecture (EDAPS)\ndesigned explicitly for domain-adaptive panoptic segmentation. It uses a\nshared, domain-robust transformer encoder to facilitate the joint adaptation of\nsemantic and instance features, but task-specific decoders tailored for the\nspecific requirements of both domain-adaptive semantic and instance\nsegmentation. As a result, the performance gap seen in challenging panoptic\nbenchmarks is substantially narrowed. EDAPS significantly improves the\nstate-of-the-art performance for panoptic segmentation UDA by a large margin of\n20% on SYNTHIA-to-Cityscapes and even 72% on the more challenging\nSYNTHIA-to-Mapillary Vistas. The implementation is available at\nhttps://github.com/susaha/edaps.\n","authors":["Suman Saha","Lukas Hoyer","Anton Obukhov","Dengxin Dai","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2304.14291v2.pdf","comment":"ICCV 2023"},{"id":"http://arxiv.org/abs/2312.14280v1","updated":"2023-12-21T20:25:16Z","published":"2023-12-21T20:25:16Z","title":"Fine-grained Forecasting Models Via Gaussian Process Blurring Effect","summary":" Time series forecasting is a challenging task due to the existence of complex\nand dynamic temporal dependencies. This can lead to incorrect predictions by\neven the best forecasting models. Using more training data is one way to\nimprove the accuracy, but this source is often limited. In contrast, we are\nbuilding on successful denoising approaches for image generation by advocating\nfor an end-to-end forecasting and denoising paradigm.\n We propose an end-to-end forecast-blur-denoise forecasting framework by\nencouraging a division of labors between the forecasting and the denoising\nmodels. The initial forecasting model is directed to focus on accurately\npredicting the coarse-grained behavior, while the denoiser model focuses on\ncapturing the fine-grained behavior that is locally blurred by integrating a\nGaussian Process model. All three parts are interacting for the best end-to-end\nperformance. Our extensive experiments demonstrate that our proposed approach\nis able to improve the forecasting accuracy of several state-of-the-art\nforecasting models as well as several other denoising approaches.\n","authors":["Sepideh Koohfar","Laura Dietz"],"pdf_url":"https://arxiv.org/pdf/2312.14280v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2312.14239v1","updated":"2023-12-21T18:59:53Z","published":"2023-12-21T18:59:53Z","title":"PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce\n Lidar","summary":" 3D reconstruction from a single-view is challenging because of the ambiguity\nfrom monocular cues and lack of information about occluded regions. Neural\nradiance fields (NeRF), while popular for view synthesis and 3D reconstruction,\nare typically reliant on multi-view images. Existing methods for single-view 3D\nreconstruction with NeRF rely on either data priors to hallucinate views of\noccluded regions, which may not be physically accurate, or shadows observed by\nRGB cameras, which are difficult to detect in ambient light and low albedo\nbackgrounds. We propose using time-of-flight data captured by a single-photon\navalanche diode to overcome these limitations. Our method models two-bounce\noptical paths with NeRF, using lidar transient data for supervision. By\nleveraging the advantages of both NeRF and two-bounce light measured by lidar,\nwe demonstrate that we can reconstruct visible and occluded geometry without\ndata priors or reliance on controlled ambient lighting or scene albedo. In\naddition, we demonstrate improved generalization under practical constraints on\nsensor spatial- and temporal-resolution. We believe our method is a promising\ndirection as single-photon lidars become ubiquitous on consumer devices, such\nas phones, tablets, and headsets.\n","authors":["Tzofi Klinghoffer","Xiaoyu Xiang","Siddharth Somasundaram","Yuchen Fan","Christian Richardt","Ramesh Raskar","Rakesh Ranjan"],"pdf_url":"https://arxiv.org/pdf/2312.14239v1.pdf","comment":"Project Page: https://platonerf.github.io/"},{"id":"http://arxiv.org/abs/2312.14238v1","updated":"2023-12-21T18:59:31Z","published":"2023-12-21T18:59:31Z","title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic\n Visual-Linguistic Tasks","summary":" The exponential growth of large language models (LLMs) has opened up numerous\npossibilities for multi-modal AGI systems. However, the progress in vision and\nvision-language foundation models, which are also critical elements of\nmulti-modal AGI, has not kept pace with LLMs. In this work, we design a\nlarge-scale vision-language foundation model (InternVL), which scales up the\nvision foundation model to 6 billion parameters and progressively aligns it\nwith the large language model, using web-scale image-text data from various\nsources. This model can be broadly applied to and achieve state-of-the-art\nperformance on visual perception tasks such as image-level or pixel-level\nrecognition, vision-language tasks such as zero-shot image/video\nclassification, zero-shot image/video-text retrieval, and link with LLMs to\ncreate multi-modal dialogue systems. We hope that our research could contribute\nto the development of multi-modal large models. Code and models are available\nat https://github.com/OpenGVLab/InternVL.\n","authors":["Zhe Chen","Jiannan Wu","Wenhai Wang","Weijie Su","Guo Chen","Sen Xing","Zhong Muyan","Qinglong Zhang","Xizhou Zhu","Lewei Lu","Bin Li","Ping Luo","Tong Lu","Yu Qiao","Jifeng Dai"],"pdf_url":"https://arxiv.org/pdf/2312.14238v1.pdf","comment":"25 pages, 5 figures, 28 tables"},{"id":"http://arxiv.org/abs/2312.14235v1","updated":"2023-12-21T18:54:19Z","published":"2023-12-21T18:54:19Z","title":"Neural Spline Fields for Burst Image Fusion and Layer Separation","summary":" Each photo in an image burst can be considered a sample of a complex 3D\nscene: the product of parallax, diffuse and specular materials, scene motion,\nand illuminant variation. While decomposing all of these effects from a stack\nof misaligned images is a highly ill-conditioned task, the conventional\nalign-and-merge burst pipeline takes the other extreme: blending them into a\nsingle image. In this work, we propose a versatile intermediate representation:\na two-layer alpha-composited image plus flow model constructed with neural\nspline fields -- networks trained to map input coordinates to spline control\npoints. Our method is able to, during test-time optimization, jointly fuse a\nburst image capture into one high-resolution reconstruction and decompose it\ninto transmission and obstruction layers. Then, by discarding the obstruction\nlayer, we can perform a range of tasks including seeing through occlusions,\nreflection suppression, and shadow removal. Validated on complex synthetic and\nin-the-wild captures we find that, with no post-processing steps or learned\npriors, our generalizable model is able to outperform existing dedicated\nsingle-image and multi-view obstruction removal approaches.\n","authors":["Ilya Chugunov","David Shustin","Ruyu Yan","Chenyang Lei","Felix Heide"],"pdf_url":"https://arxiv.org/pdf/2312.14235v1.pdf","comment":"project website: https://light.princeton.edu/publication/nsf"},{"id":"http://arxiv.org/abs/2312.14233v1","updated":"2023-12-21T18:49:47Z","published":"2023-12-21T18:49:47Z","title":"VCoder: Versatile Vision Encoders for Multimodal Large Language Models","summary":" Humans possess the remarkable skill of Visual Perception, the ability to see\nand understand the seen, helping them make sense of the visual world and, in\nturn, reason. Multimodal Large Language Models (MLLM) have recently achieved\nimpressive performance on vision-language tasks ranging from visual\nquestion-answering and image captioning to visual reasoning and image\ngeneration. However, when prompted to identify or count (perceive) the entities\nin a given image, existing MLLM systems fail. Working towards developing an\naccurate MLLM system for perception and reasoning, we propose using Versatile\nvision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the\nVCoder with perception modalities such as segmentation or depth maps, improving\nthe MLLM's perception abilities. Secondly, we leverage the images from COCO and\noutputs from off-the-shelf vision perception models to create our COCO\nSegmentation Text (COST) dataset for training and evaluating MLLMs on the\nobject perception task. Thirdly, we introduce metrics to assess the object\nperception abilities in MLLMs on our COST dataset. Lastly, we provide extensive\nexperimental evidence proving the VCoder's improved object-level perception\nskills over existing Multimodal LLMs, including GPT-4V. We open-source our\ndataset, code, and models to promote research. We open-source our code at\nhttps://github.com/SHI-Labs/VCoder\n","authors":["Jitesh Jain","Jianwei Yang","Humphrey Shi"],"pdf_url":"https://arxiv.org/pdf/2312.14233v1.pdf","comment":"Project Page: https://praeclarumjj3.github.io/vcoder/"},{"id":"http://arxiv.org/abs/2312.14232v1","updated":"2023-12-21T18:46:46Z","published":"2023-12-21T18:46:46Z","title":"Parrot Captions Teach CLIP to Spot Text","summary":" Despite CLIP being the foundation model in numerous vision-language\napplications, the CLIP suffers from a severe text spotting bias. Such bias\ncauses CLIP models to `Parrot' the visual text embedded within images while\ndisregarding the authentic visual semantics. We uncover that in the most\npopular image-text dataset LAION-2B, the captions also densely parrot (spell)\nthe text embedded in images. Our analysis shows that around \\textbf{50\\%} of\nimages are embedded with visual text content, and \\textbf{90\\%} of their\ncaptions more or less parrot the visual text. Based on such observation, we\nthoroughly inspect the different release d versions of CLIP models and verify\nthat the visual text is the dominant factor in measuring the LAION-style\nimage-text similarity for these models. To examine whether these parrot\ncaptions shape the text spotting bias, we train a series of CLIP models with\nLAION subsets curated by different parrot-caption-oriented criteria. We show\nthat training with parrot captions easily shapes such bias but harms the\nexpected visual-language representation learning in CLIP models. This suggests\nthat it is urgent to revisit either the design of CLIP-like models or the\nexisting image-text dataset curation pipeline built on CLIP score filtering.\n","authors":["Yiqi Lin","Conghui He","Alex Jinpeng Wang","Bin Wang","Weijia Li","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2312.14232v1.pdf","comment":"project page: https://linyq17.github.io/CLIP-Parrot-Bias/"},{"id":"http://arxiv.org/abs/2311.13613v2","updated":"2023-12-21T15:35:11Z","published":"2023-11-22T03:45:30Z","title":"Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for\n Enhanced Dataset Pruning","summary":" Dataset pruning aims to construct a coreset capable of achieving performance\ncomparable to the original, full dataset. Most existing dataset pruning methods\nrely on snapshot-based criteria to identify representative samples, often\nresulting in poor generalization across various pruning and cross-architecture\nscenarios. Recent studies have addressed this issue by expanding the scope of\ntraining dynamics considered, including factors such as forgetting event and\nprobability change, typically using an averaging approach. However, these works\nstruggle to integrate a broader range of training dynamics without overlooking\nwell-generalized samples, which may not be sufficiently highlighted in an\naveraging manner. In this study, we propose a novel dataset pruning method\ntermed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS\nutilizes a dual-depth strategy to achieve a balance between incorporating\nextensive training dynamics and identifying representative samples for dataset\npruning. In the first depth, we estimate the series of each sample's individual\ncontributions spanning the training progress, ensuring comprehensive\nintegration of training dynamics. In the second depth, we focus on the\nvariability of the sample-wise contributions identified in the first depth to\nhighlight well-generalized samples. Extensive experiments conducted on CIFAR\nand ImageNet datasets verify the superiority of TDDS over previous SOTA\nmethods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with\nonly 10% training data, surpassing random selection by 7.83% and other\ncomparison methods by at least 12.69%.\n","authors":["Xin Zhang","Jiawei Du","Yunsong Li","Weiying Xie","Joey Tianyi Zhou"],"pdf_url":"https://arxiv.org/pdf/2311.13613v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14223v1","updated":"2023-12-21T15:05:12Z","published":"2023-12-21T15:05:12Z","title":"Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation","summary":" Shortcut learning is when a model -- e.g. a cardiac disease classifier --\nexploits correlations between the target label and a spurious shortcut feature,\ne.g. a pacemaker, to predict the target label based on the shortcut rather than\nreal discriminative features. This is common in medical imaging, where\ntreatment and clinical annotations correlate with disease labels, making them\neasy shortcuts to predict disease. We propose a novel detection and\nquantification of the impact of potential shortcut features via a fast\ndiffusion-based counterfactual image generation that can synthetically remove\nor add shortcuts. Via a novel inpainting-based modification we spatially limit\nthe changes made with no extra inference step, encouraging the removal of\nspatially constrained shortcut features while ensuring that the shortcut-free\ncounterfactuals preserve their remaining image features to a high degree. Using\nthese, we assess how shortcut features influence model predictions.\n This is enabled by our second contribution: An efficient diffusion-based\ncounterfactual explanation method with significant inference speed-up at\ncomparable image quality as state-of-the-art. We confirm this on two large\nchest X-ray datasets, a skin lesion dataset, and CelebA.\n","authors":["Nina Weng","Paraskevas Pegios","Aasa Feragen","Eike Petersen","Siavash Bigdeli"],"pdf_url":"https://arxiv.org/pdf/2312.14223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06709v2","updated":"2023-12-21T13:35:49Z","published":"2023-12-10T17:07:29Z","title":"AM-RADIO: Agglomerative Model -- Reduce All Domains Into One","summary":" A handful of visual foundation models (VFMs) have recently emerged as the\nbackbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are\ntrained with distinct objectives, exhibiting unique characteristics for various\ndownstream tasks. We find that despite their conceptual differences, these\nmodels can be effectively merged into a unified model through multi-teacher\ndistillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All\nDomains Into One). This integrative approach not only surpasses the performance\nof individual teacher models but also amalgamates their distinctive features,\nsuch as zero-shot vision-language comprehension, detailed pixel-level\nunderstanding, and open vocabulary segmentation capabilities. In pursuit of the\nmost hardware-efficient backbone, we evaluated numerous architectures in our\nmulti-teacher distillation pipeline using the same training recipe. This led to\nthe development of a novel architecture (E-RADIO) that exceeds the performance\nof its predecessors and is at least 7x faster than the teacher models. Our\ncomprehensive benchmarking process covers downstream tasks including ImageNet\nclassification, ADE20k semantic segmentation, COCO object detection and\nLLaVa-1.5 framework.\n Code: https://github.com/NVlabs/RADIO\n","authors":["Mike Ranzinger","Greg Heinrich","Jan Kautz","Pavlo Molchanov"],"pdf_url":"https://arxiv.org/pdf/2312.06709v2.pdf","comment":"Version 2: Added more acknowledgements and updated table 7 with more\n recent results. Ensured that the link in the abstract to our code is working\n properly"},{"id":"http://arxiv.org/abs/2303.02370v3","updated":"2023-12-21T13:03:05Z","published":"2023-03-04T10:14:47Z","title":"Self-Supervised Learning for Place Representation Generalization across\n Appearance Changes","summary":" Visual place recognition is a key to unlocking spatial navigation for\nanimals, humans and robots. While state-of-the-art approaches are trained in a\nsupervised manner and therefore hardly capture the information needed for\ngeneralizing to unusual conditions, we argue that self-supervised learning may\nhelp abstracting the place representation so that it can be foreseen,\nirrespective of the conditions. More precisely, in this paper, we investigate\nlearning features that are robust to appearance modifications while sensitive\nto geometric transformations in a self-supervised manner. This dual-purpose\ntraining is made possible by combining the two self-supervision main paradigms,\n\\textit{i.e.} contrastive and predictive learning. Our results on standard\nbenchmarks reveal that jointly learning such appearance-robust and\ngeometry-sensitive image descriptors leads to competitive visual place\nrecognition results across adverse seasonal and illumination conditions,\nwithout requiring any human-annotated labels.\n","authors":["Mohamed Adel Musallam","Vincent Gaudillière","Djamila Aouada"],"pdf_url":"https://arxiv.org/pdf/2303.02370v3.pdf","comment":"11 pages, 6 figures, WACV 2024"},{"id":"http://arxiv.org/abs/2312.14218v1","updated":"2023-12-21T12:49:36Z","published":"2023-12-21T12:49:36Z","title":"AutoAugment Input Transformation for Highly Transferable Targeted\n Attacks","summary":" Deep Neural Networks (DNNs) are widely acknowledged to be susceptible to\nadversarial examples, wherein imperceptible perturbations are added to clean\nexamples through diverse input transformation attacks. However, these methods\noriginally designed for non-targeted attacks exhibit low success rates in\ntargeted attacks. Recent targeted adversarial attacks mainly pay attention to\ngradient optimization, attempting to find the suitable perturbation direction.\nHowever, few of them are dedicated to input transformation.In this work, we\nobserve a positive correlation between the logit/probability of the target\nclass and diverse input transformation methods in targeted attacks. To this\nend, we propose a novel targeted adversarial attack called AutoAugment Input\nTransformation (AAIT). Instead of relying on hand-made strategies, AAIT\nsearches for the optimal transformation policy from a transformation space\ncomprising various operations. Then, AAIT crafts adversarial examples using the\nfound optimal transformation policy to boost the adversarial transferability in\ntargeted attacks. Extensive experiments conducted on CIFAR-10 and\nImageNet-Compatible datasets demonstrate that the proposed AAIT surpasses other\ntransfer-based targeted attacks significantly.\n","authors":["Haobo Lu","Xin Liu","Kun He"],"pdf_url":"https://arxiv.org/pdf/2312.14218v1.pdf","comment":"10 pages, 6 figures"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2210.10619v2","updated":"2023-12-21T17:18:44Z","published":"2022-10-05T13:48:19Z","title":"Restricted Bernoulli Matrix Factorization: Balancing the trade-off\n between prediction accuracy and coverage in classification based\n collaborative filtering","summary":" Reliability measures associated with the prediction of the machine learning\nmodels are critical to strengthening user confidence in artificial\nintelligence. Therefore, those models that are able to provide not only\npredictions, but also reliability, enjoy greater popularity. In the field of\nrecommender systems, reliability is crucial, since users tend to prefer those\nrecommendations that are sure to interest them, that is, high predictions with\nhigh reliabilities. In this paper, we propose Restricted Bernoulli Matrix\nFactorization (ResBeMF), a new algorithm aimed at enhancing the performance of\nclassification-based collaborative filtering. The proposed model has been\ncompared to other existing solutions in the literature in terms of prediction\nquality (Mean Absolute Error and accuracy scores), prediction quantity\n(coverage score) and recommendation quality (Mean Average Precision score). The\nexperimental results demonstrate that the proposed model provides a good\nbalance in terms of the quality measures used compared to other recommendation\nmodels.\n","authors":["Ángel González-Prieto","Abraham Gutiérrez","Fernando Ortega","Raúl Lara-Cabrera"],"pdf_url":"https://arxiv.org/pdf/2210.10619v2.pdf","comment":"Several changes performed, including a title change. 21 pages, 7\n figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.14037v1","updated":"2023-12-21T17:03:26Z","published":"2023-12-21T17:03:26Z","title":"Neural Contextual Bandits for Personalized Recommendation","summary":" In the dynamic landscape of online businesses, recommender systems are\npivotal in enhancing user experiences. While traditional approaches have relied\non static supervised learning, the quest for adaptive, user-centric\nrecommendations has led to the emergence of the formulation of contextual\nbandits. This tutorial investigates the contextual bandits as a powerful\nframework for personalized recommendations. We delve into the challenges,\nadvanced algorithms and theories, collaborative strategies, and open challenges\nand future prospects within this field. Different from existing related\ntutorials, (1) we focus on the exploration perspective of contextual bandits to\nalleviate the ``Matthew Effect'' in the recommender systems, i.e., the rich get\nricher and the poor get poorer, concerning the popularity of items; (2) in\naddition to the conventional linear contextual bandits, we will also dedicated\nto neural contextual bandits which have emerged as an important branch in\nrecent years, to investigate how neural networks benefit contextual bandits for\npersonalized recommendation both empirically and theoretically; (3) we will\ncover the latest topic, collaborative neural contextual bandits, to incorporate\nboth user heterogeneity and user correlations customized for recommender\nsystem; (4) we will provide and discuss the new emerging challenges and open\nquestions for neural contextual bandits with applications in the personalized\nrecommendation, especially for large neural models.\n","authors":["Yikun Ban","Yunzhe Qi","Jingrui He"],"pdf_url":"https://arxiv.org/pdf/2312.14037v1.pdf","comment":"WWW'24 Tutorial"},{"id":"http://arxiv.org/abs/2312.10623v2","updated":"2023-12-21T16:38:23Z","published":"2023-12-17T06:39:10Z","title":"A Survey on Query-based API Recommendation","summary":" Application Programming Interfaces (APIs) are designed to help developers\nbuild software more effectively. Recommending the right APIs for specific tasks\nhas gained increasing attention among researchers and developers in recent\nyears. To comprehensively understand this research domain, we have surveyed to\nanalyze API recommendation studies published in the last 10 years. Our study\nbegins with an overview of the structure of API recommendation tools.\nSubsequently, we systematically analyze prior research and pose four key\nresearch questions. For RQ1, we examine the volume of published papers and the\nvenues in which these papers appear within the API recommendation field. In\nRQ2, we categorize and summarize the prevalent data sources and collection\nmethods employed in API recommendation research. In RQ3, we explore the types\nof data and common data representations utilized by API recommendation\napproaches. We also investigate the typical data extraction procedures and\ncollection approaches employed by the existing approaches. RQ4 delves into the\nmodeling techniques employed by API recommendation approaches, encompassing\nboth statistical and deep learning models. Additionally, we compile an overview\nof the prevalent ranking strategies and evaluation metrics used for assessing\nAPI recommendation tools. Drawing from our survey findings, we identify current\nchallenges in API recommendation research that warrant further exploration,\nalong with potential avenues for future research.\n","authors":["Moshi Wei","Nima Shiri Harzevili","Alvine Boaye Belle","Junjie Wang","Lin Shi","Jinqiu Yang","Song Wang","Ming Zhen"," Jiang"],"pdf_url":"https://arxiv.org/pdf/2312.10623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.01304v2","updated":"2023-12-21T12:33:16Z","published":"2023-11-02T15:18:00Z","title":"VM-Rec: A Variational Mapping Approach for Cold-start User\n Recommendation","summary":" The cold-start problem is a common challenge for most recommender systems.\nThe practical application of most cold-start methods is hindered by the\ndeficiency in auxiliary content information for users. Moreover, most methods\nnecessitate simultaneous updates to the extensive parameters of recommender\nmodels, leading to significant training costs, particularly in large-scale\nindustrial scenarios. We observe that the model can generate expressive\nembeddings for warm users with relatively more interactions. Initially, these\nusers were cold-start users, and after transitioning to warm users, they\nexhibit clustering patterns in their embeddings with consistent initial\ninteractions. Based on this motivation, we propose a Variational Mapping\napproach for cold-start user Recommendation (VM-Rec), mapping from few initial\ninteractions to expressive embeddings for cold-start users. Specifically, we\nencode the initial interactions into a latent representation, where each\ndimension disentangledly signifies the degree of association with each warm\nuser. Subsequently, we utilize this latent representation as the parameters for\nthe mapping function, mapping (decoding) it into an expressive embedding, which\ncan be integrated into a pre-trained recommender model directly. Our method is\nevaluated on three datasets using the same base model, demonstrating superior\nperformance compared to other popular cold-start methods.\n","authors":["Linan Zheng","Jiale Chen","Pengsheng Liu","Guangfa Zhang","Jinyun Fang"],"pdf_url":"https://arxiv.org/pdf/2311.01304v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.02047v2","updated":"2023-12-21T11:30:54Z","published":"2022-05-04T13:13:52Z","title":"Hyperbolic Relevance Matching for Neural Keyphrase Extraction","summary":" Keyphrase extraction is a fundamental task in natural language processing and\ninformation retrieval that aims to extract a set of phrases with important\ninformation from a source document. Identifying important keyphrase is the\ncentral component of the keyphrase extraction task, and its main challenge is\nhow to represent information comprehensively and discriminate importance\naccurately. In this paper, to address these issues, we design a new hyperbolic\nmatching model (HyperMatch) to represent phrases and documents in the same\nhyperbolic space and explicitly estimate the phrase-document relevance via the\nPoincar\\'e distance as the important score of each phrase. Specifically, to\ncapture the hierarchical syntactic and semantic structure information,\nHyperMatch takes advantage of the hidden representations in multiple layers of\nRoBERTa and integrates them as the word embeddings via an adaptive mixing\nlayer. Meanwhile, considering the hierarchical structure hidden in the\ndocument, HyperMatch embeds both phrases and documents in the same hyperbolic\nspace via a hyperbolic phrase encoder and a hyperbolic document encoder. This\nstrategy can further enhance the estimation of phrase-document relevance due to\nthe good properties of hyperbolic space. In this setting, the keyphrase\nextraction can be taken as a matching problem and effectively implemented by\nminimizing a hyperbolic margin-based triplet loss. Extensive experiments are\nconducted on six benchmarks and demonstrate that HyperMatch outperforms the\nstate-of-the-art baselines.\n","authors":["Mingyang Song","Yi Feng","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2205.02047v2.pdf","comment":"12 pages, 3 figures, Accepted by NAACL2022"},{"id":"http://arxiv.org/abs/2308.01196v2","updated":"2023-12-21T11:27:00Z","published":"2023-07-27T22:57:55Z","title":"Sustainable Transparency in Recommender Systems: Bayesian Ranking of\n Images for Explainability","summary":" Recommender Systems have become crucial in the modern world, commonly guiding\nusers towards relevant content or products, and having a large influence over\nthe decisions of users and citizens. However, ensuring transparency and user\ntrust in these systems remains a challenge; personalized explanations have\nemerged as a solution, offering justifications for recommendations. Among the\nexisting approaches for generating personalized explanations, using existing\nvisual content created by users is a promising option to maximize transparency\nand user trust. State-of-the-art models that follow this approach, despite\nleveraging highly optimized architectures, employ surrogate learning tasks that\ndo not efficiently model the objective of ranking images as explanations for a\ngiven recommendation; this leads to a suboptimal training process with high\ncomputational costs that may not be reduced without affecting model\nperformance. This work presents BRIE, a novel model where we leverage Bayesian\nPairwise Ranking to enhance the training process, allowing us to consistently\noutperform state-of-the-art models in six real-world datasets while reducing\nits model size by up to 64 times and its CO${_2}$ emissions by up to 75% in\ntraining and inference.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Berta Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2308.01196v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.09749v5","updated":"2023-12-21T10:56:50Z","published":"2021-10-19T05:48:22Z","title":"Importance Estimation from Multiple Perspectives for Keyphrase\n Extraction","summary":" Keyphrase extraction is a fundamental task in Natural Language Processing,\nwhich usually contains two main parts: candidate keyphrase extraction and\nkeyphrase importance estimation. From the view of human understanding\ndocuments, we typically measure the importance of phrase according to its\nsyntactic accuracy, information saliency, and concept consistency\nsimultaneously. However, most existing keyphrase extraction approaches only\nfocus on the part of them, which leads to biased results. In this paper, we\npropose a new approach to estimate the importance of keyphrase from multiple\nperspectives (called as \\textit{KIEMP}) and further improve the performance of\nkeyphrase extraction. Specifically, \\textit{KIEMP} estimates the importance of\nphrase with three modules: a chunking module to measure its syntactic accuracy,\na ranking module to check its information saliency, and a matching module to\njudge the concept (i.e., topic) consistency between phrase and the whole\ndocument. These three modules are seamlessly jointed together via an end-to-end\nmulti-task learning model, which is helpful for three parts to enhance each\nother and balance the effects of three perspectives. Experimental results on\nsix benchmark datasets show that \\textit{KIEMP} outperforms the existing\nstate-of-the-art keyphrase extraction approaches in most cases.\n","authors":["Mingyang Song","Liping Jing","Lin Xiao"],"pdf_url":"https://arxiv.org/pdf/2110.09749v5.pdf","comment":"11 pages, 2 figures, Accepted by EMNLP2021"},{"id":"http://arxiv.org/abs/2312.13711v1","updated":"2023-12-21T10:23:16Z","published":"2023-12-21T10:23:16Z","title":"A Learning oriented DLP System based on Classification Model","summary":" Data is the key asset for organizations and data sharing is lifeline for\norganization growth; which may lead to data loss. Data leakage is the most\ncritical issue being faced by organizations. In order to mitigate the data\nleakage issues data leakage prevention systems (DLPSs) are deployed at various\nlevels by the organizations. DLPSs are capable to protect all kind of data i.e.\nDAR, DIM/DIT, DIU. Statistical analysis, regular expression, data\nfingerprinting are common approaches exercised in DLP system. Out of these\ntechniques; statistical analysis approach is most appropriate for proposed DLP\nmodel of data security. This paper defines a statistical DLP model for document\nclassification. Model uses various statistical approaches like TF-IDF (Term\nFrequency- Inverse Document Frequency) a renowned term count/weighing function,\nVectorization, Gradient boosting document classification etc. to classify the\ndocuments before allowing any access to it. Machine learning is used to test\nand train the model. Proposed model also introduces an extremely efficient and\nmore accurate approach; IGBCA (Improvised Gradient Boosting Classification\nAlgorithm); for document classification, to prevent them from possible data\nleakage. Results depicts that proposed model can classify documents with high\naccuracy and on basis of which data can be prevented from being loss.\n","authors":["Kishu Gupta","Ashwani Kush"],"pdf_url":"https://arxiv.org/pdf/2312.13711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13695v1","updated":"2023-12-21T09:45:43Z","published":"2023-12-21T09:45:43Z","title":"Unexplored Frontiers: A Review of Empirical Studies of Exploratory\n Search","summary":" This article reviews how empirical research of exploratory search is\nconducted. We investigated aspects of interdisciplinarity, study settings and\nevaluation methodologies from a systematically selected sample of 231\npublications from 2010-2021, including a total of 172 articles with empirical\nstudies. Our results show that exploratory search is highly interdisciplinary,\nwith the most frequently occurring publication venues including high impact\nvenues in information science, information systems and human-computer\ninteraction. However, taken in aggregate, the breadth of study settings\ninvestigated was limited. We found that a majority of studies (77%) focused on\nevaluating novel retrieval systems as opposed to investigating users' search\nprocesses. Furthermore, a disproportionate number of studies were based on\nscientific literature search (20.7%), a majority of which only considered\nsearching for Computer Science articles. Study participants were generally from\nconvenience samples, with 75% of studies composed exclusively of students and\nother academics. The methodologies used for evaluation were mostly\nquantitative, but lacked consistency between studies and validated\nquestionnaires were rarely used. In discussion, we offer a critical analysis of\nour findings and suggest potential improvements for future exploratory search\nstudies.\n","authors":["Alan Medlar","Denis Kotkov","Dorota Glowacka"],"pdf_url":"https://arxiv.org/pdf/2312.13695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18608v2","updated":"2023-12-21T09:11:48Z","published":"2023-10-28T06:31:06Z","title":"Embedding in Recommender Systems: A Survey","summary":" Recommender systems have become an essential component of many online\nplatforms, providing personalized recommendations to users. A crucial aspect is\nembedding techniques that coverts the high-dimensional discrete features, such\nas user and item IDs, into low-dimensional continuous vectors and can enhance\nthe recommendation performance. Applying embedding techniques captures complex\nentity relationships and has spurred substantial research. In this survey, we\nprovide an overview of the recent literature on embedding techniques in\nrecommender systems. This survey covers embedding methods like collaborative\nfiltering, self-supervised learning, and graph-based techniques. Collaborative\nfiltering generates embeddings capturing user-item preferences, excelling in\nsparse data. Self-supervised methods leverage contrastive or generative\nlearning for various tasks. Graph-based techniques like node2vec exploit\ncomplex relationships in network-rich environments. Addressing the scalability\nchallenges inherent to embedding methods, our survey delves into innovative\ndirections within the field of recommendation systems. These directions aim to\nenhance performance and reduce computational complexity, paving the way for\nimproved recommender systems. Among these innovative approaches, we will\nintroduce Auto Machine Learning (AutoML), hash techniques, and quantization\ntechniques in this survey. We discuss various architectures and techniques and\nhighlight the challenges and future directions in these aspects. This survey\naims to provide a comprehensive overview of the state-of-the-art in this\nrapidly evolving field and serve as a useful resource for researchers and\npractitioners working in the area of recommender systems.\n","authors":["Xiangyu Zhao","Maolin Wang","Xinjian Zhao","Jiansheng Li","Shucheng Zhou","Dawei Yin","Qing Li","Jiliang Tang","Ruocheng Guo"],"pdf_url":"https://arxiv.org/pdf/2310.18608v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.05722v2","updated":"2023-12-21T08:20:40Z","published":"2023-07-10T11:29:41Z","title":"Exploring Large Language Model for Graph Data Understanding in Online\n Job Recommendations","summary":" Large Language Models (LLMs) have revolutionized natural language processing\ntasks, demonstrating their exceptional capabilities in various domains.\nHowever, their potential for behavior graph understanding in job\nrecommendations remains largely unexplored. This paper focuses on unveiling the\ncapability of large language models in understanding behavior graphs and\nleveraging this understanding to enhance recommendations in online recruitment,\nincluding the promotion of out-of-distribution (OOD) application. We present a\nnovel framework that harnesses the rich contextual information and semantic\nrepresentations provided by large language models to analyze behavior graphs\nand uncover underlying patterns and relationships. Specifically, we propose a\nmeta-path prompt constructor that leverages LLM recommender to understand\nbehavior graphs for the first time and design a corresponding path augmentation\nmodule to alleviate the prompt bias introduced by path-based sequence input. By\nleveraging this capability, our framework enables personalized and accurate job\nrecommendations for individual users. We evaluate the effectiveness of our\napproach on a comprehensive dataset and demonstrate its ability to improve the\nrelevance and quality of recommended quality. This research not only sheds\nlight on the untapped potential of large language models but also provides\nvaluable insights for developing advanced recommendation systems in the\nrecruitment market. The findings contribute to the growing field of natural\nlanguage processing and offer practical implications for enhancing job search\nexperiences. We release the code at https://github.com/WLiK/GLRec.\n","authors":["Likang Wu","Zhaopeng Qiu","Zhi Zheng","Hengshu Zhu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2307.05722v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08154v2","updated":"2023-12-21T03:53:38Z","published":"2023-09-15T04:39:11Z","title":"Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking","summary":" The core of cross-modal matching is to accurately measure the similarity\nbetween different modalities in a unified representation space. However,\ncompared to textual descriptions of a certain perspective, the visual modality\nhas more semantic variations. So, images are usually associated with multiple\ntextual captions in databases. Although popular symmetric embedding methods\nhave explored numerous modal interaction approaches, they often learn toward\nincreasing the average expression probability of multiple semantic variations\nwithin image embeddings. Consequently, information entropy in embeddings is\nincreased, resulting in redundancy and decreased accuracy. In this work, we\npropose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the\ninformation entropy. Specifically, we obtain a set of heterogeneous visual\nsub-embeddings through dynamic orthogonal constraint loss. To encourage the\ngenerated candidate embeddings to capture various semantic variations, we\nconstruct a mixed distribution and employ a variance-aware weighting loss to\nassign different weights to the optimization process. In addition, we develop a\nFast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and\nenhance the performance. We compare the performance with existing set-based\nmethod using four image feature encoders and two text feature encoders on three\nbenchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role\nof different components by ablation studies and perform a sensitivity analysis\nof the hyperparameters. The qualitative analysis of visualized bidirectional\nretrieval and attention maps further demonstrates the ability of our method to\nencode semantic variations.\n","authors":["Wenzhang Wei","Zhipeng Gui","Changguang Wu","Anqi Zhao","Dehua Peng","Huayi Wu"],"pdf_url":"https://arxiv.org/pdf/2309.08154v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13557v1","updated":"2023-12-21T03:50:09Z","published":"2023-12-21T03:50:09Z","title":"Empowering Few-Shot Recommender Systems with Large Language Models --\n Enhanced Representations","summary":" Recommender systems utilizing explicit feedback have witnessed significant\nadvancements and widespread applications over the past years. However,\ngenerating recommendations in few-shot scenarios remains a persistent\nchallenge. Recently, large language models (LLMs) have emerged as a promising\nsolution for addressing natural language processing (NLP) tasks, thereby\noffering novel insights into tackling the few-shot scenarios encountered by\nexplicit feedback-based recommender systems. To bridge recommender systems and\nLLMs, we devise a prompting template that generates user and item\nrepresentations based on explicit feedback. Subsequently, we integrate these\nLLM-processed representations into various recommendation models to evaluate\ntheir significance across diverse recommendation tasks. Our ablation\nexperiments and case study analysis collectively demonstrate the effectiveness\nof LLMs in processing explicit feedback, highlighting that LLMs equipped with\ngenerative and logical reasoning capabilities can effectively serve as a\ncomponent of recommender systems to enhance their performance in few-shot\nscenarios. Furthermore, the broad adaptability of LLMs augments the\ngeneralization potential of recommender models, despite certain inherent\nconstraints. We anticipate that our study can inspire researchers to delve\ndeeper into the multifaceted dimensions of LLMs's involvement in recommender\nsystems and contribute to the advancement of the explicit feedback-based\nrecommender systems field.\n","authors":["Zhoumeng Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13557v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2311.10501v2","updated":"2023-12-21T02:13:36Z","published":"2023-11-17T13:02:25Z","title":"Collaborative Word-based Pre-trained Item Representation for\n Transferable Recommendation","summary":" Item representation learning (IRL) plays an essential role in recommender\nsystems, especially for sequential recommendation. Traditional sequential\nrecommendation models usually utilize ID embeddings to represent items, which\nare not shared across different domains and lack the transferable ability.\nRecent studies use pre-trained language models (PLM) for item text embeddings\n(text-based IRL) that are universally applicable across domains. However, the\nexisting text-based IRL is unaware of the important collaborative filtering\n(CF) information. In this paper, we propose CoWPiRec, an approach of\nCollaborative Word-based Pre-trained item representation for Recommendation. To\neffectively incorporate CF information into text-based IRL, we convert the\nitem-level interaction data to a word graph containing word-level\ncollaborations. Subsequently, we design a novel pre-training task to align the\nword-level semantic- and CF-related item representation. Extensive experimental\nresults on multiple public datasets demonstrate that compared to\nstate-of-the-art transferable sequential recommenders, CoWPiRec achieves\nsignificantly better performances in both fine-tuning and zero-shot settings\nfor cross-scenario recommendation and effectively alleviates the cold-start\nissue. The code is available at: https://github.com/ysh-1998/CoWPiRec.\n","authors":["Shenghao Yang","Chenyang Wang","Yankai Liu","Kangping Xu","Weizhi Ma","Yiqun Liu","Min Zhang","Haitao Zeng","Junlan Feng","Chao Deng"],"pdf_url":"https://arxiv.org/pdf/2311.10501v2.pdf","comment":"Accepted by ICDM 2023"},{"id":"http://arxiv.org/abs/2304.06762v3","updated":"2023-12-21T00:18:48Z","published":"2023-04-13T18:04:19Z","title":"Shall We Pretrain Autoregressive Language Models with Retrieval? A\n Comprehensive Study","summary":" Large decoder-only language models (LMs) can be largely improved in terms of\nperplexity by retrieval (e.g., RETRO), but its impact on text generation\nquality and downstream task accuracy is unclear. Thus, it is still an open\nquestion: shall we pretrain large autoregressive LMs with retrieval? To answer\nit, we perform a comprehensive study on a scalable pre-trained\nretrieval-augmented LM (i.e., RETRO) compared with standard GPT and\nretrieval-augmented GPT incorporated at fine-tuning or inference stages. We\nfirst provide the recipe to reproduce RETRO up to 9.5B parameters while\nretrieving a text corpus with 330B tokens. Based on that, we have the following\nnovel findings: i) RETRO outperforms GPT on text generation with much less\ndegeneration (i.e., repetition), moderately higher factual accuracy, and\nslightly lower toxicity with a nontoxic retrieval database. ii) On the LM\nEvaluation Harness benchmark, RETRO largely outperforms GPT on\nknowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,\nwe introduce a simple variant of the model, RETRO++, which largely improves\nopen-domain QA results of original RETRO (e.g., EM score +8.6 on Natural\nQuestion) and significantly outperforms retrieval-augmented GPT in both\nfine-tuning and zero-shot evaluation settings. Our findings highlight the\npromising direction of pretraining autoregressive LMs with retrieval as future\nfoundation models. We release our code and model at:\nhttps://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md\n","authors":["Boxin Wang","Wei Ping","Peng Xu","Lawrence McAfee","Zihan Liu","Mohammad Shoeybi","Yi Dong","Oleksii Kuchaiev","Bo Li","Chaowei Xiao","Anima Anandkumar","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2304.06762v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.14335v1","updated":"2023-12-21T23:42:13Z","published":"2023-12-21T23:42:13Z","title":"Context-aware Decoding Reduces Hallucination in Query-focused\n Summarization","summary":" Query-focused summarization (QFS) aims to provide a summary of a single\ndocument/multi documents that can satisfy the information needs of a given\nquery. It is useful for various real-world applications, such as abstractive\nsnippet generation or more recent retrieval augmented generation (RAG). A\nprototypical QFS pipeline consists of a retriever (sparse or dense retrieval)\nand a generator (usually a large language model). However, applying large\nlanguage models (LLM) potentially leads to hallucinations, especially when the\nevidence contradicts the prior belief of LLMs. There has been growing interest\nin developing new decoding methods to improve generation quality and reduce\nhallucination. In this work, we conduct a large-scale reproducibility on one\nrecently proposed decoding method -- Context-aware Decoding (CAD). In addition\nto replicating CAD's experiments on news summarization datasets, we include\nexperiments on QFS datasets, and conduct more rigorous analysis on\ncomputational complexity and hyperparameter sensitivity. Experiments with eight\ndifferent language models show that performance-wise, CAD improves QFS quality\nby (1) reducing factuality errors/hallucinations while (2) mostly retaining the\nmatch of lexical patterns, measured by ROUGE scores, while also at a cost of\nincreased inference-time FLOPs and reduced decoding speed. The code\nimplementation based on Huggingface Library is made available\nhttps://github.com/zhichaoxu-shufe/context-aware-decoding-qfs\n","authors":["Zhichao Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14335v1.pdf","comment":"technical report"},{"id":"http://arxiv.org/abs/2308.11998v2","updated":"2023-12-21T21:03:23Z","published":"2023-08-23T08:35:59Z","title":"Economic Recommender Systems -- A Systematic Review","summary":" Many of today's online services provide personalized recommendations to their\nusers. Such recommendations are typically designed to serve certain user needs,\ne.g., to quickly find relevant content in situations of information overload.\nCorrespondingly, the academic literature in the field largely focuses on the\nvalue of recommender systems for the end user. In this context, one underlying\nassumption is that the improved service that is achieved through the\nrecommendations will in turn positively impact the organization's goals, e.g.,\nin the form of higher customer retention or loyalty. However, in reality,\nrecommender systems can be used to target organizational economic goals more\ndirectly by incorporating monetary considerations such as price awareness and\nprofitability aspects into the underlying recommendation models. In this work,\nwe survey the existing literature on what we call Economic Recommender Systems\nbased on a systematic review approach that helped us identify 133 relevant\npapers. We first categorize existing works along different dimensions and then\nreview the most important technical approaches from the literature.\nFurthermore, we discuss common methodologies to evaluate such systems and\nfinally outline the limitations of today's research and future directions.\n","authors":["Alvise De Biasio","Nicolò Navarin","Dietmar Jannach"],"pdf_url":"https://arxiv.org/pdf/2308.11998v2.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2312.14141v1","updated":"2023-12-21T18:57:54Z","published":"2023-12-21T18:57:54Z","title":"Quantum Algorithms for the Pathwise Lasso","summary":" We present a novel quantum high-dimensional linear regression algorithm with\nan $\\ell_1$-penalty based on the classical LARS (Least Angle Regression)\npathwise algorithm. Similarly to available classical numerical algorithms for\nLasso, our quantum algorithm provides the full regularisation path as the\npenalty term varies, but quadratically faster per iteration under specific\nconditions. A quadratic speedup on the number of features/predictors $d$ is\npossible by using the simple quantum minimum-finding subroutine from D\\\"urr and\nHoyer (arXiv'96) in order to obtain the joining time at each iteration. We then\nimprove upon this simple quantum algorithm and obtain a quadratic speedup both\nin the number of features $d$ and the number of observations $n$ by using the\nrecent approximate quantum minimum-finding subroutine from Chen and de Wolf\n(ICALP'23). As one of our main contributions, we construct a quantum unitary\nbased on quantum amplitude estimation to approximately compute the joining\ntimes to be searched over by the approximate quantum minimum finding. Since the\njoining times are no longer exactly computed, it is no longer clear that the\nresulting approximate quantum algorithm obtains a good solution. As our second\nmain contribution, we prove, via an approximate version of the KKT conditions\nand a duality gap, that the LARS algorithm (and therefore our quantum\nalgorithm) is robust to errors. This means that it still outputs a path that\nminimises the Lasso cost function up to a small error if the joining times are\nonly approximately computed. Finally, in the model where the observations are\ngenerated by an underlying linear model with an unknown coefficient vector, we\nprove bounds on the difference between the unknown coefficient vector and the\napproximate Lasso solution, which generalises known results about convergence\nrates in classical statistical learning theory analysis.\n","authors":["João F. Doriguello","Debbie Lim","Chi Seng Pun","Patrick Rebentrost","Tushar Vaidya"],"pdf_url":"https://arxiv.org/pdf/2312.14141v1.pdf","comment":"44 pages"},{"id":"http://arxiv.org/abs/2312.14136v1","updated":"2023-12-21T18:55:22Z","published":"2023-12-21T18:55:22Z","title":"Fast kernel half-space depth for data with non-convex supports","summary":" Data depth is a statistical function that generalizes order and quantiles to\nthe multivariate setting and beyond, with applications spanning over\ndescriptive and visual statistics, anomaly detection, testing, etc. The\ncelebrated halfspace depth exploits data geometry via an optimization program\nto deliver properties of invariances, robustness, and non-parametricity.\nNevertheless, it implicitly assumes convex data supports and requires\nexponential computational cost. To tackle distribution's multimodality, we\nextend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We\nshow that the obtained depth is intuitive and establish its consistency with\nprovable concentration bounds that allow for homogeneity testing. The proposed\ndepth can be computed using manifold gradient making faster than halfspace\ndepth by several orders of magnitude. The performance of our depth is\ndemonstrated through numerical simulations as well as applications such as\nanomaly detection on real data and homogeneity testing.\n","authors":["Arturo Castellanos","Pavlo Mozharovskyi","Florence d'Alché-Buc","Hicham Janati"],"pdf_url":"https://arxiv.org/pdf/2312.14136v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2312.14134v1","updated":"2023-12-21T18:55:05Z","published":"2023-12-21T18:55:05Z","title":"Diffusion Reward: Learning Rewards via Conditional Video Diffusion","summary":" Learning rewards from expert videos offers an affordable and effective\nsolution to specify the intended behaviors for reinforcement learning tasks. In\nthis work, we propose Diffusion Reward, a novel framework that learns rewards\nfrom expert videos via conditional video diffusion models for solving complex\nvisual RL problems. Our key insight is that lower generative diversity is\nobserved when conditioned on expert trajectories. Diffusion Reward is\naccordingly formalized by the negative of conditional entropy that encourages\nproductive exploration of expert-like behaviors. We show the efficacy of our\nmethod over 10 robotic manipulation tasks from MetaWorld and Adroit with visual\ninput and sparse reward. Moreover, Diffusion Reward could even solve unseen\ntasks successfully and effectively, largely surpassing baseline methods.\nProject page and code: https://diffusion-reward.github.io/.\n","authors":["Tao Huang","Guangqi Jiang","Yanjie Ze","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14134v1.pdf","comment":"Project page and code: https://diffusion-reward.github.io/"},{"id":"http://arxiv.org/abs/2211.01877v2","updated":"2023-12-21T18:51:49Z","published":"2022-11-03T15:07:51Z","title":"Convex Clustering through MM: An Efficient Algorithm to Perform\n Hierarchical Clustering","summary":" Convex clustering is a modern method with both hierarchical and $k$-means\nclustering characteristics. Although convex clustering can capture complex\nclustering structures hidden in data, the existing convex clustering algorithms\nare not scalable to large data sets with sample sizes greater than several\nthousands. Moreover, it is known that convex clustering sometimes fails to\nproduce a complete hierarchical clustering structure. This issue arises if\nclusters split up or the minimum number of possible clusters is larger than the\ndesired number of clusters. In this paper, we propose convex clustering through\nmajorization-minimization (CCMM) -- an iterative algorithm that uses cluster\nfusions and a highly efficient updating scheme derived using diagonal\nmajorization. Additionally, we explore different strategies to ensure that the\nhierarchical clustering structure terminates in a single cluster. With a\ncurrent desktop computer, CCMM efficiently solves convex clustering problems\nfeaturing over one million objects in seven-dimensional space, achieving a\nsolution time of 51 seconds on average.\n","authors":["Daniel J. W. Touw","Patrick J. F. Groenen","Yoshikazu Terada"],"pdf_url":"https://arxiv.org/pdf/2211.01877v2.pdf","comment":"27 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.14129v1","updated":"2023-12-21T18:49:22Z","published":"2023-12-21T18:49:22Z","title":"WellFactor: Patient Profiling using Integrative Embedding of Healthcare\n Data","summary":" In the rapidly evolving healthcare industry, platforms now have access to not\nonly traditional medical records, but also diverse data sets encompassing\nvarious patient interactions, such as those from healthcare web portals. To\naddress this rich diversity of data, we introduce WellFactor: a method that\nderives patient profiles by integrating information from these sources. Central\nto our approach is the utilization of constrained low-rank approximation.\nWellFactor is optimized to handle the sparsity that is often inherent in\nhealthcare data. Moreover, by incorporating task-specific label information,\nour method refines the embedding results, offering a more informed perspective\non patients. One important feature of WellFactor is its ability to compute\nembeddings for new, previously unobserved patient data instantaneously,\neliminating the need to revisit the entire data set or recomputing the\nembedding. Comprehensive evaluations on real-world healthcare data demonstrate\nWellFactor's effectiveness. It produces better results compared to other\nexisting methods in classification performance, yields meaningful clustering of\npatients, and delivers consistent results in patient similarity searches and\npredictions.\n","authors":["Dongjin Choi","Andy Xiang","Ozgur Ozturk","Deep Shrestha","Barry Drake","Hamid Haidarian","Faizan Javed","Haesun Park"],"pdf_url":"https://arxiv.org/pdf/2312.14129v1.pdf","comment":"2023 IEEE International Conference on Big Data (IEEE BigData 2023)"},{"id":"http://arxiv.org/abs/2312.11462v2","updated":"2023-12-21T18:46:59Z","published":"2023-12-18T18:59:46Z","title":"Cascade Speculative Drafting for Even Faster LLM Inference","summary":" Speculative decoding enhances the efficiency of large language models (LLMs)\nby leveraging a draft model to draft for a larger target model to review.\nHowever, drafting in speculative decoding involves slow autoregressive\ngeneration and generating tokens of different importance with the same time\nallocation. These two inefficiencies lead to its suboptimal performance. To\naddress this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a\nnovel approach that employs two types of cascades. The Vertical Cascade\neliminates autoregressive generation from neural models. The Horizontal Cascade\nconstitutes efficient time allocation in drafting with its optimality supported\nby our theoretical analysis. Combining both cascades, our CS. Drafting\nalgorithm has achieved up to 72 percent additional speedup over speculative\ndecoding in our experiments while keeping the same output distribution.\n","authors":["Ziyi Chen","Xiaocong Yang","Jiacheng Lin","Chenkai Sun","Jie Huang","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2312.11462v2.pdf","comment":"Preprint in progress"},{"id":"http://arxiv.org/abs/2310.00526v4","updated":"2023-12-21T18:43:25Z","published":"2023-10-01T00:12:31Z","title":"Are Graph Neural Networks Optimal Approximation Algorithms?","summary":" In this work we design graph neural network architectures that can be used to\nobtain optimal approximation algorithms for a large class of combinatorial\noptimization problems using powerful algorithmic tools from semidefinite\nprogramming (SDP). Concretely, we prove that polynomial-sized message passing\nalgorithms can represent the most powerful polynomial time algorithms for Max\nConstraint Satisfaction Problems assuming the Unique Games Conjecture. We\nleverage this result to construct efficient graph neural network architectures,\nOptGNN, that obtain high-quality approximate solutions on landmark\ncombinatorial optimization problems such as Max Cut and maximum independent\nset. Our approach achieves strong empirical results across a wide range of\nreal-world and synthetic datasets against both neural baselines and classical\nalgorithms. Finally, we take advantage of OptGNN's ability to capture convex\nrelaxations to design an algorithm for producing dual certificates of\noptimality (bounds on the optimal solution) from the learned embeddings of\nOptGNN.\n","authors":["Morris Yau","Eric Lu","Nikolaos Karalias","Jessica Xu","Stefanie Jegelka"],"pdf_url":"https://arxiv.org/pdf/2310.00526v4.pdf","comment":"Updated references, fixed more typos and wording issues"},{"id":"http://arxiv.org/abs/2312.14106v1","updated":"2023-12-21T18:31:33Z","published":"2023-12-21T18:31:33Z","title":"Learning Human-like Representations to Enable Learning Human Values","summary":" How can we build AI systems that are aligned with human values and objectives\nin order to avoid causing harm or violating societal standards for acceptable\nbehavior? Making AI systems learn human-like representations of the world has\nmany known benefits, including improving generalization, robustness to domain\nshifts, and few-shot learning performance, among others. We propose that this\nkind of representational alignment between machine learning (ML) models and\nhumans is also a necessary condition for value alignment, where ML systems\nconform to human values and societal norms. We focus on ethics as one aspect of\nvalue alignment and train multiple ML agents (support vector regression and\nkernel regression) in a multi-armed bandit setting, where rewards are sampled\nfrom a distribution that reflects the morality of the chosen action. We then\nstudy the relationship between each agent's degree of representational\nalignment with humans and their performance when learning to take the most\nethical actions.\n","authors":["Andrea Wynn","Ilia Sucholutsky","Thomas L. Griffiths"],"pdf_url":"https://arxiv.org/pdf/2312.14106v1.pdf","comment":"Paper accepted in Human-Centric Representation Learning workshop at\n AAAI 2024 (https://hcrl-workshop.github.io/2024/)"},{"id":"http://arxiv.org/abs/2307.00764v2","updated":"2023-12-21T18:28:31Z","published":"2023-07-03T06:02:15Z","title":"Hierarchical Open-vocabulary Universal Image Segmentation","summary":" Open-vocabulary image segmentation aims to partition an image into semantic\nregions according to arbitrary text descriptions. However, complex visual\nscenes can be naturally decomposed into simpler parts and abstracted at\nmultiple levels of granularity, introducing inherent segmentation ambiguity.\nUnlike existing methods that typically sidestep this ambiguity and treat it as\nan external factor, our approach actively incorporates a hierarchical\nrepresentation encompassing different semantic-levels into the learning\nprocess. We propose a decoupled text-image fusion mechanism and representation\nlearning modules for both \"things\" and \"stuff\". Additionally, we systematically\nexamine the differences that exist in the textual and visual features between\nthese types of categories. Our resulting model, named HIPIE, tackles\nHIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a\nunified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,\nPascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the\nstate-of-the-art results at various levels of image comprehension, including\nsemantic-level (e.g., semantic segmentation), instance-level (e.g.,\npanoptic/referring segmentation and object detection), as well as part-level\n(e.g., part/subpart segmentation) tasks. Our code is released at\nhttps://github.com/berkeley-hipie/HIPIE.\n","authors":["Xudong Wang","Shufan Li","Konstantinos Kallidromitis","Yusuke Kato","Kazuki Kozuka","Trevor Darrell"],"pdf_url":"https://arxiv.org/pdf/2307.00764v2.pdf","comment":"Project web-page:\n http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/; NeurIPS 2023\n Camera-ready"},{"id":"http://arxiv.org/abs/2312.12067v2","updated":"2023-12-21T18:28:31Z","published":"2023-12-19T11:34:10Z","title":"Optimistic Policy Gradient in Multi-Player Markov Games with a Single\n Controller: Convergence Beyond the Minty Property","summary":" Policy gradient methods enjoy strong practical performance in numerous tasks\nin reinforcement learning. Their theoretical understanding in multiagent\nsettings, however, remains limited, especially beyond two-player competitive\nand potential Markov games. In this paper, we develop a new framework to\ncharacterize optimistic policy gradient methods in multi-player Markov games\nwith a single controller. Specifically, under the further assumption that the\ngame exhibits an equilibrium collapse, in that the marginals of coarse\ncorrelated equilibria (CCE) induce Nash equilibria (NE), we show convergence to\nstationary $\\epsilon$-NE in $O(1/\\epsilon^2)$ iterations, where $O(\\cdot)$\nsuppresses polynomial factors in the natural parameters of the game. Such an\nequilibrium collapse is well-known to manifest itself in two-player zero-sum\nMarkov games, but also occurs even in a class of multi-player Markov games with\nseparable interactions, as established by recent work. As a result, we bypass\nknown complexity barriers for computing stationary NE when either of our\nassumptions fails. Our approach relies on a natural generalization of the\nclassical Minty property that we introduce, which we anticipate to have further\napplications beyond Markov games.\n","authors":["Ioannis Anagnostides","Ioannis Panageas","Gabriele Farina","Tuomas Sandholm"],"pdf_url":"https://arxiv.org/pdf/2312.12067v2.pdf","comment":"To appear at AAAI 2024"},{"id":"http://arxiv.org/abs/2305.18900v2","updated":"2023-12-21T18:22:04Z","published":"2023-05-30T09:58:47Z","title":"One-Line-of-Code Data Mollification Improves Optimization of\n Likelihood-based Generative Models","summary":" Generative Models (GMs) have attracted considerable attention due to their\ntremendous success in various domains, such as computer vision where they are\ncapable to generate impressive realistic-looking images. Likelihood-based GMs\nare attractive due to the possibility to generate new data by a single model\nevaluation. However, they typically achieve lower sample quality compared to\nstate-of-the-art score-based diffusion models (DMs). This paper provides a\nsignificant step in the direction of addressing this limitation. The idea is to\nborrow one of the strengths of score-based DMs, which is the ability to perform\naccurate density estimation in low-density regions and to address manifold\noverfitting by means of data mollification. We connect data mollification\nthrough the addition of Gaussian noise to Gaussian homotopy, which is a\nwell-known technique to improve optimization. Data mollification can be\nimplemented by adding one line of code in the optimization loop, and we\ndemonstrate that this provides a boost in generation quality of\nlikelihood-based GMs, without computational overheads. We report results on\nimage data sets with popular likelihood-based GMs, including variants of\nvariational autoencoders and normalizing flows, showing large improvements in\nFID score.\n","authors":["Ba-Hien Tran","Giulio Franzese","Pietro Michiardi","Maurizio Filippone"],"pdf_url":"https://arxiv.org/pdf/2305.18900v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14095v1","updated":"2023-12-21T18:17:16Z","published":"2023-12-21T18:17:16Z","title":"RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation","summary":" Significant research effort has been devoted in recent years to developing\npersonalized pricing, promotions, and product recommendation algorithms that\ncan leverage rich customer data to learn and earn. Systematic benchmarking and\nevaluation of these causal learning systems remains a critical challenge, due\nto the lack of suitable datasets and simulation environments. In this work, we\npropose a multi-stage model for simulating customer shopping behavior that\ncaptures important sources of heterogeneity, including price sensitivity and\npast experiences. We embedded this model into a working simulation environment\n-- RetailSynth. RetailSynth was carefully calibrated on publicly available\ngrocery data to create realistic synthetic shopping transactions. Multiple\npricing policies were implemented within the simulator and analyzed for impact\non revenue, category penetration, and customer retention. Applied researchers\ncan use RetailSynth to validate causal demand models for multi-category retail\nand to incorporate realistic price sensitivity into emerging benchmarking\nsuites for personalized pricing, promotions, and product recommendations.\n","authors":["Yu Xia","Ali Arian","Sriram Narayanamoorthy","Joshua Mabry"],"pdf_url":"https://arxiv.org/pdf/2312.14095v1.pdf","comment":"30 pages, 8 figures"},{"id":"http://arxiv.org/abs/2305.16150v3","updated":"2023-12-21T18:16:33Z","published":"2023-05-25T15:20:10Z","title":"Unifying GANs and Score-Based Diffusion as Generative Particle Models","summary":" Particle-based deep generative models, such as gradient flows and score-based\ndiffusion models, have recently gained traction thanks to their striking\nperformance. Their principle of displacing particle distributions using\ndifferential equations is conventionally seen as opposed to the previously\nwidespread generative adversarial networks (GANs), which involve training a\npushforward generator network. In this paper we challenge this interpretation,\nand propose a novel framework that unifies particle and adversarial generative\nmodels by framing generator training as a generalization of particle models.\nThis suggests that a generator is an optional addition to any such generative\nmodel. Consequently, integrating a generator into a score-based diffusion model\nand training a GAN without a generator naturally emerge from our framework. We\nempirically test the viability of these original models as proofs of concepts\nof potential applications of our framework.\n","authors":["Jean-Yves Franceschi","Mike Gartrell","Ludovic Dos Santos","Thibaut Issenhuth","Emmanuel de Bézenac","Mickaël Chen","Alain Rakotomamonjy"],"pdf_url":"https://arxiv.org/pdf/2305.16150v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09131v3","updated":"2023-12-21T18:10:28Z","published":"2023-12-14T17:01:58Z","title":"Physics-Informed Neural Network Lyapunov Functions: PDE\n Characterization, Learning, and Verification","summary":" We provide a systematic investigation of using physics-informed neural\nnetworks to compute Lyapunov functions. We encode Lyapunov conditions as a\npartial differential equation (PDE) and use this for training neural network\nLyapunov functions. We analyze the analytical properties of the solutions to\nthe Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov\nequation in training neural Lyapunov functions can lead to approximate regions\nof attraction close to the true domain of attraction. We also examine\napproximation errors and the convergence of neural approximations to the unique\nsolution of Zubov's equation. We then provide sufficient conditions for the\nlearned neural Lyapunov functions that can be readily verified by\nsatisfiability modulo theories (SMT) solvers, enabling formal verification of\nboth local stability analysis and region-of-attraction estimates in the large.\nThrough a number of nonlinear examples, ranging from low to high dimensions, we\ndemonstrate that the proposed framework can outperform traditional\nsums-of-squares (SOS) Lyapunov functions obtained using semidefinite\nprogramming (SDP).\n","authors":["Jun Liu","Yiming Meng","Maxwell Fitzsimmons","Ruikun Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.09131v3.pdf","comment":"The current version has been submitted for publication; corrected\n some minor typos from v2"},{"id":"http://arxiv.org/abs/2312.14078v1","updated":"2023-12-21T17:56:19Z","published":"2023-12-21T17:56:19Z","title":"Learned reconstruction methods for inverse problems: sample error\n estimates","summary":" Learning-based and data-driven techniques have recently become a subject of\nprimary interest in the field of reconstruction and regularization of inverse\nproblems. Besides the development of novel methods, yielding excellent results\nin several applications, their theoretical investigation has attracted growing\ninterest, e.g., on the topics of reliability, stability, and interpretability.\nIn this work, a general framework is described, allowing us to interpret many\nof these techniques in the context of statistical learning. This is not\nintended to provide a complete survey of existing methods, but rather to put\nthem in a working perspective, which naturally allows their theoretical\ntreatment. The main goal of this dissertation is thereby to address the\ngeneralization properties of learned reconstruction methods, and specifically\nto perform their sample error analysis. This task, well-developed in\nstatistical learning, consists in estimating the dependence of the learned\noperators with respect to the data employed for their training. A rather\ngeneral strategy is proposed, whose assumptions are met for a large class of\ninverse problems and learned methods, as depicted via a selection of examples.\n","authors":["Luca Ratti"],"pdf_url":"https://arxiv.org/pdf/2312.14078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14066v1","updated":"2023-12-21T17:46:05Z","published":"2023-12-21T17:46:05Z","title":"Upper Bounding Barlow Twins: A Novel Filter for Multi-Relational\n Clustering","summary":" Multi-relational clustering is a challenging task due to the fact that\ndiverse semantic information conveyed in multi-layer graphs is difficult to\nextract and fuse. Recent methods integrate topology structure and node\nattribute information through graph filtering. However, they often use a\nlow-pass filter without fully considering the correlation among multiple\ngraphs. To overcome this drawback, we propose to learn a graph filter motivated\nby the theoretical analysis of Barlow Twins. We find that input with a negative\nsemi-definite inner product provides a lower bound for Barlow Twins loss, which\nprevents it from reaching a better solution. We thus learn a filter that yields\nan upper bound for Barlow Twins. Afterward, we design a simple clustering\narchitecture and demonstrate its state-of-the-art performance on four benchmark\ndatasets.\n","authors":["Xiaowei Qian","Bingheng Li","Zhao Kang"],"pdf_url":"https://arxiv.org/pdf/2312.14066v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2210.02998v3","updated":"2023-12-21T17:41:55Z","published":"2022-10-06T15:38:02Z","title":"ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical\n Prior Probability Maps for Thoracic Disease Classification","summary":" Objective: Computer-aided disease diagnosis and prognosis based on medical\nimages is a rapidly emerging field. Many Convolutional Neural Network (CNN)\narchitectures have been developed by researchers for disease classification and\nlocalization from chest X-ray images. It is known that different thoracic\ndisease lesions are more likely to occur in specific anatomical regions\ncompared to others. This article aims to incorporate this disease and\nregion-dependent prior probability distribution within a deep learning\nframework. Methods: We present the ThoraX-PriorNet, a novel attention-based CNN\nmodel for thoracic disease classification. We first estimate a\ndisease-dependent spatial probability, i.e., an anatomical prior, that\nindicates the probability of occurrence of a disease in a specific region in a\nchest X-ray image. Next, we develop a novel attention-based classification\nmodel that combines information from the estimated anatomical prior and\nautomatically extracted chest region of interest (ROI) masks to provide\nattention to the feature maps generated from a deep convolution network. Unlike\nprevious works that utilize various self-attention mechanisms, the proposed\nmethod leverages the extracted chest ROI masks along with the probabilistic\nanatomical prior information, which selects the region of interest for\ndifferent diseases to provide attention. Results: The proposed method shows\nsuperior performance in disease classification on the NIH ChestX-ray14 dataset\ncompared to existing state-of-the-art methods while reaching an area under the\nROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior\nattention method shows competitive performance compared to state-of-the-art\nmethods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04\nwith an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5,\n0.6, and 0.7, respectively.\n","authors":["Md. Iqbal Hossain","Mohammad Zunaed","Md. Kawsar Ahmed","S. M. Jawwad Hossain","Anwarul Hasan","Taufiq Hasan"],"pdf_url":"https://arxiv.org/pdf/2210.02998v3.pdf","comment":"Accepted to IEEE ACCESS"},{"id":"http://arxiv.org/abs/2312.14057v1","updated":"2023-12-21T17:34:18Z","published":"2023-12-21T17:34:18Z","title":"Weighted least-squares approximation with determinantal point processes\n and generalized volume sampling","summary":" We consider the problem of approximating a function from $L^2$ by an element\nof a given $m$-dimensional space $V_m$, associated with some feature map\n$\\varphi$, using evaluations of the function at random points $x_1,\\dots,x_n$.\nAfter recalling some results on optimal weighted least-squares using\nindependent and identically distributed points, we consider weighted\nleast-squares using projection determinantal point processes (DPP) or volume\nsampling. These distributions introduce dependence between the points that\npromotes diversity in the selected features $\\varphi(x_i)$. We first provide a\ngeneralized version of volume-rescaled sampling yielding quasi-optimality\nresults in expectation with a number of samples $n = O(m\\log(m))$, that means\nthat the expected $L^2$ error is bounded by a constant times the best\napproximation error in $L^2$. Also, further assuming that the function is in\nsome normed vector space $H$ continuously embedded in $L^2$, we further prove\nthat the approximation is almost surely bounded by the best approximation error\nmeasured in the $H$-norm. This includes the cases of functions from $L^\\infty$\nor reproducing kernel Hilbert spaces. Finally, we present an alternative\nstrategy consisting in using independent repetitions of projection DPP (or\nvolume sampling), yielding similar error bounds as with i.i.d. or volume\nsampling, but in practice with a much lower number of samples. Numerical\nexperiments illustrate the performance of the different strategies.\n","authors":["Anthony Nouy","Bertrand Michel"],"pdf_url":"https://arxiv.org/pdf/2312.14057v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14050v1","updated":"2023-12-21T17:19:27Z","published":"2023-12-21T17:19:27Z","title":"Machine learning and domain decomposition methods -- a survey","summary":" Hybrid algorithms, which combine black-box machine learning methods with\nexperience from traditional numerical methods and domain expertise from diverse\napplication areas, are progressively gaining importance in scientific machine\nlearning and various industrial domains, especially in computational science\nand engineering. In the present survey, several promising avenues of research\nwill be examined which focus on the combination of machine learning (ML) and\ndomain decomposition methods (DDMs). The aim of this survey is to provide an\noverview of existing work within this field and to structure it into domain\ndecomposition for machine learning and machine learning-enhanced domain\ndecomposition, including: domain decomposition for classical machine learning,\ndomain decomposition to accelerate the training of physics-aware neural\nnetworks, machine learning to enhance the convergence properties or\ncomputational efficiency of DDMs, and machine learning as a discretization\nmethod in a DDM for the solution of PDEs. In each of these fields, we summarize\nexisting work and key advances within a common framework and, finally, disuss\nongoing challenges and opportunities for future research.\n","authors":["Axel Klawonn","Martin Lanser","Janine Weber"],"pdf_url":"https://arxiv.org/pdf/2312.14050v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14037v1","updated":"2023-12-21T17:03:26Z","published":"2023-12-21T17:03:26Z","title":"Neural Contextual Bandits for Personalized Recommendation","summary":" In the dynamic landscape of online businesses, recommender systems are\npivotal in enhancing user experiences. While traditional approaches have relied\non static supervised learning, the quest for adaptive, user-centric\nrecommendations has led to the emergence of the formulation of contextual\nbandits. This tutorial investigates the contextual bandits as a powerful\nframework for personalized recommendations. We delve into the challenges,\nadvanced algorithms and theories, collaborative strategies, and open challenges\nand future prospects within this field. Different from existing related\ntutorials, (1) we focus on the exploration perspective of contextual bandits to\nalleviate the ``Matthew Effect'' in the recommender systems, i.e., the rich get\nricher and the poor get poorer, concerning the popularity of items; (2) in\naddition to the conventional linear contextual bandits, we will also dedicated\nto neural contextual bandits which have emerged as an important branch in\nrecent years, to investigate how neural networks benefit contextual bandits for\npersonalized recommendation both empirically and theoretically; (3) we will\ncover the latest topic, collaborative neural contextual bandits, to incorporate\nboth user heterogeneity and user correlations customized for recommender\nsystem; (4) we will provide and discuss the new emerging challenges and open\nquestions for neural contextual bandits with applications in the personalized\nrecommendation, especially for large neural models.\n","authors":["Yikun Ban","Yunzhe Qi","Jingrui He"],"pdf_url":"https://arxiv.org/pdf/2312.14037v1.pdf","comment":"WWW'24 Tutorial"},{"id":"http://arxiv.org/abs/2306.09200v2","updated":"2023-12-21T16:59:44Z","published":"2023-06-15T15:35:31Z","title":"ChessGPT: Bridging Policy Learning and Language Modeling","summary":" When solving decision-making tasks, humans typically depend on information\nfrom two key sources: (1) Historical policy data, which provides interaction\nreplay from the environment, and (2) Analytical insights in natural language\nform, exposing the invaluable thought process or strategic considerations.\nDespite this, the majority of preceding research focuses on only one source:\nthey either use historical replay exclusively to directly learn policy or value\nfunctions, or engaged in language model training utilizing mere language\ncorpus. In this paper, we argue that a powerful autonomous agent should cover\nboth sources. Thus, we propose ChessGPT, a GPT model bridging policy learning\nand language modeling by integrating data from these two sources in Chess\ngames. Specifically, we build a large-scale game and language dataset related\nto chess. Leveraging the dataset, we showcase two model examples ChessCLIP and\nChessGPT, integrating policy learning and language modeling. Finally, we\npropose a full evaluation framework for evaluating language model's chess\nability. Experimental results validate our model and dataset's effectiveness.\nWe open source our code, model, and dataset at\nhttps://github.com/waterhorse1/ChessGPT.\n","authors":["Xidong Feng","Yicheng Luo","Ziyan Wang","Hongrui Tang","Mengyue Yang","Kun Shao","David Mguni","Yali Du","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2306.09200v2.pdf","comment":"Published as a conference article in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14027v1","updated":"2023-12-21T16:58:49Z","published":"2023-12-21T16:58:49Z","title":"AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based\n Optimization","summary":" Uncertainty estimation is a key issue when considering the application of\ndeep neural network methods in science and engineering. In this work, we\nintroduce a novel algorithm that quantifies epistemic uncertainty via Monte\nCarlo sampling from a tempered posterior distribution. It combines the well\nestablished Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based\noptimization using Adam and leverages a prolate proposal distribution, to\nefficiently draw from the posterior. We prove that the constructed chain admits\nthe Gibbs posterior as an invariant distribution and converges to this Gibbs\nposterior in total variation distance. Numerical evaluations are postponed to a\nfirst revision.\n","authors":["Sebastian Bieringer","Gregor Kasieczka","Maximilian F. Steffen","Mathias Trabs"],"pdf_url":"https://arxiv.org/pdf/2312.14027v1.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2312.14021v1","updated":"2023-12-21T16:53:04Z","published":"2023-12-21T16:53:04Z","title":"Leveraging Visual Supervision for Array-based Active Speaker Detection\n and Localization","summary":" Conventional audio-visual approaches for active speaker detection (ASD)\ntypically rely on visually pre-extracted face tracks and the corresponding\nsingle-channel audio to find the speaker in a video. Therefore, they tend to\nfail every time the face of the speaker is not visible. We demonstrate that a\nsimple audio convolutional recurrent neural network (CRNN) trained with spatial\ninput features extracted from multichannel audio can perform simultaneous\nhorizontal active speaker detection and localization (ASDL), independently of\nthe visual modality. To address the time and cost of generating ground truth\nlabels to train such a system, we propose a new self-supervised training\npipeline that embraces a ``student-teacher'' learning approach. A conventional\npre-trained active speaker detector is adopted as a ``teacher'' network to\nprovide the position of the speakers as pseudo-labels. The multichannel audio\n``student'' network is trained to generate the same results. At inference, the\nstudent network can generalize and locate also the occluded speakers that the\nteacher network is not able to detect visually, yielding considerable\nimprovements in recall rate. Experiments on the TragicTalkers dataset show that\nan audio network trained with the proposed self-supervised learning approach\ncan exceed the performance of the typical audio-visual methods and produce\nresults competitive with the costly conventional supervised training. We\ndemonstrate that improvements can be achieved when minimal manual supervision\nis introduced in the learning pipeline. Further gains may be sought with larger\ntraining sets and integrating vision with the multichannel audio system.\n","authors":["Davide Berghi","Philip J. B. Jackson"],"pdf_url":"https://arxiv.org/pdf/2312.14021v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14020v1","updated":"2023-12-21T16:52:41Z","published":"2023-12-21T16:52:41Z","title":"BANSpEmo: A Bangla Emotional Speech Recognition Dataset","summary":" In the field of audio and speech analysis, the ability to identify emotions\nfrom acoustic signals is essential. Human-computer interaction (HCI) and\nbehavioural analysis are only a few of the many areas where the capacity to\ndistinguish emotions from speech signals has an extensive range of\napplications. Here, we are introducing BanSpEmo, a corpus of emotional speech\nthat only consists of audio recordings and has been created specifically for\nthe Bangla language. This corpus contains 792 audio recordings over a duration\nof more than 1 hour and 23 minutes. 22 native speakers took part in the\nrecording of two sets of sentences that represent the six desired emotions. The\ndata set consists of 12 Bangla sentences which are uttered in 6 emotions as\nDisgust, Happy, Sad, Surprised, Anger, and Fear. This corpus is not also gender\nbalanced. Ten individuals who either have experience in related field or have\nacting experience took part in the assessment of this corpus. It has a balanced\nnumber of audio recordings in each emotion class. BanSpEmo can be considered as\na useful resource to promote emotion and speech recognition research and\nrelated applications in the Bangla language. The dataset can be found here:\nhttps://data.mendeley.com/datasets/rdwn4bs5ky and might be employed for\nacademic research.\n","authors":["Md Gulzar Hussain","Mahmuda Rahman","Babe Sultana","Ye Shiren"],"pdf_url":"https://arxiv.org/pdf/2312.14020v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.14367v2","updated":"2023-12-21T16:46:35Z","published":"2023-07-25T09:35:43Z","title":"Prot2Text: Multimodal Protein's Function Generation with GNNs and\n Transformers","summary":" The complex nature of big biological systems pushed some scientists to\nclassify its understanding under the inconceivable missions. Different leveled\nchallenges complicated this task, one of is the prediction of a protein's\nfunction. In recent years, significant progress has been made in this field\nthrough the development of various machine learning approaches. However, most\nexisting methods formulate the task as a multi-classification problem, i.e\nassigning predefined labels to proteins. In this work, we propose a novel\napproach, \\textbf{Prot2Text}, which predicts a protein function's in a free\ntext style, moving beyond the conventional binary or categorical\nclassifications. By combining Graph Neural Networks(GNNs) and Large Language\nModels(LLMs), in an encoder-decoder framework, our model effectively integrates\ndiverse data types including proteins' sequences, structures, and textual\nannotations. This multimodal approach allows for a holistic representation of\nproteins' functions, enabling the generation of detailed and accurate\ndescriptions. To evaluate our model, we extracted a multimodal protein dataset\nfrom SwissProt, and demonstrate empirically the effectiveness of Prot2Text.\nThese results highlight the transformative impact of multimodal models,\nspecifically the fusion of GNNs and LLMs, empowering researchers with powerful\ntools for more accurate prediction of proteins' functions. The code, the models\nand a demo will be publicly released.\n","authors":["Hadi Abdine","Michail Chatzianastasis","Costas Bouyioukos","Michalis Vazirgiannis"],"pdf_url":"https://arxiv.org/pdf/2307.14367v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14000v1","updated":"2023-12-21T16:34:03Z","published":"2023-12-21T16:34:03Z","title":"Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian\n Score Climbing","summary":" Stochastic optimal control of dynamical systems is a crucial challenge in\nsequential decision-making. Recently, control-as-inference approaches have had\nconsiderable success, providing a viable risk-sensitive framework to address\nthe exploration-exploitation dilemma. Nonetheless, a majority of these\ntechniques only invoke the inference-control duality to derive a modified risk\nobjective that is then addressed within a reinforcement learning framework.\nThis paper introduces a novel perspective by framing risk-sensitive stochastic\ncontrol as Markovian score climbing under samples drawn from a conditional\nparticle filter. Our approach, while purely inference-centric, provides\nasymptotically unbiased estimates for gradient-based policy optimization with\noptimal importance weighting and no explicit value function learning. To\nvalidate our methodology, we apply it to the task of learning neural\nnon-Gaussian feedback policies, showcasing its efficacy on numerical benchmarks\nof stochastic dynamical systems.\n","authors":["Hany Abdulsamad","Sahel Iqbal","Adrien Corenflos","Simo Särkkä"],"pdf_url":"https://arxiv.org/pdf/2312.14000v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12559v4","updated":"2023-12-21T16:24:00Z","published":"2023-09-22T01:06:16Z","title":"Invariant Learning via Probability of Sufficient and Necessary Causes","summary":" Out-of-distribution (OOD) generalization is indispensable for learning models\nin the wild, where testing distribution typically unknown and different from\nthe training. Recent methods derived from causality have shown great potential\nin achieving OOD generalization. However, existing methods mainly focus on the\ninvariance property of causes, while largely overlooking the property of\n\\textit{sufficiency} and \\textit{necessity} conditions. Namely, a necessary but\ninsufficient cause (feature) is invariant to distribution shift, yet it may not\nhave required accuracy. By contrast, a sufficient yet unnecessary cause\n(feature) tends to fit specific data well but may have a risk of adapting to a\nnew domain. To capture the information of sufficient and necessary causes, we\nemploy a classical concept, the probability of sufficiency and necessary causes\n(PNS), which indicates the probability of whether one is the necessary and\nsufficient cause. To associate PNS with OOD generalization, we propose PNS risk\nand formulate an algorithm to learn representation with a high PNS value. We\ntheoretically analyze and prove the generalizability of the PNS risk.\nExperiments on both synthetic and real-world benchmarks demonstrate the\neffectiveness of the proposed method. The details of the implementation can be\nfound at the GitHub repository: https://github.com/ymy4323460/CaSN.\n","authors":["Mengyue Yang","Zhen Fang","Yonggang Zhang","Yali Du","Furui Liu","Jean-Francois Ton","Jianhong Wang","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2309.12559v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.08638v2","updated":"2023-12-21T16:22:44Z","published":"2023-08-16T19:20:06Z","title":"Fair GANs through model rebalancing for extremely imbalanced class\n distributions","summary":" Deep generative models require large amounts of training data. This often\nposes a problem as the collection of datasets can be expensive and difficult,\nin particular datasets that are representative of the appropriate underlying\ndistribution (e.g. demographic). This introduces biases in datasets which are\nfurther propagated in the models. We present an approach to construct an\nunbiased generative adversarial network (GAN) from an existing biased GAN by\nrebalancing the model distribution. We do so by generating balanced data from\nan existing imbalanced deep generative model using an evolutionary algorithm\nand then using this data to train a balanced generative model. Additionally, we\npropose a bias mitigation loss function that minimizes the deviation of the\nlearned class distribution from being equiprobable. We show results for the\nStyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset\nfor racial fairness and see that the proposed approach improves on the fairness\nmetric by almost 5 times, whilst maintaining image quality. We further validate\nour approach by applying it to an imbalanced CIFAR10 dataset where we show that\nwe can obtain comparable fairness and image quality as when training on a\nbalanced CIFAR10 dataset which is also twice as large. Lastly, we argue that\nthe traditionally used image quality metrics such as Frechet inception distance\n(FID) are unsuitable for scenarios where the class distributions are imbalanced\nand a balanced reference set is not available.\n","authors":["Anubhav Jain","Nasir Memon","Julian Togelius"],"pdf_url":"https://arxiv.org/pdf/2308.08638v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13987v1","updated":"2023-12-21T16:20:12Z","published":"2023-12-21T16:20:12Z","title":"Modular Neural Network Policies for Learning In-Flight Object Catching\n with a Robot Hand-Arm System","summary":" We present a modular framework designed to enable a robot hand-arm system to\nlearn how to catch flying objects, a task that requires fast, reactive, and\naccurately-timed robot motions. Our framework consists of five core modules:\n(i) an object state estimator that learns object trajectory prediction, (ii) a\ncatching pose quality network that learns to score and rank object poses for\ncatching, (iii) a reaching control policy trained to move the robot hand to\npre-catch poses, (iv) a grasping control policy trained to perform soft\ncatching motions for safe and robust grasping, and (v) a gating network trained\nto synthesize the actions given by the reaching and grasping policy. The former\ntwo modules are trained via supervised learning and the latter three use deep\nreinforcement learning in a simulated environment. We conduct extensive\nevaluations of our framework in simulation for each module and the integrated\nsystem, to demonstrate high success rates of in-flight catching and robustness\nto perturbations and sensory noise. Whilst only simple cylindrical and\nspherical objects are used for training, the integrated system shows successful\ngeneralization to a variety of household objects that are not used in training.\n","authors":["Wenbin Hu","Fernando Acero","Eleftherios Triantafyllidis","Zhaocheng Liu","Zhibin Li"],"pdf_url":"https://arxiv.org/pdf/2312.13987v1.pdf","comment":"8 pages. Accepted and presented at IEEE IROS 2023"},{"id":"http://arxiv.org/abs/2312.13985v1","updated":"2023-12-21T16:18:33Z","published":"2023-12-21T16:18:33Z","title":"Rényi Pufferfish Privacy: General Additive Noise Mechanisms and\n Privacy Amplification by Iteration","summary":" Pufferfish privacy is a flexible generalization of differential privacy that\nallows to model arbitrary secrets and adversary's prior knowledge about the\ndata. Unfortunately, designing general and tractable Pufferfish mechanisms that\ndo not compromise utility is challenging. Furthermore, this framework does not\nprovide the composition guarantees needed for a direct use in iterative machine\nlearning algorithms. To mitigate these issues, we introduce a R\\'enyi\ndivergence-based variant of Pufferfish and show that it allows us to extend the\napplicability of the Pufferfish framework. We first generalize the Wasserstein\nmechanism to cover a wide range of noise distributions and introduce several\nways to improve its utility. We also derive stronger guarantees against\nout-of-distribution adversaries. Finally, as an alternative to composition, we\nprove privacy amplification results for contractive noisy iterations and\nshowcase the first use of Pufferfish in private convex optimization. A common\ningredient underlying our results is the use and extension of shift reduction\nlemmas.\n","authors":["Clément Pierquin","Aurélien Bellet","Marc Tommasi","Matthieu Boussard"],"pdf_url":"https://arxiv.org/pdf/2312.13985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13978v1","updated":"2023-12-21T16:06:44Z","published":"2023-12-21T16:06:44Z","title":"Metalearning with Very Few Samples Per Task","summary":" Metalearning and multitask learning are two frameworks for solving a group of\nrelated learning tasks more efficiently than we could hope to solve each of the\nindividual tasks on their own. In multitask learning, we are given a fixed set\nof related learning tasks and need to output one accurate model per task,\nwhereas in metalearning we are given tasks that are drawn i.i.d. from a\nmetadistribution and need to output some common information that can be easily\nspecialized to new, previously unseen tasks from the metadistribution.\n In this work, we consider a binary classification setting where tasks are\nrelated by a shared representation, that is, every task $P$ of interest can be\nsolved by a classifier of the form $f_{P} \\circ h$ where $h \\in H$ is a map\nfrom features to some representation space that is shared across tasks, and\n$f_{P} \\in F$ is a task-specific classifier from the representation space to\nlabels. The main question we ask in this work is how much data do we need to\nmetalearn a good representation? Here, the amount of data is measured in terms\nof both the number of tasks $t$ that we need to see and the number of samples\n$n$ per task. We focus on the regime where the number of samples per task is\nextremely small. Our main result shows that, in a distribution-free setting\nwhere the feature vectors are in $\\mathbb{R}^d$, the representation is a linear\nmap from $\\mathbb{R}^d \\to \\mathbb{R}^k$, and the task-specific classifiers are\nhalfspaces in $\\mathbb{R}^k$, we can metalearn a representation with error\n$\\varepsilon$ using just $n = k+2$ samples per task, and $d \\cdot\n(1/\\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is\nremarkable because metalearning would be impossible with $k+1$ samples per\ntask, and because we cannot even hope to learn an accurate task-specific\nclassifier with just $k+2$ samples per task.\n","authors":["Maryam Aliakbarpour","Konstantina Bairaktari","Gavin Brown","Adam Smith","Jonathan Ullman"],"pdf_url":"https://arxiv.org/pdf/2312.13978v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13970v1","updated":"2023-12-21T15:56:09Z","published":"2023-12-21T15:56:09Z","title":"On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and\n Efficient Gradient Methods","summary":" This paper studies the Partial Optimal Transport (POT) problem between two\nunbalanced measures with at most $n$ supports and its applications in various\nAI tasks such as color transfer or domain adaptation. There is hence the need\nfor fast approximations of POT with increasingly large problem sizes in arising\napplications. We first theoretically and experimentally investigate the\ninfeasibility of the state-of-the-art Sinkhorn algorithm for POT due to its\nincompatible rounding procedure, which consequently degrades its qualitative\nperformance in real world applications like point-cloud registration. To this\nend, we propose a novel rounding algorithm for POT, and then provide a feasible\nSinkhorn procedure with a revised computation complexity of\n$\\mathcal{\\widetilde O}(n^2/\\varepsilon^4)$. Our rounding algorithm also\npermits the development of two first-order methods to approximate the POT\nproblem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent\n(APDAGD), finds an $\\varepsilon$-approximate solution to the POT problem in\n$\\mathcal{\\widetilde O}(n^{2.5}/\\varepsilon)$, which is better in $\\varepsilon$\nthan revised Sinkhorn. The second method, Dual Extrapolation, achieves the\ncomputation complexity of $\\mathcal{\\widetilde O}(n^2/\\varepsilon)$, thereby\nbeing the best in the literature. We further demonstrate the flexibility of POT\ncompared to standard OT as well as the practicality of our algorithms on real\napplications where two marginal distributions are unbalanced.\n","authors":["Anh Duc Nguyen","Tuan Dung Nguyen","Quang Minh Nguyen","Hoang H. Nguyen","Kim-Chuan Toh"],"pdf_url":"https://arxiv.org/pdf/2312.13970v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13947v1","updated":"2023-12-21T15:36:52Z","published":"2023-12-21T15:36:52Z","title":"PhysRFANet: Physics-Guided Neural Network for Real-Time Prediction of\n Thermal Effect During Radiofrequency Ablation Treatment","summary":" Radiofrequency ablation (RFA) is a widely used minimally invasive technique\nfor ablating solid tumors. Achieving precise personalized treatment\nnecessitates feedback information on in situ thermal effects induced by the RFA\nprocedure. While computer simulation facilitates the prediction of electrical\nand thermal phenomena associated with RFA, its practical implementation in\nclinical settings is hindered by high computational demands. In this paper, we\npropose a physics-guided neural network model, named PhysRFANet, to enable\nreal-time prediction of thermal effect during RFA treatment. The networks,\ndesigned for predicting temperature distribution and the corresponding ablation\nlesion, were trained using biophysical computational models that integrated\nelectrostatics, bio-heat transfer, and cell necrosis, alongside magnetic\nresonance (MR) images of breast cancer patients. Validation of the\ncomputational model was performed through experiments on ex vivo bovine liver\ntissue. Our model demonstrated a 96% Dice score in predicting the lesion volume\nand an RMSE of 0.4854 for temperature distribution when tested with foreseen\ntumor images. Notably, even with unforeseen images, it achieved a 93% Dice\nscore for the ablation lesion and an RMSE of 0.6783 for temperature\ndistribution. All networks were capable of inferring results within 10 ms. The\npresented technique, applied to optimize the placement of the electrode for a\nspecific target region, holds significant promise in enhancing the safety and\nefficacy of RFA treatments.\n","authors":["Minwoo Shin","Minjee Seo","Seonaeng Cho","Juil Park","Joon Ho Kwon","Deukhee Lee","Kyungho Yoon"],"pdf_url":"https://arxiv.org/pdf/2312.13947v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13933v1","updated":"2023-12-21T15:28:02Z","published":"2023-12-21T15:28:02Z","title":"Structured Probabilistic Coding","summary":" This paper presents a new supervised representation learning framework,\nnamely Structured Probabilistic Coding (SPC), to learn compact and informative\nrepresentations from input related to the target task. SPC is an encoder-only\nprobabilistic coding technology with a structured regularization from the\ntarget label space. By extracting compact and informative representations from\ninput related to the target task, SPC can enhance the generalization ability of\npre-trained language models for better language understanding. Specifically,\nthe hidden representation is encoded into a Gaussian distribution space, while\nmaximizing the prior entropy of latent representations concerning label space.\nThis technique can simultaneously perform information encoding and task\nprediction in one module to more fully utilize the effective information from\ninput data, and use variational inference in the output space to reduce\nrandomness and uncertainty. To better control the probability distribution in\nthe latent space, a structured regularization is proposed to promote\nclass-level uniformity in the latent space. With the regularization term, SPC\ncan preserve the Gaussian distribution structure of latent code as well as\nbetter cover the hidden space with class uniformly. We conduct evaluations on\n12 natural language understanding tasks. The results show that our SPC can\neffectively improve the performance of pre-trained language models for various\nclassification and regression tasks. Experiments demonstrate that SPC can\nenhance the generalization capability, robustness to label noise, and\nclustering quality of output representations.\n","authors":["Dou Hu","Lingwei Wei","Yaxin Liu","Wei Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2312.13933v1.pdf","comment":"11 pages, accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13931v1","updated":"2023-12-21T15:26:26Z","published":"2023-12-21T15:26:26Z","title":"Joint Sensing and Task-Oriented Communications with Image and Wireless\n Data Modalities for Dynamic Spectrum Access","summary":" This paper introduces a deep learning approach to dynamic spectrum access,\nleveraging the synergy of multi-modal image and spectrum data for the\nidentification of potential transmitters. We consider an edge device equipped\nwith a camera that is taking images of potential objects such as vehicles that\nmay harbor transmitters. Recognizing the computational constraints and trust\nissues associated with on-device computation, we propose a collaborative system\nwherein the edge device communicates selectively processed information to a\ntrusted receiver acting as a fusion center, where a decision is made to\nidentify whether a potential transmitter is present, or not. To achieve this,\nwe employ task-oriented communications, utilizing an encoder at the transmitter\nfor joint source coding, channel coding, and modulation. This architecture\nefficiently transmits essential information of reduced dimension for object\nclassification. Simultaneously, the transmitted signals may reflect off objects\nand return to the transmitter, allowing for the collection of target sensing\ndata. Then the collected sensing data undergoes a second round of encoding at\nthe transmitter, with the reduced-dimensional information communicated back to\nthe fusion center through task-oriented communications. On the receiver side, a\ndecoder performs the task of identifying a transmitter by fusing data received\nthrough joint sensing and task-oriented communications. The two encoders at the\ntransmitter and the decoder at the receiver are jointly trained, enabling a\nseamless integration of image classification and wireless signal detection.\nUsing AWGN and Rayleigh channel models, we demonstrate the effectiveness of the\nproposed approach, showcasing high accuracy in transmitter identification\nacross diverse channel conditions while sustaining low latency in decision\nmaking.\n","authors":["Yalin E. Sagduyu","Tugba Erpek","Aylin Yener","Sennur Ulukus"],"pdf_url":"https://arxiv.org/pdf/2312.13931v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.07277v2","updated":"2023-12-21T15:26:22Z","published":"2023-09-13T19:33:26Z","title":"Limitations of Face Image Generation","summary":" Text-to-image diffusion models have achieved widespread popularity due to\ntheir unprecedented image generation capability. In particular, their ability\nto synthesize and modify human faces has spurred research into using generated\nface images in both training data augmentation and model performance\nassessments. In this paper, we study the efficacy and shortcomings of\ngenerative models in the context of face generation. Utilizing a combination of\nqualitative and quantitative measures, including embedding-based metrics and\nuser studies, we present a framework to audit the characteristics of generated\nfaces conditioned on a set of social attributes. We applied our framework on\nfaces generated through state-of-the-art text-to-image diffusion models. We\nidentify several limitations of face image generation that include faithfulness\nto the text prompt, demographic disparities, and distributional shifts.\nFurthermore, we present an analytical model that provides insights into how\ntraining data selection contributes to the performance of generative models.\n","authors":["Harrison Rosenberg","Shimaa Ahmed","Guruprasad V Ramesh","Ramya Korlakai Vinayak","Kassem Fawaz"],"pdf_url":"https://arxiv.org/pdf/2309.07277v2.pdf","comment":"Accepted to The 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI 2024)"},{"id":"http://arxiv.org/abs/2312.13927v1","updated":"2023-12-21T15:22:07Z","published":"2023-12-21T15:22:07Z","title":"On the convergence of loss and uncertainty-based active learning\n algorithms","summary":" We study convergence rates of loss and uncertainty-based active learning\nalgorithms under various assumptions. First, we provide a set of conditions\nunder which a convergence rate guarantee holds, and use this for linear\nclassifiers and linearly separable datasets to show convergence rate guarantees\nfor loss-based sampling and different loss functions. Second, we provide a\nframework that allows us to derive convergence rate bounds for loss-based\nsampling by deploying known convergence rate bounds for stochastic gradient\ndescent algorithms. Third, and last, we propose an active learning algorithm\nthat combines sampling of points and stochastic Polyak's step size. We show a\ncondition on the sampling that ensures a convergence rate guarantee for this\nalgorithm for smooth convex loss functions. Our numerical results demonstrate\nefficiency of our proposed algorithm.\n","authors":["Daniel Haimovich","Dima Karamshuk","Fridolin Linder","Niek Tax","Milan Vojnovic"],"pdf_url":"https://arxiv.org/pdf/2312.13927v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.14236v4","updated":"2023-12-21T15:17:36Z","published":"2022-11-25T16:56:42Z","title":"Strategyproof Decision-Making in Panel Data Settings and Beyond","summary":" We consider the problem of decision-making using panel data, in which a\ndecision-maker gets noisy, repeated measurements of multiple units (or agents).\nWe consider a setup where there is a pre-intervention period, when the\nprincipal observes the outcomes of each unit, after which the principal uses\nthese observations to assign a treatment to each unit. Unlike this classical\nsetting, we permit the units generating the panel data to be strategic, i.e.\nunits may modify their pre-intervention outcomes in order to receive a more\ndesirable intervention. The principal's goal is to design a strategyproof\nintervention policy, i.e. a policy that assigns units to their\nutility-maximizing interventions despite their potential strategizing. We first\nidentify a necessary and sufficient condition under which a strategyproof\nintervention policy exists, and provide a strategyproof mechanism with a simple\nclosed form when one does exist. Along the way, we prove impossibility results\nfor strategic multiclass classification, which may be of independent interest.\nWhen there are two interventions, we establish that there always exists a\nstrategyproof mechanism, and provide an algorithm for learning such a\nmechanism. For three or more interventions, we provide an algorithm for\nlearning a strategyproof mechanism if there exists a sufficiently large gap in\nthe principal's rewards between different interventions. Finally, we\nempirically evaluate our model using real-world panel data collected from\nproduct sales over 18 months. We find that our methods compare favorably to\nbaselines which do not take strategic interactions into consideration, even in\nthe presence of model misspecification.\n","authors":["Keegan Harris","Anish Agarwal","Chara Podimata","Zhiwei Steven Wu"],"pdf_url":"https://arxiv.org/pdf/2211.14236v4.pdf","comment":"In the fiftieth ACM SIGMETRICS International Conference on\n Measurement and Modeling of Computer Systems (SIGMETRICS 2024)"},{"id":"http://arxiv.org/abs/2310.19583v3","updated":"2023-12-21T15:14:22Z","published":"2023-10-30T14:41:53Z","title":"GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View\n Stereo","summary":" Traditional multi-view stereo (MVS) methods rely heavily on photometric and\ngeometric consistency constraints, but newer machine learning-based MVS methods\ncheck geometric consistency across multiple source views only as a\npost-processing step. In this paper, we present a novel approach that\nexplicitly encourages geometric consistency of reference view depth maps across\nmultiple source views at different scales during learning (see Fig. 1). We find\nthat adding this geometric consistency loss significantly accelerates learning\nby explicitly penalizing geometrically inconsistent pixels, reducing the\ntraining iteration requirements to nearly half that of other MVS methods. Our\nextensive experiments show that our approach achieves a new state-of-the-art on\nthe DTU and BlendedMVS datasets, and competitive results on the Tanks and\nTemples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt\nto enforce multi-view, multi-scale geometric consistency during learning.\n","authors":["Vibhas K. Vats","Sripad Joshi","David J. Crandall","Md. Alimoor Reza","Soon-heung Jung"],"pdf_url":"https://arxiv.org/pdf/2310.19583v3.pdf","comment":"Accepted in WACV 2024 Link:\n https://openaccess.thecvf.com/content/WACV2024/html/Vats_GC-MVSNet_Multi-View_Multi-Scale_Geometrically-Consistent_Multi-View_Stereo_WACV_2024_paper.html"},{"id":"http://arxiv.org/abs/2312.13923v1","updated":"2023-12-21T15:12:12Z","published":"2023-12-21T15:12:12Z","title":"Fed-CO$_{2}$: Cooperation of Online and Offline Models for Severe Data\n Heterogeneity in Federated Learning","summary":" Federated Learning (FL) has emerged as a promising distributed learning\nparadigm that enables multiple clients to learn a global model collaboratively\nwithout sharing their private data. However, the effectiveness of FL is highly\ndependent on the quality of the data that is being used for training. In\nparticular, data heterogeneity issues, such as label distribution skew and\nfeature skew, can significantly impact the performance of FL. Previous studies\nin FL have primarily focused on addressing label distribution skew data\nheterogeneity, while only a few recent works have made initial progress in\ntackling feature skew issues. Notably, these two forms of data heterogeneity\nhave been studied separately and have not been well explored within a unified\nFL framework. To address this gap, we propose Fed-CO$_{2}$, a universal FL\nframework that handles both label distribution skew and feature skew within a\n\\textbf{C}ooperation mechanism between the \\textbf{O}nline and \\textbf{O}ffline\nmodels. Specifically, the online model learns general knowledge that is shared\namong all clients, while the offline model is trained locally to learn the\nspecialized knowledge of each individual client. To further enhance model\ncooperation in the presence of feature shifts, we design an intra-client\nknowledge transfer mechanism that reinforces mutual learning between the online\nand offline models, and an inter-client knowledge transfer mechanism to\nincrease the models' domain generalization ability. Extensive experiments show\nthat our Fed-CO$_{2}$ outperforms a wide range of existing personalized\nfederated learning algorithms in terms of handling label distribution skew and\nfeature skew, both individually and collectively. The empirical results are\nsupported by our convergence analyses in a simplified setting.\n","authors":["Zhongyi Cai","Ye Shi","Wei Huang","Jingya Wang"],"pdf_url":"https://arxiv.org/pdf/2312.13923v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13910v1","updated":"2023-12-21T14:55:21Z","published":"2023-12-21T14:55:21Z","title":"Multi-Agent Probabilistic Ensembles with Trajectory Sampling for\n Connected Autonomous Vehicles","summary":" Autonomous Vehicles (AVs) have attracted significant attention in recent\nyears and Reinforcement Learning (RL) has shown remarkable performance in\nimproving the autonomy of vehicles. In that regard, the widely adopted\nModel-Free RL (MFRL) promises to solve decision-making tasks in connected AVs\n(CAVs), contingent on the readiness of a significant amount of data samples for\ntraining. Nevertheless, it might be infeasible in practice and possibly lead to\nlearning instability. In contrast, Model-Based RL (MBRL) manifests itself in\nsample-efficient learning, but the asymptotic performance of MBRL might lag\nbehind the state-of-the-art MFRL algorithms. Furthermore, most studies for CAVs\nare limited to the decision-making of a single AV only, thus underscoring the\nperformance due to the absence of communications. In this study, we try to\naddress the decision-making problem of multiple CAVs with limited\ncommunications and propose a decentralized Multi-Agent Probabilistic Ensembles\nwith Trajectory Sampling algorithm MA-PETS. In particular, in order to better\ncapture the uncertainty of the unknown environment, MA-PETS leverages\nProbabilistic Ensemble (PE) neural networks to learn from communicated samples\namong neighboring CAVs. Afterwards, MA-PETS capably develops Trajectory\nSampling (TS)-based model-predictive control for decision-making. On this\nbasis, we derive the multi-agent group regret bound affected by the number of\nagents within the communication range and mathematically validate that\nincorporating effective information exchange among agents into the multi-agent\nlearning scheme contributes to reducing the group regret bound in the worst\ncase. Finally, we empirically demonstrate the superiority of MA-PETS in terms\nof the sample efficiency comparable to MFBL.\n","authors":["Ruoqi Wen","Jiahao Huang","Rongpeng Li","Guoru Ding","Zhifeng Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.13910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13906v1","updated":"2023-12-21T14:51:23Z","published":"2023-12-21T14:51:23Z","title":"EfficientPPS: Part-aware Panoptic Segmentation of Transparent Objects\n for Robotic Manipulation","summary":" The use of autonomous robots for assistance tasks in hospitals has the\npotential to free up qualified staff and im-prove patient care. However, the\nubiquity of deformable and transparent objects in hospital settings poses\nsignif-icant challenges to vision-based perception systems. We present\nEfficientPPS, a neural architecture for part-aware panoptic segmentation that\nprovides robots with semantically rich visual information for grasping and\nma-nipulation tasks. We also present an unsupervised data collection and\nlabelling method to reduce the need for human involvement in the training\nprocess. EfficientPPS is evaluated on a dataset containing real-world hospital\nobjects and demonstrated to be robust and efficient in grasping transparent\ntransfusion bags with a collaborative robot arm.\n","authors":["Benjamin Alt","Minh Dang Nguyen","Andreas Hermann","Darko Katic","Rainer Jäkel","Rüdiger Dillmann","Eric Sax"],"pdf_url":"https://arxiv.org/pdf/2312.13906v1.pdf","comment":"8 pages, 8 figures, presented at the 56th International Symposium on\n Robotics (ISR Europe)"},{"id":"http://arxiv.org/abs/2312.13905v1","updated":"2023-12-21T14:51:04Z","published":"2023-12-21T14:51:04Z","title":"Domain-Specific Fine-Tuning of Large Language Models for Interactive\n Robot Programming","summary":" Industrial robots are applied in a widening range of industries, but robot\nprogramming mostly remains a task limited to programming experts. We propose a\nnatural language-based assistant for programming of advanced, industrial\nrobotic applications and investigate strategies for domain-specific fine-tuning\nof foundation models with limited data and compute.\n","authors":["Benjamin Alt","Urs Keßner","Aleksandar Taranovic","Darko Katic","Andreas Hermann","Rainer Jäkel","Gerhard Neumann"],"pdf_url":"https://arxiv.org/pdf/2312.13905v1.pdf","comment":"5 pages, 1 figure, accepted to the 2024 European Robotics Forum"},{"id":"http://arxiv.org/abs/2312.13896v1","updated":"2023-12-21T14:42:42Z","published":"2023-12-21T14:42:42Z","title":"Comparative Evaluation of Anomaly Detection Methods for Fraud Detection\n in Online Credit Card Payments","summary":" This study explores the application of anomaly detection (AD) methods in\nimbalanced learning tasks, focusing on fraud detection using real online credit\ncard payment data. We assess the performance of several recent AD methods and\ncompare their effectiveness against standard supervised learning methods.\nOffering evidence of distribution shift within our dataset, we analyze its\nimpact on the tested models' performances. Our findings reveal that LightGBM\nexhibits significantly superior performance across all evaluated metrics but\nsuffers more from distribution shifts than AD methods. Furthermore, our\ninvestigation reveals that LightGBM also captures the majority of frauds\ndetected by AD methods. This observation challenges the potential benefits of\nensemble methods to combine supervised, and AD approaches to enhance\nperformance. In summary, this research provides practical insights into the\nutility of these techniques in real-world scenarios, showing LightGBM's\nsuperiority in fraud detection while highlighting challenges related to\ndistribution shifts.\n","authors":["Hugo Thimonier","Fabrice Popineau","Arpad Rimmel","Bich-Liên Doan","Fabrice Daniel"],"pdf_url":"https://arxiv.org/pdf/2312.13896v1.pdf","comment":"Accepted at ICICT 2024"},{"id":"http://arxiv.org/abs/2310.09574v2","updated":"2023-12-21T14:38:32Z","published":"2023-10-14T12:55:43Z","title":"Reduced Policy Optimization for Continuous Control with Hard Constraints","summary":" Recent advances in constrained reinforcement learning (RL) have endowed\nreinforcement learning with certain safety guarantees. However, deploying\nexisting constrained RL algorithms in continuous control tasks with general\nhard constraints remains challenging, particularly in those situations with\nnon-convex hard constraints. Inspired by the generalized reduced gradient (GRG)\nalgorithm, a classical constrained optimization technique, we propose a reduced\npolicy optimization (RPO) algorithm that combines RL with GRG to address\ngeneral hard constraints. RPO partitions actions into basic actions and\nnonbasic actions following the GRG method and outputs the basic actions via a\npolicy network. Subsequently, RPO calculates the nonbasic actions by solving\nequations based on equality constraints using the obtained basic actions. The\npolicy network is then updated by implicitly differentiating nonbasic actions\nwith respect to basic actions. Additionally, we introduce an action projection\nprocedure based on the reduced gradient and apply a modified Lagrangian\nrelaxation technique to ensure inequality constraints are satisfied. To the\nbest of our knowledge, RPO is the first attempt that introduces GRG to RL as a\nway of efficiently handling both equality and inequality hard constraints. It\nis worth noting that there is currently a lack of RL environments with complex\nhard constraints, which motivates us to develop three new benchmarks: two\nrobotics manipulation tasks and a smart grid operation control task. With these\nbenchmarks, RPO achieves better performance than previous constrained RL\nalgorithms in terms of both cumulative reward and constraint violation. We\nbelieve RPO, along with the new benchmarks, will open up new opportunities for\napplying RL to real-world problems with complex constraints.\n","authors":["Shutong Ding","Jingya Wang","Yali Du","Ye Shi"],"pdf_url":"https://arxiv.org/pdf/2310.09574v2.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2310.09583v2","updated":"2023-12-21T14:35:29Z","published":"2023-10-14T13:28:36Z","title":"Two Sides of The Same Coin: Bridging Deep Equilibrium Models and Neural\n ODEs via Homotopy Continuation","summary":" Deep Equilibrium Models (DEQs) and Neural Ordinary Differential Equations\n(Neural ODEs) are two branches of implicit models that have achieved remarkable\nsuccess owing to their superior performance and low memory consumption. While\nboth are implicit models, DEQs and Neural ODEs are derived from different\nmathematical formulations. Inspired by homotopy continuation, we establish a\nconnection between these two models and illustrate that they are actually two\nsides of the same coin. Homotopy continuation is a classical method of solving\nnonlinear equations based on a corresponding ODE. Given this connection, we\nproposed a new implicit model called HomoODE that inherits the property of high\naccuracy from DEQs and the property of stability from Neural ODEs. Unlike DEQs,\nwhich explicitly solve an equilibrium-point-finding problem via Newton's\nmethods in the forward pass, HomoODE solves the equilibrium-point-finding\nproblem implicitly using a modified Neural ODE via homotopy continuation.\nFurther, we developed an acceleration method for HomoODE with a shared\nlearnable initial point. It is worth noting that our model also provides a\nbetter understanding of why Augmented Neural ODEs work as long as the augmented\npart is regarded as the equilibrium point to find. Comprehensive experiments\nwith several image classification tasks demonstrate that HomoODE surpasses\nexisting implicit models in terms of both accuracy and memory consumption.\n","authors":["Shutong Ding","Tianyu Cui","Jingya Wang","Ye Shi"],"pdf_url":"https://arxiv.org/pdf/2310.09583v2.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2307.06971v2","updated":"2023-12-21T14:25:52Z","published":"2023-07-13T11:57:04Z","title":"Short Boolean Formulas as Explanations in Practice","summary":" We investigate explainability via short Boolean formulas in the data model\nbased on unary relations. As an explanation of length k, we take a Boolean\nformula of length k that minimizes the error with respect to the target\nattribute to be explained. We first provide novel quantitative bounds for the\nexpected error in this scenario. We then also demonstrate how the setting works\nin practice by studying three concrete data sets. In each case, we calculate\nexplanation formulas of different lengths using an encoding in Answer Set\nProgramming. The most accurate formulas we obtain achieve errors similar to\nother methods on the same data sets. However, due to overfitting, these\nformulas are not necessarily ideal explanations, so we use cross validation to\nidentify a suitable length for explanations. By limiting to shorter formulas,\nwe obtain explanations that avoid overfitting but are still reasonably accurate\nand also, importantly, human interpretable.\n","authors":["Reijo Jaakkola","Tomi Janhunen","Antti Kuusisto","Masood Feyzbakhsh Rankooh","Miikka Vilander"],"pdf_url":"https://arxiv.org/pdf/2307.06971v2.pdf","comment":"Long version of a paper published in JELIA 2023. Changes to version\n 1: typos fixed, clarifications added"},{"id":"http://arxiv.org/abs/2312.13876v1","updated":"2023-12-21T14:20:06Z","published":"2023-12-21T14:20:06Z","title":"Capture the Flag: Uncovering Data Insights with Large Language Models","summary":" The extraction of a small number of relevant insights from vast amounts of\ndata is a crucial component of data-driven decision-making. However,\naccomplishing this task requires considerable technical skills, domain\nexpertise, and human labor. This study explores the potential of using Large\nLanguage Models (LLMs) to automate the discovery of insights in data,\nleveraging recent advances in reasoning and code generation techniques. We\npropose a new evaluation methodology based on a \"capture the flag\" principle,\nmeasuring the ability of such models to recognize meaningful and pertinent\ninformation (flags) in a dataset. We further propose two proof-of-concept\nagents, with different inner workings, and compare their ability to capture\nsuch flags in a real-world sales dataset. While the work reported here is\npreliminary, our results are sufficiently interesting to mandate future\nexploration by the community.\n","authors":["Issam Laradji","Perouz Taslakian","Sai Rajeswar","Valentina Zantedeschi","Alexandre Lacoste","Nicolas Chapados","David Vazquez","Christopher Pal","Alexandre Drouin"],"pdf_url":"https://arxiv.org/pdf/2312.13876v1.pdf","comment":"14 pages, 1 figure, Foundation Models for Decision Making Workshop at\n NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.06668v3","updated":"2023-12-21T14:18:54Z","published":"2023-08-13T02:59:36Z","title":"Foundation Models in Smart Agriculture: Basics, Opportunities, and\n Challenges","summary":" The past decade has witnessed the rapid development of ML and DL\nmethodologies in agricultural systems, showcased by great successes in variety\nof agricultural applications. However, these conventional ML/DL models have\ncertain limitations: They heavily rely on large, costly-to-acquire labeled\ndatasets for training, require specialized expertise for development and\nmaintenance, and are mostly tailored for specific tasks, thus lacking\ngeneralizability. Recently, foundation models have demonstrated remarkable\nsuccesses in language and vision tasks across various domains. These models are\ntrained on a vast amount of data from multiple domains and modalities. Once\ntrained, they can accomplish versatile tasks with just minor fine-tuning and\nminimal task-specific labeled data. Despite their proven effectiveness and huge\npotential, there has been little exploration of applying FMs to agriculture\nfields. Therefore, this study aims to explore the potential of FMs in the field\nof smart agriculture. In particular, we present conceptual tools and technical\nbackground to facilitate the understanding of the problem space and uncover new\nresearch directions in this field. To this end, we first review recent FMs in\nthe general computer science domain and categorize them into four categories:\nlanguage FMs, vision FMs, multimodal FMs, and reinforcement learning FMs.\nSubsequently, we outline the process of developing agriculture FMs and discuss\ntheir potential applications in smart agriculture. We also discuss the unique\nchallenges associated with developing AFMs, including model training,\nvalidation, and deployment. Through this study, we contribute to the\nadvancement of AI in agriculture by introducing AFMs as a promising paradigm\nthat can significantly mitigate the reliance on extensive labeled datasets and\nenhance the efficiency, effectiveness, and generalization of agricultural AI\nsystems.\n","authors":["Jiajia Li","Mingle Xu","Lirong Xiang","Dong Chen","Weichao Zhuang","Xunyuan Yin","Zhaojian Li"],"pdf_url":"https://arxiv.org/pdf/2308.06668v3.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2312.13875v1","updated":"2023-12-21T14:16:38Z","published":"2023-12-21T14:16:38Z","title":"Best Arm Identification in Batched Multi-armed Bandit Problems","summary":" Recently multi-armed bandit problem arises in many real-life scenarios where\narms must be sampled in batches, due to limited time the agent can wait for the\nfeedback. Such applications include biological experimentation and online\nmarketing. The problem is further complicated when the number of arms is large\nand the number of batches is small. We consider pure exploration in a batched\nmulti-armed bandit problem. We introduce a general linear programming framework\nthat can incorporate objectives of different theoretical settings in best arm\nidentification. The linear program leads to a two-stage algorithm that can\nachieve good theoretical properties. We demonstrate by numerical studies that\nthe algorithm also has good performance compared to certain UCB-type or\nThompson sampling methods.\n","authors":["Shengyu Cao","Simai He","Ruoqing Jiang","Jin Xu","Hongsong Yuan"],"pdf_url":"https://arxiv.org/pdf/2312.13875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.06058v2","updated":"2023-12-21T14:11:00Z","published":"2023-03-10T16:43:48Z","title":"A General Recipe for the Analysis of Randomized Multi-Armed Bandit\n Algorithms","summary":" In this paper we propose a general methodology to derive regret bounds for\nrandomized multi-armed bandit algorithms. It consists in checking a set of\nsufficient conditions on the sampling probability of each arm and on the family\nof distributions to prove a logarithmic regret. As a direct application we\nrevisit two famous bandit algorithms, Minimum Empirical Divergence (MED) and\nThompson Sampling (TS), under various models for the distributions including\nsingle parameter exponential families, Gaussian distributions, bounded\ndistributions, or distributions satisfying some conditions on their moments. In\nparticular, we prove that MED is asymptotically optimal for all these models,\nbut also provide a simple regret analysis of some TS algorithms for which the\noptimality is already known. We then further illustrate the interest of our\napproach, by analyzing a new Non-Parametric TS algorithm (h-NPTS), adapted to\nsome families of unbounded reward distributions with a bounded h-moment. This\nmodel can for instance capture some non-parametric families of distributions\nwhose variance is upper bounded by a known constant.\n","authors":["Dorian Baudry","Kazuya Suzuki","Junya Honda"],"pdf_url":"https://arxiv.org/pdf/2303.06058v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13868v1","updated":"2023-12-21T14:07:47Z","published":"2023-12-21T14:07:47Z","title":"Data-driven path collective variables","summary":" Identifying optimal collective variables to model transformations, using\natomic-scale simulations, is a long-standing challenge. We propose a new method\nfor the generation, optimization, and comparison of collective variables, which\ncan be thought of as a data-driven generalization of the path collective\nvariable concept. It consists in a kernel ridge regression of the committor\nprobability, which encodes a transformation's progress. The resulting\ncollective variable is one-dimensional, interpretable, and differentiable,\nmaking it appropriate for enhanced sampling simulations requiring biasing. We\ndemonstrate the validity of the method on two different applications: a\nprecipitation model, and the association of Li$^+$ and F$^-$ in water. For the\nformer, we show that global descriptors such as the permutation invariant\nvector allow to reach an accuracy far from the one achieved \\textit{via}\nsimpler, more intuitive variables. For the latter, we show that information\ncorrelated with the transformation mechanism is contained in the first\nsolvation shell only, and that inertial effects prevent the derivation of\noptimal collective variables from the atomic positions only.\n","authors":["Arthur France-Lanord","Hadrien Vroylandt","Mathieu Salanne","Benjamin Rotenberg","A. Marco Saitta","Fabio Pietrucci"],"pdf_url":"https://arxiv.org/pdf/2312.13868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14961v3","updated":"2023-12-21T14:02:07Z","published":"2023-05-24T09:56:20Z","title":"Deep Learning for Survival Analysis: A Review","summary":" The influx of deep learning (DL) techniques into the field of survival\nanalysis in recent years has led to substantial methodological progress; for\ninstance, learning from unstructured or high-dimensional data such as images,\ntext or omics data. In this work, we conduct a comprehensive systematic review\nof DL-based methods for time-to-event analysis, characterizing them according\nto both survival- and DL-related attributes. In summary, the reviewed methods\noften address only a small subset of tasks relevant to time-to-event data -\ne.g., single-risk right-censored data - and neglect to incorporate more complex\nsettings. Our findings are summarized in an editable, open-source, interactive\ntable: https://survival-org.github.io/DL4Survival. As this research area is\nadvancing rapidly, we encourage community contribution in order to keep this\ndatabase up to date.\n","authors":["Simon Wiegrebe","Philipp Kopper","Raphael Sonabend","Bernd Bischl","Andreas Bender"],"pdf_url":"https://arxiv.org/pdf/2305.14961v3.pdf","comment":"29 pages, 7 figures, 2 tables, 1 interactive table"},{"id":"http://arxiv.org/abs/2312.13863v1","updated":"2023-12-21T14:01:51Z","published":"2023-12-21T14:01:51Z","title":"Manipulating Trajectory Prediction with Backdoors","summary":" Autonomous vehicles ought to predict the surrounding agents' trajectories to\nallow safe maneuvers in uncertain and complex traffic situations. As companies\nincreasingly apply trajectory prediction in the real world, security becomes a\nrelevant concern. In this paper, we focus on backdoors - a security threat\nacknowledged in other fields but so far overlooked for trajectory prediction.\nTo this end, we describe and investigate four triggers that could affect\ntrajectory prediction. We then show that these triggers (for example, a braking\nvehicle), when correlated with a desired output (for example, a curve) during\ntraining, cause the desired output of a state-of-the-art trajectory prediction\nmodel. In other words, the model has good benign performance but is vulnerable\nto backdoors. This is the case even if the trigger maneuver is performed by a\nnon-casual agent behind the target vehicle. As a side-effect, our analysis\nreveals interesting limitations within trajectory prediction models. Finally,\nwe evaluate a range of defenses against backdoors. While some, like simple\noffroad checks, do not enable detection for all triggers, clustering is a\npromising candidate to support manual inspection to find backdoors.\n","authors":["Kaouther Massoud","Kathrin Grosse","Mickael Chen","Matthieu Cord","Patrick Pérez","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2312.13863v1.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.12450v2","updated":"2023-12-21T13:43:41Z","published":"2023-12-11T02:27:45Z","title":"Can It Edit? Evaluating the Ability of Large Language Models to Follow\n Code Editing Instructions","summary":" A significant amount of research is focused on developing and evaluating\nlarge language models for a variety of code synthesis tasks. These include\nsynthesizing code from natural language instructions, synthesizing tests from\ncode, and synthesizing explanations of code. In contrast, the behavior of\ninstructional code editing with LLMs is understudied. These are tasks in which\nthe model is instructed to update a block of code provided in a prompt. The\nediting instruction may ask for a feature to added or removed, describe a bug\nand ask for a fix, ask for a different kind of solution, or many other common\ncode editing tasks.\n We introduce a carefully crafted benchmark of code editing tasks and use it\nevaluate several cutting edge LLMs. Our evaluation exposes a significant gap\nbetween the capabilities of state-of-the-art open and closed models. For\nexample, even GPT-3.5-Turbo is 8.8% better than the best open model at editing\ncode.\n We also introduce a new, carefully curated, permissively licensed training\nset of code edits coupled with natural language instructions. Using this\ntraining set, we show that we can fine-tune open Code LLMs to significantly\nimprove their code editing capabilities.\n","authors":["Federico Cassano","Luisa Li","Akul Sethi","Noah Shinn","Abby Brennan-Jones","Anton Lozhkov","Carolyn Jane Anderson","Arjun Guha"],"pdf_url":"https://arxiv.org/pdf/2312.12450v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13842v1","updated":"2023-12-21T13:40:31Z","published":"2023-12-21T13:40:31Z","title":"Statistical learning theory and Occam's razor: The argument from\n empirical risk minimization","summary":" This paper considers the epistemic justification for a simplicity preference\nin inductive inference that may be obtained from the machine learning framework\nof statistical learning theory. Uniting elements from both earlier arguments\nsuggesting and rejecting such a justification, the paper spells out a qualified\nmeans-ends and model-relative justificatory argument, built on statistical\nlearning theory's central mathematical learning guarantee for the method of\nempirical risk minimization.\n","authors":["Tom F. Sterkenburg"],"pdf_url":"https://arxiv.org/pdf/2312.13842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00137v2","updated":"2023-12-21T13:40:22Z","published":"2023-11-30T19:00:50Z","title":"The Multiverse of Dynamic Mode Decomposition Algorithms","summary":" Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique\nused to decompose complex, nonlinear systems into a set of modes, revealing\nunderlying patterns and dynamics through spectral analysis. This review\npresents a comprehensive and pedagogical examination of DMD, emphasizing the\nrole of Koopman operators in transforming complex nonlinear dynamics into a\nlinear framework. A distinctive feature of this review is its focus on the\nrelationship between DMD and the spectral properties of Koopman operators, with\nparticular emphasis on the theory and practice of DMD algorithms for spectral\ncomputations. We explore the diverse \"multiverse\" of DMD methods, categorized\ninto three main areas: linear regression-based methods, Galerkin\napproximations, and structure-preserving techniques. Each category is studied\nfor its unique contributions and challenges, providing a detailed overview of\nsignificant algorithms and their applications as outlined in Table 1. We\ninclude a MATLAB package with examples and applications to enhance the\npractical understanding of these methods. This review serves as both a\npractical guide and a theoretical reference for various DMD methods, accessible\nto both experts and newcomers, and enabling readers to delve into their areas\nof interest in the expansive field of DMD.\n","authors":["Matthew J. Colbrook"],"pdf_url":"https://arxiv.org/pdf/2312.00137v2.pdf","comment":"review article, 88 pages, 28 figures,"},{"id":"http://arxiv.org/abs/2312.13839v1","updated":"2023-12-21T13:39:18Z","published":"2023-12-21T13:39:18Z","title":"Q-SENN: Quantized Self-Explaining Neural Networks","summary":" Explanations in Computer Vision are often desired, but most Deep Neural\nNetworks can only provide saliency maps with questionable faithfulness.\nSelf-Explaining Neural Networks (SENN) extract interpretable concepts with\nfidelity, diversity, and grounding to combine them linearly for\ndecision-making. While they can explain what was recognized, initial\nrealizations lack accuracy and general applicability. We propose the\nQuantized-Self-Explaining Neural Network Q-SENN. Q-SENN satisfies or exceeds\nthe desiderata of SENN while being applicable to more complex datasets and\nmaintaining most or all of the accuracy of an uninterpretable baseline model,\nout-performing previous work in all considered metrics. Q-SENN describes the\nrelationship between every class and feature as either positive, negative or\nneutral instead of an arbitrary number of possible relations, enforcing more\nbinary human-friendly features. Since every class is assigned just 5\ninterpretable features on average, Q-SENN shows convincing local and global\ninterpretability. Additionally, we propose a feature alignment method, capable\nof aligning learned features with human language-based concepts without\nadditional supervision. Thus, what is learned can be more easily verbalized.\nThe code is published: https://github.com/ThomasNorr/Q-SENN\n","authors":["Thomas Norrenbrock","Marco Rudolph","Bodo Rosenhahn"],"pdf_url":"https://arxiv.org/pdf/2312.13839v1.pdf","comment":"Accepted to AAAI 2024, SRRAI"},{"id":"http://arxiv.org/abs/2312.11562v3","updated":"2023-12-21T13:21:59Z","published":"2023-12-17T15:16:13Z","title":"A Survey of Reasoning with Foundation Models: Concepts, Methodologies,\n and Outlook","summary":" Reasoning, a crucial ability for complex problem-solving, plays a pivotal\nrole in various real-world settings such as negotiation, medical diagnosis, and\ncriminal investigation. It serves as a fundamental methodology in the field of\nArtificial General Intelligence (AGI). With the ongoing development of\nfoundation models, there is a growing interest in exploring their abilities in\nreasoning tasks. In this paper, we introduce seminal foundation models proposed\nor adaptable for reasoning, highlighting the latest advancements in various\nreasoning tasks, methods, and benchmarks. We then delve into the potential\nfuture directions behind the emergence of reasoning abilities within foundation\nmodels. We also discuss the relevance of multimodal learning, autonomous\nagents, and super alignment in the context of reasoning. By discussing these\nfuture research directions, we hope to inspire researchers in their exploration\nof this field, stimulate further advancements in reasoning with foundation\nmodels, and contribute to the development of AGI.\n","authors":["Jiankai Sun","Chuanyang Zheng","Enze Xie","Zhengying Liu","Ruihang Chu","Jianing Qiu","Jiaqi Xu","Mingyu Ding","Hongyang Li","Mengzhe Geng","Yue Wu","Wenhai Wang","Junsong Chen","Zhangyue Yin","Xiaozhe Ren","Jie Fu","Junxian He","Wu Yuan","Qi Liu","Xihui Liu","Yu Li","Hao Dong","Yu Cheng","Ming Zhang","Pheng Ann Heng","Jifeng Dai","Ping Luo","Jingdong Wang","Ji-Rong Wen","Xipeng Qiu","Yike Guo","Hui Xiong","Qun Liu","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2312.11562v3.pdf","comment":"20 Figures, 160 Pages, 750+ References, Project Page\n https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models"},{"id":"http://arxiv.org/abs/2302.03616v3","updated":"2023-12-21T13:06:12Z","published":"2023-02-07T17:21:51Z","title":"Can gamification reduce the burden of self-reporting in mHealth\n applications? A feasibility study using machine learning from smartwatch data\n to estimate cognitive load","summary":" The effectiveness of digital treatments can be measured by requiring patients\nto self-report their state through applications, however, it can be\noverwhelming and causes disengagement. We conduct a study to explore the impact\nof gamification on self-reporting. Our approach involves the creation of a\nsystem to assess cognitive load (CL) through the analysis of\nphotoplethysmography (PPG) signals. The data from 11 participants is utilized\nto train a machine learning model to detect CL. Subsequently, we create two\nversions of surveys: a gamified and a traditional one. We estimate the CL\nexperienced by other participants (13) while completing surveys. We find that\nCL detector performance can be enhanced via pre-training on stress detection\ntasks. For 10 out of 13 participants, a personalized CL detector can achieve an\nF1 score above 0.7. We find no difference between the gamified and non-gamified\nsurveys in terms of CL but participants prefer the gamified version.\n","authors":["Michal K. Grzeszczyk","Paulina Adamczyk","Sylwia Marek","Ryszard Pręcikowski","Maciej Kuś","M. Patrycja Lelujko","Rosmary Blanco","Tomasz Trzciński","Arkadiusz Sitek","Maciej Malawski","Aneta Lisowska"],"pdf_url":"https://arxiv.org/pdf/2302.03616v3.pdf","comment":"Accepted for AMIA 2023"},{"id":"http://arxiv.org/abs/2312.13807v1","updated":"2023-12-21T12:56:40Z","published":"2023-12-21T12:56:40Z","title":"Optimized classification with neural ODEs via separability","summary":" Classification of $N$ points becomes a simultaneous control problem when\nviewed through the lens of neural ordinary differential equations (neural\nODEs), which represent the time-continuous limit of residual networks. For the\nnarrow model, with one neuron per hidden layer, it has been shown that the task\ncan be achieved using $O(N)$ neurons. In this study, we focus on estimating the\nnumber of neurons required for efficient cluster-based classification,\nparticularly in the worst-case scenario where points are independently and\nuniformly distributed in $[0,1]^d$. Our analysis provides a novel method for\nquantifying the probability of requiring fewer than $O(N)$ neurons, emphasizing\nthe asymptotic behavior as both $d$ and $N$ increase. Additionally, under the\nsole assumption that the data are in general position, we propose a new\nconstructive algorithm that simultaneously classifies clusters of $d$ points\nfrom any initial configuration, effectively reducing the maximal complexity to\n$O(N/d)$ neurons.\n","authors":["Antonio Álvarez-López","Rafael Orive-Illera","Enrique Zuazua"],"pdf_url":"https://arxiv.org/pdf/2312.13807v1.pdf","comment":"26 pages, 10 figures"},{"id":"http://arxiv.org/abs/2305.15194v2","updated":"2023-12-21T12:55:57Z","published":"2023-05-24T14:31:20Z","title":"DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion\n Models","summary":" In this study, we aim to extend the capabilities of diffusion-based\ntext-to-image (T2I) generation models by incorporating diverse modalities\nbeyond textual description, such as sketch, box, color palette, and style\nembedding, within a single model. We thus design a multimodal T2I diffusion\nmodel, coined as DiffBlender, by separating the channels of conditions into\nthree types, i.e., image forms, spatial tokens, and non-spatial tokens. The\nunique architecture of DiffBlender facilitates adding new input modalities,\npioneering a scalable framework for conditional image generation. Notably, we\nachieve this without altering the parameters of the existing generative model,\nStable Diffusion, only with updating partial components. Our study establishes\nnew benchmarks in multimodal generation through quantitative and qualitative\ncomparisons with existing conditional generation methods. We demonstrate that\nDiffBlender faithfully blends all the provided information and showcase its\nvarious applications in the detailed image synthesis.\n","authors":["Sungnyun Kim","Junsoo Lee","Kibeom Hong","Daesik Kim","Namhyuk Ahn"],"pdf_url":"https://arxiv.org/pdf/2305.15194v2.pdf","comment":"Project page: https://sungnyun.github.io/diffblender/"},{"id":"http://arxiv.org/abs/2312.13795v1","updated":"2023-12-21T12:36:53Z","published":"2023-12-21T12:36:53Z","title":"Sparse Training for Federated Learning with Regularized Error Correction","summary":" Federated Learning (FL) has attracted much interest due to the significant\nadvantages it brings to training deep neural network (DNN) models. However,\nsince communications and computation resources are limited, training DNN models\nin FL systems face challenges such as elevated computational and communication\ncosts in complex tasks. Sparse training schemes gain increasing attention in\norder to scale down the dimensionality of each client (i.e., node)\ntransmission. Specifically, sparsification with error correction methods is a\npromising technique, where only important updates are sent to the parameter\nserver (PS) and the rest are accumulated locally. While error correction\nmethods have shown to achieve a significant sparsification level of the\nclient-to-PS message without harming convergence, pushing sparsity further\nremains unresolved due to the staleness effect. In this paper, we propose a\nnovel algorithm, dubbed Federated Learning with Accumulated Regularized\nEmbeddings (FLARE), to overcome this challenge. FLARE presents a novel sparse\ntraining approach via accumulated pulling of the updated models with\nregularization on the embeddings in the FL process, providing a powerful\nsolution to the staleness effect, and pushing sparsity to an exceptional level.\nThe performance of FLARE is validated through extensive experiments on diverse\nand complex models, achieving a remarkable sparsity level (10 times and more\nbeyond the current state-of-the-art) along with significantly improved\naccuracy. Additionally, an open-source software package has been developed for\nthe benefit of researchers and developers in related fields.\n","authors":["Ran Greidi","Kobi Cohen"],"pdf_url":"https://arxiv.org/pdf/2312.13795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13783v1","updated":"2023-12-21T12:14:31Z","published":"2023-12-21T12:14:31Z","title":"Few Shot Part Segmentation Reveals Compositional Logic for Industrial\n Anomaly Detection","summary":" Logical anomalies (LA) refer to data violating underlying logical constraints\ne.g., the quantity, arrangement, or composition of components within an image.\nDetecting accurately such anomalies requires models to reason about various\ncomponent types through segmentation. However, curation of pixel-level\nannotations for semantic segmentation is both time-consuming and expensive.\nAlthough there are some prior few-shot or unsupervised co-part segmentation\nalgorithms, they often fail on images with industrial object. These images have\ncomponents with similar textures and shapes, and a precise differentiation\nproves challenging. In this study, we introduce a novel component segmentation\nmodel for LA detection that leverages a few labeled samples and unlabeled\nimages sharing logical constraints. To ensure consistent segmentation across\nunlabeled images, we employ a histogram matching loss in conjunction with an\nentropy loss. As segmentation predictions play a crucial role, we propose to\nenhance both local and global sample validity detection by capturing key\naspects from visual semantics via three memory banks: class histograms,\ncomponent composition embeddings and patch-level representations. For effective\nLA detection, we propose an adaptive scaling strategy to standardize anomaly\nscores from different memory banks in inference. Extensive experiments on the\npublic benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA\ndetection vs. 89.6% from competing methods.\n","authors":["Soopil Kim","Sion An","Philip Chikontwe","Myeongkyun Kang","Ehsan Adeli","Kilian M. Pohl","Sanghyun Park"],"pdf_url":"https://arxiv.org/pdf/2312.13783v1.pdf","comment":"Accepted at AAAI2024"},{"id":"http://arxiv.org/abs/2305.05807v2","updated":"2023-12-21T11:59:11Z","published":"2023-05-09T23:40:23Z","title":"Even Small Correlation and Diversity Shifts Pose Dataset-Bias Issues","summary":" Distribution shifts are common in real-world datasets and can affect the\nperformance and reliability of deep learning models. In this paper, we study\ntwo types of distribution shifts: diversity shifts, which occur when test\nsamples exhibit patterns unseen during training, and correlation shifts, which\noccur when test data present a different correlation between seen invariant and\nspurious features. We propose an integrated protocol to analyze both types of\nshifts using datasets where they co-exist in a controllable manner. Finally, we\napply our approach to a real-world classification problem of skin cancer\nanalysis, using out-of-distribution datasets and specialized bias annotations.\nOur protocol reveals three findings: 1) Models learn and propagate correlation\nshifts even with low-bias training; this poses a risk of accumulating and\ncombining unaccountable weak biases; 2) Models learn robust features in high-\nand low-bias scenarios but use spurious ones if test samples have them; this\nsuggests that spurious correlations do not impair the learning of robust\nfeatures; 3) Diversity shift can reduce the reliance on spurious correlations;\nthis is counter intuitive since we expect biased models to depend more on\nbiases when invariant features are missing. Our work has implications for\ndistribution shift research and practice, providing new insights into how\nmodels learn and rely on spurious correlations under different types of shifts.\n","authors":["Alceu Bissoto","Catarina Barata","Eduardo Valle","Sandra Avila"],"pdf_url":"https://arxiv.org/pdf/2305.05807v2.pdf","comment":"Paper under consideration at Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2312.13772v1","updated":"2023-12-21T11:55:10Z","published":"2023-12-21T11:55:10Z","title":"On Task Performance and Model Calibration with Supervised and\n Self-Ensembled In-Context Learning","summary":" Following the standard supervised fine-tuning (SFT) paradigm, in-context\nlearning (ICL) has become an efficient approach propelled by the recent\nadvancements in large language models (LLMs), yielding promising performance\nacross various tasks in few-shot data setups. However, both paradigms are prone\nto suffer from the critical problem of overconfidence (i.e., miscalibration),\nespecially in such limited data setups. In this work, we deliver an in-depth\nanalysis of the behavior across different choices of learning methods from the\nperspective of both performance and calibration, as well as their interplay.\nThrough extensive controlled experiments, we find that simultaneous gains for\nboth task performance and calibration are difficult to achieve, and the problem\nof miscalibration exists across all learning methods in low-resource\nscenarios.To address this challenging trade-off between performance and\ncalibration, we then investigate the potential of self-ensembling techniques\napplied at different modeling stages (e.g., variations of in-context examples\nor variations in prompts or different ensembling strategies). We justify the\nfeasibility of self-ensembling on SFT in addition to ICL, to make the\npredictions more calibrated and have comparable or even better performance. Our\nwork sheds light on which learning paradigm to choose and how to enhance both\ntask performance and calibration of LLMs.\n","authors":["Chengzu Li","Han Zhou","Goran Glavaš","Anna Korhonen","Ivan Vulić"],"pdf_url":"https://arxiv.org/pdf/2312.13772v1.pdf","comment":"9 pages, 4 figures, 5 tables (20 pages, 5 figures, 13 tables\n including references and appendices)"},{"id":"http://arxiv.org/abs/2312.11779v2","updated":"2023-12-21T11:45:55Z","published":"2023-12-19T01:28:46Z","title":"Are you talking to ['xem'] or ['x', 'em']? On Tokenization and\n Addressing Misgendering in LLMs with Pronoun Tokenization Parity","summary":" A large body of NLP research has documented the ways gender biases manifest\nand amplify within large language models (LLMs), though this research has\npredominantly operated within a gender binary-centric context. A growing body\nof work has identified the harmful limitations of this gender-exclusive\nframing; many LLMs cannot correctly and consistently refer to persons outside\nthe gender binary, especially if they use neopronouns. While data scarcity has\nbeen identified as a possible culprit, the precise mechanisms through which it\ninfluences LLM misgendering remain underexplored. Our work addresses this gap\nby studying data scarcity's role in subword tokenization and, consequently, the\nformation of LLM word representations. We uncover how the Byte-Pair Encoding\n(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun\nmisgendering through out-of-vocabulary behavior. We introduce pronoun\ntokenization parity (PTP), a novel approach to reduce LLM neopronoun\nmisgendering by preserving a token's functional structure. We evaluate PTP's\nefficacy using pronoun consistency-based metrics and a novel syntax-based\nmetric. Through several controlled experiments, finetuning LLMs with PTP\nimproves neopronoun consistency from 14.5% to 58.4%, highlighting the\nsignificant role tokenization plays in LLM pronoun consistency.\n","authors":["Anaelia Ovalle","Ninareh Mehrabi","Palash Goyal","Jwala Dhamala","Kai-Wei Chang","Richard Zemel","Aram Galstyan","Rahul Gupta"],"pdf_url":"https://arxiv.org/pdf/2312.11779v2.pdf","comment":"Accepted to 2023 Neurips Queer in AI workshop"},{"id":"http://arxiv.org/abs/2312.13764v1","updated":"2023-12-21T11:43:41Z","published":"2023-12-21T11:43:41Z","title":"A Semantic Space is Worth 256 Language Descriptions: Make Stronger\n Segmentation Models with Descriptive Properties","summary":" This paper introduces ProLab, a novel approach using property-level label\nspace for creating strong interpretable segmentation models. Instead of relying\nsolely on category-specific annotations, ProLab uses descriptive properties\ngrounded in common sense knowledge for supervising segmentation models. It is\nbased on two core designs. First, we employ Large Language Models (LLMs) and\ncarefully crafted prompts to generate descriptions of all involved categories\nthat carry meaningful common sense knowledge and follow a structured format.\nSecond, we introduce a description embedding model preserving semantic\ncorrelation across descriptions and then cluster them into a set of descriptive\nproperties (e.g., 256) using K-Means. These properties are based on\ninterpretable common sense knowledge consistent with theories of human\nrecognition. We empirically show that our approach makes segmentation models\nperform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal\nContext, Cityscapes, and BDD). Our method also shows better scalability with\nextended training steps than category-level supervision. Our interpretable\nsegmentation framework also emerges with the generalization ability to segment\nout-of-domain or unknown categories using only in-domain descriptive\nproperties. Code is available at https://github.com/lambert-x/ProLab.\n","authors":["Junfei Xiao","Ziqi Zhou","Wenxuan Li","Shiyi Lan","Jieru Mei","Zhiding Yu","Alan Yuille","Yuyin Zhou","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2312.13764v1.pdf","comment":"Preprint. Code is available at https://github.com/lambert-x/ProLab"},{"id":"http://arxiv.org/abs/2312.13763v1","updated":"2023-12-21T11:41:02Z","published":"2023-12-21T11:41:02Z","title":"Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed\n Diffusion Models","summary":" Text-guided diffusion models have revolutionized image and video generation\nand have also been successfully used for optimization-based 3D object\nsynthesis. Here, we instead focus on the underexplored text-to-4D setting and\nsynthesize dynamic, animated 3D objects using score distillation methods with\nan additional temporal dimension. Compared to previous work, we pursue a novel\ncompositional generation-based approach, and combine text-to-image,\ntext-to-video, and 3D-aware multiview diffusion models to provide feedback\nduring 4D object optimization, thereby simultaneously enforcing temporal\nconsistency, high-quality visual appearance and realistic geometry. Our method,\ncalled Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with\ndeformation fields as 4D representation. Crucial to AYG is a novel method to\nregularize the distribution of the moving 3D Gaussians and thereby stabilize\nthe optimization and induce motion. We also propose a motion amplification\nmechanism as well as a new autoregressive synthesis scheme to generate and\ncombine multiple 4D sequences for longer generation. These techniques allow us\nto synthesize vivid dynamic scenes, outperform previous work qualitatively and\nquantitatively and achieve state-of-the-art text-to-4D performance. Due to the\nGaussian 4D representation, different 4D animations can be seamlessly combined,\nas we demonstrate. AYG opens up promising avenues for animation, simulation and\ndigital content creation as well as synthetic data generation.\n","authors":["Huan Ling","Seung Wook Kim","Antonio Torralba","Sanja Fidler","Karsten Kreis"],"pdf_url":"https://arxiv.org/pdf/2312.13763v1.pdf","comment":"Project page:\n https://research.nvidia.com/labs/toronto-ai/AlignYourGaussians/"},{"id":"http://arxiv.org/abs/2312.13754v1","updated":"2023-12-21T11:35:45Z","published":"2023-12-21T11:35:45Z","title":"Cross-Layer Optimization for Fault-Tolerant Deep Learning","summary":" Fault-tolerant deep learning accelerator is the basis for highly reliable\ndeep learning processing and critical to deploy deep learning in\nsafety-critical applications such as avionics and robotics. Since deep learning\nis known to be computing- and memory-intensive, traditional fault-tolerant\napproaches based on redundant computing will incur substantial overhead\nincluding power consumption and chip area. To this end, we propose to\ncharacterize deep learning vulnerability difference across both neurons and\nbits of each neuron, and leverage the vulnerability difference to enable\nselective protection of the deep learning processing components from the\nperspective of architecture layer and circuit layer respectively. At the same\ntime, we observe the correlation between model quantization and bit protection\noverhead of the underlying processing elements of deep learning accelerators,\nand propose to reduce the bit protection overhead by adding additional\nquantization constrain without compromising the model accuracy. Finally, we\nemploy Bayesian optimization strategy to co-optimize the correlated cross-layer\ndesign parameters at algorithm layer, architecture layer, and circuit layer to\nminimize the hardware resource consumption while fulfilling multiple user\nconstraints including reliability, accuracy, and performance of the deep\nlearning processing at the same time.\n","authors":["Qing Zhang","Cheng Liu","Bo Liu","Haitong Huang","Ying Wang","Huawei Li","Xiaowei Li"],"pdf_url":"https://arxiv.org/pdf/2312.13754v1.pdf","comment":"16 pages, it has been presented at CCF-DAC 2023 while CCF-DAC does\n not own the copyright"},{"id":"http://arxiv.org/abs/2308.01196v2","updated":"2023-12-21T11:27:00Z","published":"2023-07-27T22:57:55Z","title":"Sustainable Transparency in Recommender Systems: Bayesian Ranking of\n Images for Explainability","summary":" Recommender Systems have become crucial in the modern world, commonly guiding\nusers towards relevant content or products, and having a large influence over\nthe decisions of users and citizens. However, ensuring transparency and user\ntrust in these systems remains a challenge; personalized explanations have\nemerged as a solution, offering justifications for recommendations. Among the\nexisting approaches for generating personalized explanations, using existing\nvisual content created by users is a promising option to maximize transparency\nand user trust. State-of-the-art models that follow this approach, despite\nleveraging highly optimized architectures, employ surrogate learning tasks that\ndo not efficiently model the objective of ranking images as explanations for a\ngiven recommendation; this leads to a suboptimal training process with high\ncomputational costs that may not be reduced without affecting model\nperformance. This work presents BRIE, a novel model where we leverage Bayesian\nPairwise Ranking to enhance the training process, allowing us to consistently\noutperform state-of-the-art models in six real-world datasets while reducing\nits model size by up to 64 times and its CO${_2}$ emissions by up to 75% in\ntraining and inference.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Berta Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2308.01196v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13716v1","updated":"2023-12-21T10:29:17Z","published":"2023-12-21T10:29:17Z","title":"Critic-Guided Decision Transformer for Offline Reinforcement Learning","summary":" Recent advancements in offline reinforcement learning (RL) have underscored\nthe capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm\nthat learns the action distribution based on target returns for each state in a\nsupervised manner. However, prevailing RCSL methods largely focus on\ndeterministic trajectory modeling, disregarding stochastic state transitions\nand the diversity of future trajectory distributions. A fundamental challenge\narises from the inconsistency between the sampled returns within individual\ntrajectories and the expected returns across multiple trajectories.\nFortunately, value-based methods offer a solution by leveraging a value\nfunction to approximate the expected returns, thereby addressing the\ninconsistency effectively. Building upon these insights, we propose a novel\napproach, termed the Critic-Guided Decision Transformer (CGDT), which combines\nthe predictability of long-term returns from value-based methods with the\ntrajectory modeling capability of the Decision Transformer. By incorporating a\nlearned value function, known as the critic, CGDT ensures a direct alignment\nbetween the specified target returns and the expected returns of actions. This\nintegration bridges the gap between the deterministic nature of RCSL and the\nprobabilistic characteristics of value-based methods. Empirical evaluations on\nstochastic environments and D4RL benchmark datasets demonstrate the superiority\nof CGDT over traditional RCSL methods. These results highlight the potential of\nCGDT to advance the state of the art in offline RL and extend the applicability\nof RCSL to a wide range of RL tasks.\n","authors":["Yuanfu Wang","Chao Yang","Ying Wen","Yu Liu","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2312.13716v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13711v1","updated":"2023-12-21T10:23:16Z","published":"2023-12-21T10:23:16Z","title":"A Learning oriented DLP System based on Classification Model","summary":" Data is the key asset for organizations and data sharing is lifeline for\norganization growth; which may lead to data loss. Data leakage is the most\ncritical issue being faced by organizations. In order to mitigate the data\nleakage issues data leakage prevention systems (DLPSs) are deployed at various\nlevels by the organizations. DLPSs are capable to protect all kind of data i.e.\nDAR, DIM/DIT, DIU. Statistical analysis, regular expression, data\nfingerprinting are common approaches exercised in DLP system. Out of these\ntechniques; statistical analysis approach is most appropriate for proposed DLP\nmodel of data security. This paper defines a statistical DLP model for document\nclassification. Model uses various statistical approaches like TF-IDF (Term\nFrequency- Inverse Document Frequency) a renowned term count/weighing function,\nVectorization, Gradient boosting document classification etc. to classify the\ndocuments before allowing any access to it. Machine learning is used to test\nand train the model. Proposed model also introduces an extremely efficient and\nmore accurate approach; IGBCA (Improvised Gradient Boosting Classification\nAlgorithm); for document classification, to prevent them from possible data\nleakage. Results depicts that proposed model can classify documents with high\naccuracy and on basis of which data can be prevented from being loss.\n","authors":["Kishu Gupta","Ashwani Kush"],"pdf_url":"https://arxiv.org/pdf/2312.13711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07919v2","updated":"2023-12-21T10:20:42Z","published":"2023-11-14T05:34:50Z","title":"Qwen-Audio: Advancing Universal Audio Understanding via Unified\n Large-Scale Audio-Language Models","summary":" Recently, instruction-following audio-language models have received broad\nattention for audio interaction with humans. However, the absence of\npre-trained audio models capable of handling diverse audio types and tasks has\nhindered progress in this field. Consequently, most existing works have only\nbeen able to support a limited range of interaction capabilities. In this\npaper, we develop the Qwen-Audio model and address this limitation by scaling\nup audio-language pre-training to cover over 30 tasks and various audio types,\nsuch as human speech, natural sounds, music, and songs, to facilitate universal\naudio understanding abilities. However, directly co-training all tasks and\ndatasets can lead to interference issues, as the textual labels associated with\ndifferent datasets exhibit considerable variations due to differences in task\nfocus, language, granularity of annotation, and text structure. To overcome the\none-to-many interference, we carefully design a multi-task training framework\nby conditioning on a sequence of hierarchical tags to the decoder for\nencouraging knowledge sharing and avoiding interference through shared and\nspecified tags respectively. Remarkably, Qwen-Audio achieves impressive\nperformance across diverse benchmark tasks without requiring any task-specific\nfine-tuning, surpassing its counterparts. Building upon the capabilities of\nQwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from\nvarious audios and text inputs, enabling multi-turn dialogues and supporting\nvarious audio-central scenarios.\n","authors":["Yunfei Chu","Jin Xu","Xiaohuan Zhou","Qian Yang","Shiliang Zhang","Zhijie Yan","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2311.07919v2.pdf","comment":"The code, checkpoints and demo are released at\n https://github.com/QwenLM/Qwen-Audio"},{"id":"http://arxiv.org/abs/2312.13704v1","updated":"2023-12-21T10:14:27Z","published":"2023-12-21T10:14:27Z","title":"A Forecasting-Based DLP Approach for Data Security","summary":" Sensitive data leakage is the major growing problem being faced by\nenterprises in this technical era. Data leakage causes severe threats for\norganization of data safety which badly affects the reputation of\norganizations. Data leakage is the flow of sensitive data/information from any\ndata holder to an unauthorized destination. Data leak prevention (DLP) is set\nof techniques that try to alleviate the threats which may hinder data security.\nDLP unveils guilty user responsible for data leakage and ensures that user\nwithout appropriate permission cannot access sensitive data and also provides\nprotection to sensitive data if sensitive data is shared accidentally. In this\npaper, data leakage prevention (DLP) model is used to restrict/grant data\naccess permission to user, based on the forecast of their access to data. This\nstudy provides a DLP solution using data statistical analysis to forecast the\ndata access possibilities of any user in future based on the access to data in\nthe past. The proposed approach makes use of renowned simple piecewise linear\nfunction for learning/training to model. The results show that the proposed DLP\napproach with high level of precision can correctly classify between users even\nin cases of extreme data access.\n","authors":["Kishu Gupta","Ashwani Kush"],"pdf_url":"https://arxiv.org/pdf/2312.13704v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.03291v2","updated":"2023-12-21T10:08:52Z","published":"2023-09-06T18:11:09Z","title":"Ultra-fast high-dynamic range imaging of Cygnus A with the R2D2 deep\n neural network series","summary":" We present a novel AI approach for high-resolution high-dynamic range\nsynthesis imaging by radio interferometry (RI) in astronomy. R2D2, standing for\n``{R}esidual-to-{R}esidual {D}NN series for high-{D}ynamic range imaging'', is\na model-based data-driven approach relying on hybrid deep neural networks\n(DNNs) and data-consistency updates. Its reconstruction is built as a series of\nresidual images estimated as the outputs of DNNs, each taking the residual\ndirty image of the previous iteration as an input. The approach can be\ninterpreted as a learned version of a matching pursuit approach, whereby model\ncomponents are iteratively identified from residual dirty images, and of which\nCLEAN is a well-known example. We propose two variants of the R2D2 model, built\nupon two distinctive DNN architectures: a standard U-Net, and a novel unrolled\narchitecture. We demonstrate their use for monochromatic intensity imaging on\nhighly-sensitive observations of the radio galaxy Cygnus A at S band, from the\nVery Large Array (VLA). R2D2 is validated against CLEAN and the recent RI\nalgorithms AIRI and uSARA, which respectively inject a learned implicit\nregularization and an advanced handcrafted sparsity-based regularization into\nthe RI data. With only few terms in its series, the R2D2 model is able to\ndeliver high-precision imaging, superseding the resolution of CLEAN, and\nmatching the precision of AIRI and uSARA. In terms of computational efficiency,\nR2D2 runs at a fraction of the cost of AIRI and uSARA, and is also faster than\nCLEAN, opening the door to near real-time precision imaging in RI.\n","authors":["Aghabiglou A","Chu C S","Jackson A","Dabbech A","Wiaux Y"],"pdf_url":"https://arxiv.org/pdf/2309.03291v2.pdf","comment":"submitted to ApJL"},{"id":"http://arxiv.org/abs/2312.13699v1","updated":"2023-12-21T10:02:17Z","published":"2023-12-21T10:02:17Z","title":"Adapt & Align: Continual Learning with Generative Models Latent Space\n Alignment","summary":" In this work, we introduce Adapt & Align, a method for continual learning of\nneural networks by aligning latent representations in generative models. Neural\nNetworks suffer from abrupt loss in performance when retrained with additional\ntraining data from different distributions. At the same time, training with\nadditional data without access to the previous examples rarely improves the\nmodel's performance. In this work, we propose a new method that mitigates those\nproblems by employing generative models and splitting the process of their\nupdate into two parts. In the first one, we train a local generative model\nusing only data from a new task. In the second phase, we consolidate latent\nrepresentations from the local model with a global one that encodes knowledge\nof all past experiences. We introduce our approach with Variational\nAuteoncoders and Generative Adversarial Networks. Moreover, we show how we can\nuse those generative models as a general method for continual knowledge\nconsolidation that can be used in downstream tasks such as classification.\n","authors":["Kamil Deja","Bartosz Cywiński","Jan Rybarczyk","Tomasz Trzciński"],"pdf_url":"https://arxiv.org/pdf/2312.13699v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13136v2","updated":"2023-12-21T09:51:09Z","published":"2023-12-20T15:56:40Z","title":"Molecular Hypergraph Neural Networks","summary":" Graph neural networks (GNNs) have demonstrated promising performance across\nvarious chemistry-related tasks. However, conventional graphs only model the\npairwise connectivity in molecules, failing to adequately represent\nhigher-order connections like multi-center bonds and conjugated structures. To\ntackle this challenge, we introduce molecular hypergraphs and propose Molecular\nHypergraph Neural Networks (MHNN) to predict the optoelectronic properties of\norganic semiconductors, where hyperedges represent conjugated structures. A\ngeneral algorithm is designed for irregular high-order connections, which can\nefficiently operate on molecular hypergraphs with hyperedges of various orders.\nThe results show that MHNN outperforms all baseline models on most tasks of\nOPV, OCELOTv1 and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D\ngeometric information, surpassing the baseline model that utilizes atom\npositions. Moreover, MHNN achieves better performance than pretrained GNNs\nunder limited training data, underscoring its excellent data efficiency. This\nwork provides a new strategy for more general molecular representations and\nproperty prediction tasks related to high-order connections.\n","authors":["Junwu Chen","Philippe Schwaller"],"pdf_url":"https://arxiv.org/pdf/2312.13136v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07069v2","updated":"2023-12-21T09:47:19Z","published":"2023-12-12T08:43:20Z","title":"Context Matters: Data-Efficient Augmentation of Large Language Models\n for Scientific Applications","summary":" In this paper, we explore the challenges inherent to Large Language Models\n(LLMs) like GPT-4, particularly their propensity for hallucinations, logic\nmistakes, and incorrect conclusions when tasked with answering complex\nquestions. The capacity of LLMs to present erroneous answers in a coherent and\nsemantically rigorous manner further complicates the detection of factual\ninaccuracies. This issue is especially pronounced in fields that require\nspecialized expertise. Our work delves into these challenges, aiming to enhance\nthe understanding and mitigation of such errors, thereby contributing to the\nimprovement of LLM accuracy and reliability in scientific and other specialized\ndomains. Our findings reveal a non-linear relationship between the context's\nrelevancy and the answers' measured quality. In addition, we demonstrate that\nwith the correct calibration, it is possible to automate the grading procedure\n-- a finding suggesting that, at least to some degree, the LLMs can be used to\nself-examine the quality of their own performance. Finally, we describe an\nexperimental platform that can be seen as a proof-of-concept of the techniques\ndescribed in this work.\n","authors":["Xiang Li","Haoran Tang","Siyu Chen","Ziwei Wang","Anurag Maravi","Marcin Abram"],"pdf_url":"https://arxiv.org/pdf/2312.07069v2.pdf","comment":"11 pages, 6 figures, 4 tables, 3 pages of supplementary material"},{"id":"http://arxiv.org/abs/2304.10549v2","updated":"2023-12-21T09:43:50Z","published":"2023-04-19T10:17:18Z","title":"A note on the connectedness property of union-free generic sets of\n partial orders","summary":" This short note describes and proves a connectedness property which was\nintroduced in Blocher et al. [2023] in the context of data depth functions for\npartial orders. The connectedness property gives a structural insight into\nunion-free generic sets. These sets, presented in Blocher et al. [2023], are\ndefined by using a closure operator on the set of all partial orders which\nnaturally appears within the theory of formal concept analysis. In the language\nof formal concept analysis, the property of connectedness can be vividly\nproven. However, since within Blocher et al. [2023] we did not discuss formal\nconcept analysis, we outsourced the proof to this note.\n","authors":["Georg Schollmeyer","Hannah Blocher"],"pdf_url":"https://arxiv.org/pdf/2304.10549v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.07967v2","updated":"2023-12-21T09:08:34Z","published":"2023-11-14T07:46:03Z","title":"Comparison of two data fusion approaches for land use classification","summary":" Accurate land use maps, describing the territory from an anthropic\nutilisation point of view, are useful tools for land management and planning.\nTo produce them, the use of optical images alone remains limited. It is\ntherefore necessary to make use of several heterogeneous sources, each carrying\ncomplementary or contradictory information due to their imperfections or their\ndifferent specifications. This study compares two different approaches i.e. a\npre-classification and a post-classification fusion approach for combining\nseveral sources of spatial data in the context of land use classification. The\napproaches are applied on authoritative land use data located in the Gers\ndepartment in the southwest of France. Pre-classification fusion, while not\nexplicitly modeling imperfections, has the best final results, reaching an\noverall accuracy of 97% and a macro-mean F1 score of 88%.\n","authors":["Martin Cubaud","Arnaud Le Bris","Laurence Jolivet","Ana-Maria Olteanu-Raimond"],"pdf_url":"https://arxiv.org/pdf/2311.07967v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13677v1","updated":"2023-12-21T09:00:24Z","published":"2023-12-21T09:00:24Z","title":"Parallel Trust-Region Approaches in Neural Network Training: Beyond\n Traditional Methods","summary":" We propose to train neural networks (NNs) using a novel variant of the\n``Additively Preconditioned Trust-region Strategy'' (APTS). The proposed method\nis based on a parallelizable additive domain decomposition approach applied to\nthe neural network's parameters. Built upon the TR framework, the APTS method\nensures global convergence towards a minimizer. Moreover, it eliminates the\nneed for computationally expensive hyper-parameter tuning, as the TR algorithm\nautomatically determines the step size in each iteration. We demonstrate the\ncapabilities, strengths, and limitations of the proposed APTS training method\nby performing a series of numerical experiments. The presented numerical study\nincludes a comparison with widely used training methods such as SGD, Adam,\nLBFGS, and the standard TR method.\n","authors":["Ken Trotti","Samuel A. Cruz Alegría","Alena Kopaničáková","Rolf Krause"],"pdf_url":"https://arxiv.org/pdf/2312.13677v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13439v2","updated":"2023-12-21T09:00:09Z","published":"2023-09-23T17:42:13Z","title":"Finding Order in Chaos: A Novel Data Augmentation Method for Time Series\n in Contrastive Learning","summary":" The success of contrastive learning is well known to be dependent on data\naugmentation. Although the degree of data augmentations has been well\ncontrolled by utilizing pre-defined techniques in some domains like vision,\ntime-series data augmentation is less explored and remains a challenging\nproblem due to the complexity of the data generation mechanism, such as the\nintricate mechanism involved in the cardiovascular system. Moreover, there is\nno widely recognized and general time-series augmentation method that can be\napplied across different tasks. In this paper, we propose a novel data\naugmentation method for quasi-periodic time-series tasks that aims to connect\nintra-class samples together, and thereby find order in the latent space. Our\nmethod builds upon the well-known mixup technique by incorporating a novel\napproach that accounts for the periodic nature of non-stationary time-series.\nAlso, by controlling the degree of chaos created by data augmentation, our\nmethod leads to improved feature representations and performance on downstream\ntasks. We evaluate our proposed method on three time-series tasks, including\nheart rate estimation, human activity recognition, and cardiovascular disease\ndetection. Extensive experiments against state-of-the-art methods show that the\nproposed approach outperforms prior works on optimal data generation and known\ndata augmentation techniques in the three tasks, reflecting the effectiveness\nof the presented method. Source code:\nhttps://github.com/eth-siplab/Finding_Order_in_Chaos\n","authors":["Berken Utku Demirel","Christian Holz"],"pdf_url":"https://arxiv.org/pdf/2309.13439v2.pdf","comment":"Published at the Conference on Neural Information Processing Systems\n (NeurIPS) 2023"},{"id":"http://arxiv.org/abs/2312.13671v1","updated":"2023-12-21T08:50:41Z","published":"2023-12-21T08:50:41Z","title":"Text2Analysis: A Benchmark of Table Question Answering with Advanced\n Data Analysis and Unclear Queries","summary":" Tabular data analysis is crucial in various fields, and large language models\nshow promise in this area. However, current research mostly focuses on\nrudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like\nforecasting and chart generation. To address this gap, we developed the\nText2Analysis benchmark, incorporating advanced analysis tasks that go beyond\nthe SQL-compatible operations and require more in-depth analysis. We also\ndevelop five innovative and effective annotation methods, harnessing the\ncapabilities of large language models to enhance data quality and quantity.\nAdditionally, we include unclear queries that resemble real-world user\nquestions to test how well models can understand and tackle such challenges.\nFinally, we collect 2249 query-result pairs with 347 tables. We evaluate five\nstate-of-the-art models using three different metrics and the results show that\nour benchmark presents introduces considerable challenge in the field of\ntabular data analysis, paving the way for more advanced research opportunities.\n","authors":["Xinyi He","Mengyu Zhou","Xinrun Xu","Xiaojun Ma","Rui Ding","Lun Du","Yan Gao","Ran Jia","Xu Chen","Shi Han","Zejian Yuan","Dongmei Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.13671v1.pdf","comment":"Accepted by AAAI'2024"},{"id":"http://arxiv.org/abs/2306.01423v2","updated":"2023-12-21T08:39:17Z","published":"2023-06-02T10:29:33Z","title":"Improving Gradient-Trend Identification: Fast-Adaptive Moment Estimation\n with Finance-Inspired Triple Exponential Moving Average","summary":" The performance improvement of deep networks significantly depends on their\noptimizers. With existing optimizers, precise and efficient recognition of the\ngradients trend remains a challenge. Existing optimizers predominantly adopt\ntechniques based on the first-order exponential moving average (EMA), which\nresults in noticeable delays that impede the real-time tracking of gradients\ntrend and consequently yield sub-optimal performance. To overcome this\nlimitation, we introduce a novel optimizer called fast-adaptive moment\nestimation (FAME). Inspired by the triple exponential moving average (TEMA)\nused in the financial domain, FAME leverages the potency of higher-order TEMA\nto improve the precision of identifying gradient trends. TEMA plays a central\nrole in the learning process as it actively influences optimization dynamics;\nthis role differs from its conventional passive role as a technical indicator\nin financial contexts. Because of the introduction of TEMA into the\noptimization process, FAME can identify gradient trends with higher accuracy\nand fewer lag issues, thereby offering smoother and more consistent responses\nto gradient fluctuations compared to conventional first-order EMA. To study the\neffectiveness of our novel FAME optimizer, we conducted comprehensive\nexperiments encompassing six diverse computer-vision benchmarks and tasks,\nspanning detection, classification, and semantic comprehension. We integrated\nFAME into 15 learning architectures and compared its performance with those of\nsix popular optimizers. Results clearly showed that FAME is more robust and\naccurate and provides superior performance stability by minimizing noise (i.e.,\ntrend fluctuations). Notably, FAME achieves higher accuracy levels in\nremarkably fewer training epochs than its counterparts, clearly indicating its\nsignificance for optimizing deep networks in computer-vision tasks.\n","authors":["Roi Peleg","Teddy Lazebnik","Assaf Hoogi"],"pdf_url":"https://arxiv.org/pdf/2306.01423v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12815v2","updated":"2023-12-21T08:30:58Z","published":"2023-09-22T12:08:53Z","title":"Improving Generalization in Game Agents with Data Augmentation in\n Imitation Learning","summary":" Imitation learning is an effective approach for training game-playing agents\nand, consequently, for efficient game production. However, generalization - the\nability to perform well in related but unseen scenarios - is an essential\nrequirement that remains an unsolved challenge for game AI. Generalization is\ndifficult for imitation learning agents because it requires the algorithm to\ntake meaningful actions outside of the training distribution. In this paper we\npropose a solution to this challenge. Inspired by the success of data\naugmentation in supervised learning, we augment the training data so the\ndistribution of states and actions in the dataset better represents the real\nstate-action distribution. This study evaluates methods for combining and\napplying data augmentations to observations, to improve generalization of\nimitation learning agents. It also provides a performance benchmark of these\naugmentations across several 3D environments. These results demonstrate that\ndata augmentation is a promising framework for improving generalization in\nimitation learning agents.\n","authors":["Derek Yadgaroff","Alessandro Sestini","Konrad Tollmar","Ayca Ozcelikkale","Linus Gisslén"],"pdf_url":"https://arxiv.org/pdf/2309.12815v2.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.13650v1","updated":"2023-12-21T08:21:44Z","published":"2023-12-21T08:21:44Z","title":"Distributed Quantum Neural Networks via Partitioned Features Encoding","summary":" Quantum neural networks are expected to be a promising application in\nnear-term quantum computation, but face challenges such as vanishing gradients\nduring optimization and limited expressibility by a limited number of qubits\nand shallow circuits. To mitigate these challenges, distributed quantum neural\nnetworks have been proposed to make a prediction by approximating a large\ncircuit with multiple small circuits. However, the approximation of a large\ncircuit requires an exponential number of small circuit evaluations. Here, we\ninstead propose to distribute partitioned features over multiple small quantum\nneural networks and use the ensemble of their expectation values to generate\npredictions. To verify our distributed approach, we demonstrate multi-class\nclassifications of handwritten digit datasets. Especially for the MNIST\ndataset, we succeeded in ten class classifications of the dataset with\nexceeding 96% accuracy. Our proposed method not only achieved highly accurate\npredictions for a large dataset but also reduced the hardware requirements for\neach quantum neural network compared to a single quantum neural network. Our\nresults highlight distributed quantum neural networks as a promising direction\nfor practical quantum machine learning algorithms compatible with near-term\nquantum devices. We hope that our approach is useful for exploring quantum\nmachine learning applications.\n","authors":["Yoshiaki Kawase"],"pdf_url":"https://arxiv.org/pdf/2312.13650v1.pdf","comment":"9 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.13632v1","updated":"2023-12-21T07:48:54Z","published":"2023-12-21T07:48:54Z","title":"ProvFL: Client-Driven Interpretability of Global Model Predictions in\n Federated Learning","summary":" Federated Learning (FL) trains a collaborative machine learning model by\naggregating multiple privately trained clients' models over several training\nrounds. Such a long, continuous action of model aggregations poses significant\nchallenges in reasoning about the origin and composition of such a global\nmodel. Regardless of the quality of the global model or if it has a fault,\nunderstanding the model's origin is equally important for debugging,\ninterpretability, and explainability in federated learning. FL application\ndevelopers often question: (1) what clients contributed towards a global model\nand (2) if a global model predicts a label, which clients are responsible for\nit?\n We introduce, neuron provenance, a fine-grained lineage capturing mechanism\nthat tracks the flow of information between the individual participating\nclients in FL and the final global model. We operationalize this concept in\nProvFL that functions on two key principles. First, recognizing that monitoring\nevery neuron of every client's model statically is ineffective and noisy due to\nthe uninterpretable nature of individual neurons, ProvFL dynamically isolates\ninfluential and sensitive neurons in the global model, significantly reducing\nthe search space. Second, as multiple clients' models are fused in each round\nto form a global model, tracking each client's contribution becomes\nchallenging. ProvFL leverages the invertible nature of fusion algorithms to\nprecisely isolate each client's contribution derived from selected neurons.\nWhen asked to localize the clients responsible for the given behavior (i.e.,\nprediction) of the global model, ProvFL successfully localizes them with an\naverage provenance accuracy of 97%. Additionally, ProvFL outperforms the\nstate-of-the-art FL fault localization approach by an average margin of 50%.\n","authors":["Waris Gill","Ali Anwar","Muhammad Ali Gulzar"],"pdf_url":"https://arxiv.org/pdf/2312.13632v1.pdf","comment":"22 pages. For access to the source code used in this study, please\n contact the authors directly"},{"id":"http://arxiv.org/abs/2312.13630v1","updated":"2023-12-21T07:48:15Z","published":"2023-12-21T07:48:15Z","title":"MFABA: A More Faithful and Accelerated Boundary-based Attribution Method\n for Deep Neural Networks","summary":" To better understand the output of deep neural networks (DNN), attribution\nbased methods have been an important approach for model interpretability, which\nassign a score for each input dimension to indicate its importance towards the\nmodel outcome. Notably, the attribution methods use the axioms of sensitivity\nand implementation invariance to ensure the validity and reliability of\nattribution results. Yet, the existing attribution methods present challenges\nfor effective interpretation and efficient computation. In this work, we\nintroduce MFABA, an attribution algorithm that adheres to axioms, as a novel\nmethod for interpreting DNN. Additionally, we provide the theoretical proof and\nin-depth analysis for MFABA algorithm, and conduct a large scale experiment.\nThe results demonstrate its superiority by achieving over 101.5142 times faster\nspeed than the state-of-the-art attribution algorithms. The effectiveness of\nMFABA is thoroughly evaluated through the statistical analysis in comparison to\nother methods, and the full implementation package is open-source at:\nhttps://github.com/LMBTough/MFABA\n","authors":["Zhiyu Zhu","Huaming Chen","Jiayu Zhang","Xinyi Wang","Zhibo Jin","Minhui Xue","Dongxiao Zhu","Kim-Kwang Raymond Choo"],"pdf_url":"https://arxiv.org/pdf/2312.13630v1.pdf","comment":"Accepted by The 38th Annual AAAI Conference on Artificial\n Intelligence (AAAI-24)"},{"id":"http://arxiv.org/abs/2312.11460v2","updated":"2023-12-21T07:46:20Z","published":"2023-12-18T18:59:06Z","title":"Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged\n Locomotion","summary":" Robust locomotion control depends on accurate state estimations. However, the\nsensors of most legged robots can only provide partial and noisy observations,\nmaking the estimation particularly challenging, especially for external states\nlike terrain frictions and elevation maps. Inspired by the classical Internal\nModel Control principle, we consider these external states as disturbances and\nintroduce Hybrid Internal Model (HIM) to estimate them according to the\nresponse of the robot. The response, which we refer to as the hybrid internal\nembedding, contains the robot's explicit velocity and implicit stability\nrepresentation, corresponding to two primary goals for locomotion tasks:\nexplicitly tracking velocity and implicitly maintaining stability. We use\ncontrastive learning to optimize the embedding to be close to the robot's\nsuccessor state, in which the response is naturally embedded. HIM has several\nappealing benefits: It only needs the robot's proprioceptions, i.e., those from\njoint encoders and IMU as observations. It innovatively maintains consistent\nobservations between simulation reference and reality that avoids information\nloss in mimicking learning. It exploits batch-level information that is more\nrobust to noises and keeps better sample efficiency. It only requires 1 hour of\ntraining on an RTX 4090 to enable a quadruped robot to traverse any terrain\nunder any disturbances. A wealth of real-world experiments demonstrates its\nagility, even in high-difficulty tasks and cases never occurred during the\ntraining process, revealing remarkable open-world generalizability.\n","authors":["Junfeng Long","Zirui Wang","Quanyi Li","Jiawei Gao","Liu Cao","Jiangmiao Pang"],"pdf_url":"https://arxiv.org/pdf/2312.11460v2.pdf","comment":"Use 1 hour to train a quadruped robot capable of traversing any\n terrain under any disturbances in the open world, Project Page:\n https://github.com/OpenRobotLab/HIMLoco"},{"id":"http://arxiv.org/abs/2312.13628v1","updated":"2023-12-21T07:38:59Z","published":"2023-12-21T07:38:59Z","title":"Where and How to Attack? A Causality-Inspired Recipe for Generating\n Counterfactual Adversarial Examples","summary":" Deep neural networks (DNNs) have been demonstrated to be vulnerable to\nwell-crafted \\emph{adversarial examples}, which are generated through either\nwell-conceived $\\mathcal{L}_p$-norm restricted or unrestricted attacks.\nNevertheless, the majority of those approaches assume that adversaries can\nmodify any features as they wish, and neglect the causal generating process of\nthe data, which is unreasonable and unpractical. For instance, a modification\nin income would inevitably impact features like the debt-to-income ratio within\na banking system. By considering the underappreciated causal generating\nprocess, first, we pinpoint the source of the vulnerability of DNNs via the\nlens of causality, then give theoretical results to answer \\emph{where to\nattack}. Second, considering the consequences of the attack interventions on\nthe current state of the examples to generate more realistic adversarial\nexamples, we propose CADE, a framework that can generate\n\\textbf{C}ounterfactual \\textbf{AD}versarial \\textbf{E}xamples to answer\n\\emph{how to attack}. The empirical results demonstrate CADE's effectiveness,\nas evidenced by its competitive performance across diverse attack scenarios,\nincluding white-box, transfer-based, and random intervention attacks.\n","authors":["Ruichu Cai","Yuxuan Zhu","Jie Qiao","Zefeng Liang","Furui Liu","Zhifeng Hao"],"pdf_url":"https://arxiv.org/pdf/2312.13628v1.pdf","comment":"Accepted by AAAI-2024"},{"id":"http://arxiv.org/abs/2312.13616v1","updated":"2023-12-21T07:05:21Z","published":"2023-12-21T07:05:21Z","title":"Navigating the Structured What-If Spaces: Counterfactual Generation via\n Structured Diffusion","summary":" Generating counterfactual explanations is one of the most effective\napproaches for uncovering the inner workings of black-box neural network models\nand building user trust. While remarkable strides have been made in generative\nmodeling using diffusion models in domains like vision, their utility in\ngenerating counterfactual explanations in structured modalities remains\nunexplored. In this paper, we introduce Structured Counterfactual Diffuser or\nSCD, the first plug-and-play framework leveraging diffusion for generating\ncounterfactual explanations in structured data. SCD learns the underlying data\ndistribution via a diffusion model which is then guided at test time to\ngenerate counterfactuals for any arbitrary black-box model, input, and desired\nprediction. Our experiments show that our counterfactuals not only exhibit high\nplausibility compared to the existing state-of-the-art but also show\nsignificantly better proximity and diversity.\n","authors":["Nishtha Madaan","Srikanta Bedathur"],"pdf_url":"https://arxiv.org/pdf/2312.13616v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2312.13614v1","updated":"2023-12-21T07:03:15Z","published":"2023-12-21T07:03:15Z","title":"Structure-Aware Path Inference for Neural Finite State Transducers","summary":" Neural finite-state transducers (NFSTs) form an expressive family of\nneurosymbolic sequence transduction models. An NFST models each string pair as\nhaving been generated by a latent path in a finite-state transducer. As they\nare deep generative models, both training and inference of NFSTs require\ninference networks that approximate posterior distributions over such latent\nvariables. In this paper, we focus on the resulting challenge of imputing the\nlatent alignment path that explains a given pair of input and output strings\n(e.g., during training). We train three autoregressive approximate models for\namortized inference of the path, which can then be used as proposal\ndistributions for importance sampling. All three models perform lookahead. Our\nmost sophisticated (and novel) model leverages the FST structure to consider\nthe graph of future paths; unfortunately, we find that it loses out to the\nsimpler approaches -- except on an artificial task that we concocted to confuse\nthe simpler approaches.\n","authors":["Weiting Tan","Chu-cheng Lin","Jason Eisner"],"pdf_url":"https://arxiv.org/pdf/2312.13614v1.pdf","comment":"In Proceedings of ICBINB Workshop at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.13611v1","updated":"2023-12-21T07:01:18Z","published":"2023-12-21T07:01:18Z","title":"Topology Learning for Heterogeneous Decentralized Federated Learning\n over Unreliable D2D Networks","summary":" With the proliferation of intelligent mobile devices in wireless\ndevice-to-device (D2D) networks, decentralized federated learning (DFL) has\nattracted significant interest. Compared to centralized federated learning\n(CFL), DFL mitigates the risk of central server failures due to communication\nbottlenecks. However, DFL faces several challenges, such as the severe\nheterogeneity of data distributions in diverse environments, and the\ntransmission outages and package errors caused by the adoption of the User\nDatagram Protocol (UDP) in D2D networks. These challenges often degrade the\nconvergence of training DFL models. To address these challenges, we conduct a\nthorough theoretical convergence analysis for DFL and derive a convergence\nbound. By defining a novel quantity named unreliable links-aware neighborhood\ndiscrepancy in this convergence bound, we formulate a tractable optimization\nobjective, and develop a novel Topology Learning method considering the\nRepresentation Discrepancy and Unreliable Links in DFL, named ToLRDUL.\nIntensive experiments under both feature skew and label skew settings have\nvalidated the effectiveness of our proposed method, demonstrating improved\nconvergence speed and test accuracy, consistent with our theoretical findings.\n","authors":["Zheshun Wu","Zenglin Xu","Dun Zeng","Junfan Li","Jie Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.02843v2","updated":"2023-12-21T06:43:36Z","published":"2022-11-05T07:55:55Z","title":"Unleashing the Power of Graph Data Augmentation on Covariate\n Distribution Shift","summary":" The issue of distribution shifts is emerging as a critical concern in graph\nrepresentation learning. From the perspective of invariant learning and stable\nlearning, a recently well-established paradigm for out-of-distribution\ngeneralization, stable features of the graph are assumed to causally determine\nlabels, while environmental features tend to be unstable and can lead to the\ntwo primary types of distribution shifts. The correlation shift is often caused\nby the spurious correlation between environmental features and labels that\ndiffers between the training and test data; the covariate shift often stems\nfrom the presence of new environmental features in test data. However, most\nstrategies, such as invariant learning or graph augmentation, typically\nstruggle with limited training environments or perturbed stable features, thus\nexposing limitations in handling the problem of covariate shift. To address\nthis challenge, we propose a simple-yet-effective data augmentation strategy,\nAdversarial Invariant Augmentation (AIA), to handle the covariate shift on\ngraphs. Specifically, given the training data, AIA aims to extrapolate and\ngenerate new environments, while concurrently preserving the original stable\nfeatures during the augmentation process. Such a design equips the graph\nclassification model with an enhanced capability to identify stable features in\nnew environments, thereby effectively tackling the covariate shift in data.\nExtensive experiments with in-depth empirical analysis demonstrate the\nsuperiority of our approach. The implementation codes are publicly available at\nhttps://github.com/yongduosui/AIA.\n","authors":["Yongduo Sui","Qitian Wu","Jiancan Wu","Qing Cui","Longfei Li","Jun Zhou","Xiang Wang","Xiangnan He"],"pdf_url":"https://arxiv.org/pdf/2211.02843v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12863v2","updated":"2023-12-21T06:30:46Z","published":"2023-12-20T09:27:09Z","title":"Federated Learning While Providing Model as a Service: Joint Training\n and Inference Optimization","summary":" While providing machine learning model as a service to process users'\ninference requests, online applications can periodically upgrade the model\nutilizing newly collected data. Federated learning (FL) is beneficial for\nenabling the training of models across distributed clients while keeping the\ndata locally. However, existing work has overlooked the coexistence of model\ntraining and inference under clients' limited resources. This paper focuses on\nthe joint optimization of model training and inference to maximize inference\nperformance at clients. Such an optimization faces several challenges. The\nfirst challenge is to characterize the clients' inference performance when\nclients may partially participate in FL. To resolve this challenge, we\nintroduce a new notion of age of model (AoM) to quantify client-side model\nfreshness, based on which we use FL's global model convergence error as an\napproximate measure of inference performance. The second challenge is the tight\ncoupling among clients' decisions, including participation probability in FL,\nmodel download probability, and service rates. Toward the challenges, we\npropose an online problem approximation to reduce the problem complexity and\noptimize the resources to balance the needs of model training and inference.\nExperimental results demonstrate that the proposed algorithm improves the\naverage inference accuracy by up to 12%.\n","authors":["Pengchao Han","Shiqiang Wang","Yang Jiao","Jianwei Huang"],"pdf_url":"https://arxiv.org/pdf/2312.12863v2.pdf","comment":"Accepted by IEEE International Conference on Computer Communications\n (INFOCOM) 2024"},{"id":"http://arxiv.org/abs/2312.13602v1","updated":"2023-12-21T06:28:02Z","published":"2023-12-21T06:28:02Z","title":"Peer-to-Peer Learning + Consensus with Non-IID Data","summary":" Peer-to-peer deep learning algorithms are enabling distributed edge devices\nto collaboratively train deep neural networks without exchanging raw training\ndata or relying on a central server. Peer-to-Peer Learning (P2PL) and other\nalgorithms based on Distributed Local-Update Stochastic/mini-batch Gradient\nDescent (local DSGD) rely on interleaving epochs of training with distributed\nconsensus steps. This process leads to model parameter drift/divergence amongst\nparticipating devices in both IID and non-IID settings. We observe that model\ndrift results in significant oscillations in test performance evaluated after\nlocal training and consensus phases. We then identify factors that amplify\nperformance oscillations and demonstrate that our novel approach, P2PL with\nAffinity, dampens test performance oscillations in non-IID settings without\nincurring any additional communication cost.\n","authors":["Srinivasa Pranav","José M. F. Moura"],"pdf_url":"https://arxiv.org/pdf/2312.13602v1.pdf","comment":"Asilomar Conference on Signals, Systems, and Computers 2023\n Camera-Ready Version"},{"id":"http://arxiv.org/abs/2303.17564v3","updated":"2023-12-21T06:21:11Z","published":"2023-03-30T17:30:36Z","title":"BloombergGPT: A Large Language Model for Finance","summary":" The use of NLP in the realm of financial technology is broad and complex,\nwith applications ranging from sentiment analysis and named entity recognition\nto question answering. Large Language Models (LLMs) have been shown to be\neffective on a variety of tasks; however, no LLM specialized for the financial\ndomain has been reported in literature. In this work, we present BloombergGPT,\na 50 billion parameter language model that is trained on a wide range of\nfinancial data. We construct a 363 billion token dataset based on Bloomberg's\nextensive data sources, perhaps the largest domain-specific dataset yet,\naugmented with 345 billion tokens from general purpose datasets. We validate\nBloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite\nof internal benchmarks that most accurately reflect our intended usage. Our\nmixed dataset training leads to a model that outperforms existing models on\nfinancial tasks by significant margins without sacrificing performance on\ngeneral LLM benchmarks. Additionally, we explain our modeling choices, training\nprocess, and evaluation methodology. We release Training Chronicles (Appendix\nC) detailing our experience in training BloombergGPT.\n","authors":["Shijie Wu","Ozan Irsoy","Steven Lu","Vadim Dabravolski","Mark Dredze","Sebastian Gehrmann","Prabhanjan Kambadur","David Rosenberg","Gideon Mann"],"pdf_url":"https://arxiv.org/pdf/2303.17564v3.pdf","comment":"Updated to include Training Chronicles (Appendix C)"},{"id":"http://arxiv.org/abs/2312.13596v1","updated":"2023-12-21T06:02:25Z","published":"2023-12-21T06:02:25Z","title":"Anchoring Path for Inductive Relation Prediction in Knowledge Graphs","summary":" Aiming to accurately predict missing edges representing relations between\nentities, which are pervasive in real-world Knowledge Graphs (KGs), relation\nprediction plays a critical role in enhancing the comprehensiveness and utility\nof KGs. Recent research focuses on path-based methods due to their inductive\nand explainable properties. However, these methods face a great challenge when\nlots of reasoning paths do not form Closed Paths (CPs) in the KG. To address\nthis challenge, we propose Anchoring Path Sentence Transformer (APST) by\nintroducing Anchoring Paths (APs) to alleviate the reliance of CPs.\nSpecifically, we develop a search-based description retrieval method to enrich\nentity descriptions and an assessment mechanism to evaluate the rationality of\nAPs. APST takes both APs and CPs as the inputs of a unified Sentence\nTransformer architecture, enabling comprehensive predictions and high-quality\nexplanations. We evaluate APST on three public datasets and achieve\nstate-of-the-art (SOTA) performance in 30 of 36 transductive, inductive, and\nfew-shot experimental settings.\n","authors":["Zhixiang Su","Di Wang","Chunyan Miao","Lizhen Cui"],"pdf_url":"https://arxiv.org/pdf/2312.13596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.07864v3","updated":"2023-12-21T05:45:52Z","published":"2022-11-15T03:10:05Z","title":"Federated Adaptive Prompt Tuning for Multi-domain Collaborative Learning","summary":" Federated learning (FL) enables multiple clients to collaboratively train a\nglobal model without disclosing their data. Previous researches often require\ntraining the complete model parameters. However, the emergence of powerful\npre-trained models makes it possible to achieve higher performance with fewer\nlearnable parameters in FL. In this paper, we propose a federated adaptive\nprompt tuning algorithm, FedAPT, for multi-domain collaborative image\nclassification with powerful foundation models, like CLIP. Compared with direct\nfederated prompt tuning, our core idea is to adaptively unlock specific domain\nknowledge for each test sample in order to provide them with personalized\nprompts. To implement this idea, we design an adaptive prompt tuning module,\nwhich consists of a meta prompt, an adaptive network, and some keys. The server\nrandomly generates a set of keys and assigns a unique key to each client. Then\nall clients cooperatively train the global adaptive network and meta prompt\nwith the local datasets and the frozen keys. Ultimately, the global aggregation\nmodel can assign a personalized prompt to CLIP based on the domain features of\neach test sample. We perform extensive experiments on two multi-domain image\nclassification datasets across two different settings -- supervised and\nunsupervised. The results show that FedAPT can achieve better performance with\nless than 10\\% of the number of parameters of the fully trained model, and the\nglobal model can perform well in diverse client domains simultaneously.\n","authors":["Shangchao Su","Mingzhao Yang","Bin Li","Xiangyang Xue"],"pdf_url":"https://arxiv.org/pdf/2211.07864v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13584v1","updated":"2023-12-21T05:27:16Z","published":"2023-12-21T05:27:16Z","title":"Wave Physics-informed Matrix Factorizations","summary":" With the recent success of representation learning methods, which includes\ndeep learning as a special case, there has been considerable interest in\ndeveloping techniques that incorporate known physical constraints into the\nlearned representation. As one example, in many applications that involve a\nsignal propagating through physical media (e.g., optics, acoustics, fluid\ndynamics, etc), it is known that the dynamics of the signal must satisfy\nconstraints imposed by the wave equation. Here we propose a matrix\nfactorization technique that decomposes such signals into a sum of components,\nwhere each component is regularized to ensure that it {nearly} satisfies wave\nequation constraints. Although our proposed formulation is non-convex, we prove\nthat our model can be efficiently solved to global optimality. Through this\nline of work we establish theoretical connections between wave-informed\nlearning and filtering theory in signal processing. We further demonstrate the\napplication of this work on modal analysis problems commonly arising in\nstructural diagnostics and prognostics.\n","authors":["Harsha Vardhan Tetali","Joel B. Harley","Benjamin D. Haeffele"],"pdf_url":"https://arxiv.org/pdf/2312.13584v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2107.09144"},{"id":"http://arxiv.org/abs/2312.13583v1","updated":"2023-12-21T05:17:10Z","published":"2023-12-21T05:17:10Z","title":"Fine-tuning Graph Neural Networks by Preserving Graph Generative\n Patterns","summary":" Recently, the paradigm of pre-training and fine-tuning graph neural networks\nhas been intensively studied and applied in a wide range of graph mining tasks.\nIts success is generally attributed to the structural consistency between\npre-training and downstream datasets, which, however, does not hold in many\nreal-world scenarios. Existing works have shown that the structural divergence\nbetween pre-training and downstream graphs significantly limits the\ntransferability when using the vanilla fine-tuning strategy. This divergence\nleads to model overfitting on pre-training graphs and causes difficulties in\ncapturing the structural properties of the downstream graphs. In this paper, we\nidentify the fundamental cause of structural divergence as the discrepancy of\ngenerative patterns between the pre-training and downstream graphs.\nFurthermore, we propose G-Tuning to preserve the generative patterns of\ndownstream graphs. Given a downstream graph G, the core idea is to tune the\npre-trained GNN so that it can reconstruct the generative patterns of G, the\ngraphon W. However, the exact reconstruction of a graphon is known to be\ncomputationally expensive. To overcome this challenge, we provide a theoretical\nanalysis that establishes the existence of a set of alternative graphons called\ngraphon bases for any given graphon. By utilizing a linear combination of these\ngraphon bases, we can efficiently approximate W. This theoretical finding forms\nthe basis of our proposed model, as it enables effective learning of the\ngraphon bases and their associated coefficients. Compared with existing\nalgorithms, G-Tuning demonstrates an average improvement of 0.5% and 2.6% on\nin-domain and out-of-domain transfer learning experiments, respectively.\n","authors":["Yifei Sun","Qi Zhu","Yang Yang","Chunping Wang","Tianyu Fan","Jiajun Zhu","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2312.13583v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.13575v1","updated":"2023-12-21T04:48:34Z","published":"2023-12-21T04:48:34Z","title":"ARBiBench: Benchmarking Adversarial Robustness of Binarized Neural\n Networks","summary":" Network binarization exhibits great potential for deployment on\nresource-constrained devices due to its low computational cost. Despite the\ncritical importance, the security of binarized neural networks (BNNs) is rarely\ninvestigated. In this paper, we present ARBiBench, a comprehensive benchmark to\nevaluate the robustness of BNNs against adversarial perturbations on CIFAR-10\nand ImageNet. We first evaluate the robustness of seven influential BNNs on\nvarious white-box and black-box attacks. The results reveal that 1) The\nadversarial robustness of BNNs exhibits a completely opposite performance on\nthe two datasets under white-box attacks. 2) BNNs consistently exhibit better\nadversarial robustness under black-box attacks. 3) Different BNNs exhibit\ncertain similarities in their robustness performance. Then, we conduct\nexperiments to analyze the adversarial robustness of BNNs based on these\ninsights. Our research contributes to inspiring future research on enhancing\nthe robustness of BNNs and advancing their application in real-world scenarios.\n","authors":["Peng Zhao","Jiehua Zhang","Bowen Peng","Longguang Wang","YingMei Wei","Yu Liu","Li Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13575v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.04273v2","updated":"2023-12-21T04:39:59Z","published":"2023-04-09T16:35:31Z","title":"Multimodal Brain-Computer Interface for In-Vehicle Driver Cognitive Load\n Measurement: Dataset and Baselines","summary":" Through this paper, we introduce a novel driver cognitive load assessment\ndataset, CL-Drive, which contains Electroencephalogram (EEG) signals along with\nother physiological signals such as Electrocardiography (ECG) and Electrodermal\nActivity (EDA) as well as eye tracking data. The data was collected from 21\nsubjects while driving in an immersive vehicle simulator, in various driving\nconditions, to induce different levels of cognitive load in the subjects. The\ntasks consisted of 9 complexity levels for 3 minutes each. Each driver reported\ntheir subjective cognitive load every 10 seconds throughout the experiment. The\ndataset contains the subjective cognitive load recorded as ground truth. In\nthis paper, we also provide benchmark classification results for different\nmachine learning and deep learning models for both binary and ternary label\ndistributions. We followed 2 evaluation criteria namely 10-fold and\nleave-one-subject-out (LOSO). We have trained our models on both hand-crafted\nfeatures as well as on raw data.\n","authors":["Prithila Angkan","Behnam Behinaein","Zunayed Mahmud","Anubhav Bhatti","Dirk Rodenburg","Paul Hungler","Ali Etemad"],"pdf_url":"https://arxiv.org/pdf/2304.04273v2.pdf","comment":"16 pages, 9 figures, 11 tables. This work has been accepted to the\n IEEE Transactions on Intelligent Transportation Systems. \\c{opyright} 2023\n IEEE. Personal use of this material is permitted. Permission from IEEE must\n be obtained for all other uses"},{"id":"http://arxiv.org/abs/2312.12655v2","updated":"2023-12-21T04:29:24Z","published":"2023-12-19T22:57:13Z","title":"Can Transformers Learn Sequential Function Classes In Context?","summary":" In-context learning (ICL) has revolutionized the capabilities of transformer\nmodels in NLP. In our project, we extend the understanding of the mechanisms\nunderpinning ICL by exploring whether transformers can learn from sequential,\nnon-textual function class data distributions. We introduce a novel sliding\nwindow sequential function class and employ toy-sized transformers with a GPT-2\narchitecture to conduct our experiments. Our analysis indicates that these\nmodels can indeed leverage ICL when trained on non-textual sequential function\nclasses. Additionally, our experiments with randomized y-label sequences\nhighlights that transformers retain some ICL capabilities even when the label\nassociations are obfuscated. We provide evidence that transformers can reason\nwith and understand sequentiality encoded within function classes, as reflected\nby the effective learning of our proposed tasks. Our results also show that the\nperformance deteriorated with increasing randomness in the labels, though not\nto the extent one might expect, implying a potential robustness of learned\nsequentiality against label noise. Future research may want to look into how\nprevious explanations of transformers, such as induction heads and task\nvectors, relate to sequentiality in ICL in these toy examples. Our\ninvestigation lays the groundwork for further research into how transformers\nprocess and perceive sequential data.\n","authors":["Ryan Campbell","Emma Guo","Evan Hu","Reya Vir","Ethan Hsiao"],"pdf_url":"https://arxiv.org/pdf/2312.12655v2.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2305.15616v3","updated":"2023-12-21T04:28:02Z","published":"2023-05-24T23:24:18Z","title":"Reversible and irreversible bracket-based dynamics for deep graph neural\n networks","summary":" Recent works have shown that physics-inspired architectures allow the\ntraining of deep graph neural networks (GNNs) without oversmoothing. The role\nof these physics is unclear, however, with successful examples of both\nreversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena\nproducing comparable results despite diametrically opposed mechanisms, and\nfurther complications arising due to empirical departures from mathematical\ntheory. This work presents a series of novel GNN architectures based upon\nstructure-preserving bracket-based dynamical systems, which are provably\nguaranteed to either conserve energy or generate positive dissipation with\nincreasing depth. It is shown that the theoretically principled framework\nemployed here allows for inherently explainable constructions, which\ncontextualize departures from theory in current architectures and better\nelucidate the roles of reversibility and irreversibility in network\nperformance.\n","authors":["Anthony Gruber","Kookjin Lee","Nathaniel Trask"],"pdf_url":"https://arxiv.org/pdf/2305.15616v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13565v1","updated":"2023-12-21T04:19:43Z","published":"2023-12-21T04:19:43Z","title":"Automatic Curriculum Learning with Gradient Reward Signals","summary":" This paper investigates the impact of using gradient norm reward signals in\nthe context of Automatic Curriculum Learning (ACL) for deep reinforcement\nlearning (DRL). We introduce a framework where the teacher model, utilizing the\ngradient norm information of a student model, dynamically adapts the learning\ncurriculum. This approach is based on the hypothesis that gradient norms can\nprovide a nuanced and effective measure of learning progress. Our experimental\nsetup involves several reinforcement learning environments (PointMaze, AntMaze,\nand AdroitHandRelocate), to assess the efficacy of our method. We analyze how\ngradient norm rewards influence the teacher's ability to craft challenging yet\nachievable learning sequences, ultimately enhancing the student's performance.\nOur results show that this approach not only accelerates the learning process\nbut also leads to improved generalization and adaptability in complex tasks.\nThe findings underscore the potential of gradient norm signals in creating more\nefficient and robust ACL systems, opening new avenues for research in\ncurriculum learning and reinforcement learning.\n","authors":["Ryan Campbell","Junsang Yoon"],"pdf_url":"https://arxiv.org/pdf/2312.13565v1.pdf","comment":"11 pages, 15 figures"},{"id":"http://arxiv.org/abs/2312.13558v1","updated":"2023-12-21T03:51:08Z","published":"2023-12-21T03:51:08Z","title":"The Truth is in There: Improving Reasoning in Language Models with\n Layer-Selective Rank Reduction","summary":" Transformer-based Large Language Models (LLMs) have become a fixture in\nmodern machine learning. Correspondingly, significant resources are allocated\ntowards research that aims to further advance this technology, typically\nresulting in models of increasing size that are trained on increasing amounts\nof data. This work, however, demonstrates the surprising result that it is\noften possible to significantly improve the performance of LLMs by selectively\nremoving higher-order components of their weight matrices. This simple\nintervention, which we call LAyer-SElective Rank reduction (LASER), can be done\non a model after training has completed, and requires no additional parameters\nor data. We show extensive experiments demonstrating the generality of this\nfinding across language models and datasets, and provide in-depth analyses\noffering insights into both when LASER is effective and the mechanism by which\nit operates.\n","authors":["Pratyusha Sharma","Jordan T. Ash","Dipendra Misra"],"pdf_url":"https://arxiv.org/pdf/2312.13558v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13555v1","updated":"2023-12-21T03:46:29Z","published":"2023-12-21T03:46:29Z","title":"CR-SAM: Curvature Regularized Sharpness-Aware Minimization","summary":" The capacity to generalize to future unseen data stands as one of the utmost\ncrucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM)\naims to enhance the generalizability by minimizing worst-case loss using\none-step gradient ascent as an approximation. However, as training progresses,\nthe non-linearity of the loss landscape increases, rendering one-step gradient\nascent less effective. On the other hand, multi-step gradient ascent will incur\nhigher training cost. In this paper, we introduce a normalized Hessian trace to\naccurately measure the curvature of loss landscape on {\\em both} training and\ntest sets. In particular, to counter excessive non-linearity of loss landscape,\nwe propose Curvature Regularized SAM (CR-SAM), integrating the normalized\nHessian trace as a SAM regularizer. Additionally, we present an efficient way\nto compute the trace via finite differences with parallelism. Our theoretical\nanalysis based on PAC-Bayes bounds establishes the regularizer's efficacy in\nreducing generalization error. Empirical evaluation on CIFAR and ImageNet\ndatasets shows that CR-SAM consistently enhances classification performance for\nResNet and Vision Transformer (ViT) models across various datasets. Our code is\navailable at https://github.com/TrustAIoT/CR-SAM.\n","authors":["Tao Wu","Tie Luo","Donald C. Wunsch"],"pdf_url":"https://arxiv.org/pdf/2312.13555v1.pdf","comment":"AAAI 2024, main track"},{"id":"http://arxiv.org/abs/2312.09244v2","updated":"2023-12-21T03:40:07Z","published":"2023-12-14T18:59:04Z","title":"Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate\n Reward Hacking","summary":" Reward models play a key role in aligning language model applications towards\nhuman preferences. However, this setup creates an incentive for the language\nmodel to exploit errors in the reward model to achieve high estimated reward, a\nphenomenon often termed \\emph{reward hacking}. A natural mitigation is to train\nan ensemble of reward models, aggregating over model outputs to obtain a more\nrobust reward estimate. We explore the application of reward ensembles to\nalignment at both training time (through reinforcement learning) and inference\ntime (through reranking). First, we show that reward models are\n\\emph{underspecified}: reward models that perform similarly in-distribution can\nyield very different rewards when used in alignment, due to distribution shift.\nSecond, underspecification results in overoptimization, where alignment to one\nreward model does not improve reward as measured by another reward model\ntrained on the same data. Third, overoptimization is mitigated by the use of\nreward ensembles, and ensembles that vary by their \\emph{pretraining} seeds\nlead to better generalization than ensembles that differ only by their\n\\emph{fine-tuning} seeds, with both outperforming individual reward models.\nHowever, even pretrain reward ensembles do not eliminate reward hacking: we\nshow several qualitative reward hacking phenomena that are not mitigated by\nensembling because all reward models in the ensemble exhibit similar error\npatterns.\n","authors":["Jacob Eisenstein","Chirag Nagpal","Alekh Agarwal","Ahmad Beirami","Alex D'Amour","DJ Dvijotham","Adam Fisch","Katherine Heller","Stephen Pfohl","Deepak Ramachandran","Peter Shaw","Jonathan Berant"],"pdf_url":"https://arxiv.org/pdf/2312.09244v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13032v2","updated":"2023-12-21T03:02:35Z","published":"2023-12-20T13:56:27Z","title":"NodeMixup: Tackling Under-Reaching for Graph Neural Networks","summary":" Graph Neural Networks (GNNs) have become mainstream methods for solving the\nsemi-supervised node classification problem. However, due to the uneven\nlocation distribution of labeled nodes in the graph, labeled nodes are only\naccessible to a small portion of unlabeled nodes, leading to the\n\\emph{under-reaching} issue. In this study, we firstly reveal under-reaching by\nconducting an empirical investigation on various well-known graphs. Then, we\ndemonstrate that under-reaching results in unsatisfactory distribution\nalignment between labeled and unlabeled nodes through systematic experimental\nanalysis, significantly degrading GNNs' performance. To tackle under-reaching\nfor GNNs, we propose an architecture-agnostic method dubbed NodeMixup. The\nfundamental idea is to (1) increase the reachability of labeled nodes by\nlabeled-unlabeled pairs mixup, (2) leverage graph structures via fusing the\nneighbor connections of intra-class node pairs to improve performance gains of\nmixup, and (3) use neighbor label distribution similarity incorporating node\ndegrees to determine sampling weights for node mixup. Extensive experiments\ndemonstrate the efficacy of NodeMixup in assisting GNNs in handling\nunder-reaching. The source code is available at\n\\url{https://github.com/WeigangLu/NodeMixup}.\n","authors":["Weigang Lu","Ziyu Guan","Wei Zhao","Yaming Yang","Long Jin"],"pdf_url":"https://arxiv.org/pdf/2312.13032v2.pdf","comment":"Accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2312.12464v2","updated":"2023-12-21T02:43:26Z","published":"2023-12-18T21:11:17Z","title":"Towards Better Serialization of Tabular Data for Few-shot Classification\n with Large Language Models","summary":" We present a study on the integration of Large Language Models (LLMs) in\ntabular data classification, emphasizing an efficient framework. Building upon\nexisting work done in TabLLM (arXiv:2210.10723), we introduce three novel\nserialization techniques, including the standout LaTeX serialization method.\nThis method significantly boosts the performance of LLMs in processing\ndomain-specific datasets, Our method stands out for its memory efficiency and\nability to fully utilize complex data structures. Through extensive\nexperimentation, including various serialization approaches like feature\ncombination and importance, we demonstrate our work's superiority in accuracy\nand efficiency over traditional models.\n","authors":["Sukriti Jaitly","Tanay Shah","Ashish Shugani","Razik Singh Grewal"],"pdf_url":"https://arxiv.org/pdf/2312.12464v2.pdf","comment":"4 pages, 2 figures"},{"id":"http://arxiv.org/abs/2312.13536v1","updated":"2023-12-21T02:37:56Z","published":"2023-12-21T02:37:56Z","title":"Domain Adaptive Graph Classification","summary":" Despite the remarkable accomplishments of graph neural networks (GNNs), they\ntypically rely on task-specific labels, posing potential challenges in terms of\ntheir acquisition. Existing work have been made to address this issue through\nthe lens of unsupervised domain adaptation, wherein labeled source graphs are\nutilized to enhance the learning process for target data. However, the\nsimultaneous exploration of graph topology and reduction of domain disparities\nremains a substantial hurdle. In this paper, we introduce the Dual Adversarial\nGraph Representation Learning (DAGRL), which explore the graph topology from\ndual branches and mitigate domain discrepancies via dual adversarial learning.\nOur method encompasses a dual-pronged structure, consisting of a graph\nconvolutional network branch and a graph kernel branch, which enables us to\ncapture graph semantics from both implicit and explicit perspectives. Moreover,\nour approach incorporates adaptive perturbations into the dual branches, which\nalign the source and target distribution to address domain discrepancies.\nExtensive experiments on a wild range graph classification datasets demonstrate\nthe effectiveness of our proposed method.\n","authors":["Siyang Luo","Ziyi Jiang","Zhenghan Chen","Xiaoxuan Liang"],"pdf_url":"https://arxiv.org/pdf/2312.13536v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10423v2","updated":"2023-12-21T02:23:32Z","published":"2023-12-16T11:32:28Z","title":"Stochastic Bayesian Optimization with Unknown Continuous Context\n Distribution via Kernel Density Estimation","summary":" Bayesian optimization (BO) is a sample-efficient method and has been widely\nused for optimizing expensive black-box functions. Recently, there has been a\nconsiderable interest in BO literature in optimizing functions that are\naffected by context variable in the environment, which is uncontrollable by\ndecision makers. In this paper, we focus on the optimization of functions'\nexpectations over continuous context variable, subject to an unknown\ndistribution. To address this problem, we propose two algorithms that employ\nkernel density estimation to learn the probability density function (PDF) of\ncontinuous context variable online. The first algorithm is simpler, which\ndirectly optimizes the expectation under the estimated PDF. Considering that\nthe estimated PDF may have high estimation error when the true distribution is\ncomplicated, we further propose the second algorithm that optimizes the\ndistributionally robust objective. Theoretical results demonstrate that both\nalgorithms have sub-linear Bayesian cumulative regret on the expectation\nobjective. Furthermore, we conduct numerical experiments to empirically\ndemonstrate the effectiveness of our algorithms.\n","authors":["Xiaobin Huang","Lei Song","Ke Xue","Chao Qian"],"pdf_url":"https://arxiv.org/pdf/2312.10423v2.pdf","comment":"AAAI 2024 Accept"},{"id":"http://arxiv.org/abs/2312.13530v1","updated":"2023-12-21T02:14:41Z","published":"2023-12-21T02:14:41Z","title":"HW-V2W-Map: Hardware Vulnerability to Weakness Mapping Framework for\n Root Cause Analysis with GPT-assisted Mitigation Suggestion","summary":" The escalating complexity of modern computing frameworks has resulted in a\nsurge in the cybersecurity vulnerabilities reported to the National\nVulnerability Database (NVD) by practitioners. Despite the fact that the\nstature of NVD is one of the most significant databases for the latest insights\ninto vulnerabilities, extracting meaningful trends from such a large amount of\nunstructured data is still challenging without the application of suitable\ntechnological methodologies. Previous efforts have mostly concentrated on\nsoftware vulnerabilities; however, a holistic strategy incorporates approaches\nfor mitigating vulnerabilities, score prediction, and a knowledge-generating\nsystem that may extract relevant insights from the Common Weakness Enumeration\n(CWE) and Common Vulnerability Exchange (CVE) databases is notably absent. As\nthe number of hardware attacks on Internet of Things (IoT) devices continues to\nrapidly increase, we present the Hardware Vulnerability to Weakness Mapping\n(HW-V2W-Map) Framework, which is a Machine Learning (ML) framework focusing on\nhardware vulnerabilities and IoT security. The architecture that we have\nproposed incorporates an Ontology-driven Storytelling framework, which\nautomates the process of updating the ontology in order to recognize patterns\nand evolution of vulnerabilities over time and provides approaches for\nmitigating the vulnerabilities. The repercussions of vulnerabilities can be\nmitigated as a result of this, and conversely, future exposures can be\npredicted and prevented. Furthermore, our proposed framework utilized\nGenerative Pre-trained Transformer (GPT) Large Language Models (LLMs) to\nprovide mitigation suggestions.\n","authors":["Yu-Zheng Lin","Muntasir Mamun","Muhtasim Alam Chowdhury","Shuyu Cai","Mingyu Zhu","Banafsheh Saber Latibari","Kevin Immanuel Gubbi","Najmeh Nazari Bavarsad","Arjun Caputo","Avesta Sasan","Houman Homayoun","Setareh Rafatirad","Pratik Satam","Soheil Salehi"],"pdf_url":"https://arxiv.org/pdf/2312.13530v1.pdf","comment":"22 pages, 10 pages appendix, 10 figures, Submitted to ACM TODAES"},{"id":"http://arxiv.org/abs/2312.13519v1","updated":"2023-12-21T01:50:02Z","published":"2023-12-21T01:50:02Z","title":"Secure Information Embedding in Images with Hybrid Firefly Algorithm","summary":" Various methods have been proposed to secure access to sensitive information\nover time, such as the many cryptographic methods in use to facilitate secure\ncommunications on the internet. But other methods like steganography have been\noverlooked which may be more suitable in cases where the act of transmission of\nsensitive information itself should remain a secret. Multiple techniques that\nare commonly discussed for such scenarios suffer from low capacity and high\ndistortion in the output signal. This research introduces a novel\nsteganographic approach for concealing a confidential portable document format\n(PDF) document within a host image by employing the Hybrid Firefly algorithm\n(HFA) proposed to select the pixel arrangement. This algorithm combines two\nwidely used optimization algorithms to improve their performance. The suggested\nmethodology utilizes the HFA algorithm to conduct a search for optimal pixel\nplacements in the spatial domain. The purpose of this search is to accomplish\ntwo main goals: increasing the host image's capacity and reducing distortion.\nMoreover, the proposed approach intends to reduce the time required for the\nembedding procedure. The findings indicate a decrease in image distortion and\nan accelerated rate of convergence in the search process. The resultant\nembeddings exhibit robustness against steganalytic assaults, hence rendering\nthe identification of the embedded data a formidable undertaking.\n","authors":["Sahil Nokhwal","Manoj Chandrasekharan","Ankit Chaudhary"],"pdf_url":"https://arxiv.org/pdf/2312.13519v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01057v2","updated":"2023-12-21T01:30:38Z","published":"2023-12-02T08:04:29Z","title":"RLHF and IIA: Perverse Incentives","summary":" Existing algorithms for reinforcement learning from human feedback (RLHF) can\nincentivize responses at odds with preferences because they are based on models\nthat assume independence of irrelevant alternatives (IIA). The perverse\nincentives induced by IIA give rise to egregious behavior when innovating on\nquery formats or learning algorithms.\n","authors":["Wanqiao Xu","Shi Dong","Xiuyuan Lu","Grace Lam","Zheng Wen","Benjamin Van Roy"],"pdf_url":"https://arxiv.org/pdf/2312.01057v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.03907v3","updated":"2023-12-21T01:25:28Z","published":"2023-04-08T04:23:46Z","title":"Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic\n Embedding","summary":" This paper presents an approach, Spectral Dynamics Embedding Control (SDEC),\nto optimal control for nonlinear stochastic systems. This method leverages an\ninfinite-dimensional feature to linearly represent the state-action value\nfunction and exploits finite-dimensional truncation approximation for practical\nimplementation. To characterize the effectiveness of these finite dimensional\napproximations, we provide an in-depth theoretical analysis to characterize the\napproximation error induced by the finite-dimension truncation and statistical\nerror induced by finite-sample approximation in both policy evaluation and\npolicy optimization. Our analysis includes two prominent kernel approximation\nmethods: truncations onto random features and Nystrom features. We also\nempirically test the algorithm and compare the performance with Koopman-based,\niLQR, and energy-based methods on a few benchmark problems.\n","authors":["Tongzheng Ren","Zhaolin Ren","Haitong Ma","Na Li","Bo Dai"],"pdf_url":"https://arxiv.org/pdf/2304.03907v3.pdf","comment":"Compared to v1, added analysis of Nystrom features, more streamlined\n proofs, and more extensive numerical studies; compared to v2, corrected a\n small error in ordering of author list"},{"id":"http://arxiv.org/abs/2105.08526v2","updated":"2023-12-21T01:23:26Z","published":"2021-05-18T13:43:18Z","title":"Transformers à Grande Vitesse","summary":" Robust travel time predictions are of prime importance in managing any\ntransportation infrastructure, and particularly in rail networks where they\nhave major impacts both on traffic regulation and passenger satisfaction. We\naim at predicting the travel time of trains on rail sections at the scale of an\nentire rail network in real-time, by estimating trains' delays relative to a\ntheoretical circulation plan.\n Predicting the evolution of a given train's delay is a uniquely hard problem,\ndistinct from mainstream road traffic forecasting problems, since it involves\nseveral hard-to-model phenomena: train spacing, station congestion and\nheterogeneous rolling stock among others. We first offer empirical evidence of\nthe previously unexplored phenomenon of delay propagation at the scale of a\nrailway network, leading to delays being amplified by interactions between\ntrains and the network's physical limitations.\n We then contribute a novel technique using the transformer architecture and\npre-trained embeddings to make real-time massively parallel predictions for\ntrain delays at the scale of the whole rail network (over 3000 trains at peak\nhours, making predictions at an average horizon of 70 minutes). Our approach\nyields very positive results on real-world data when compared to currently-used\nand experimental prediction techniques.\n","authors":["Farid Arthaud","Guillaume Lecoeur","Alban Pierre"],"pdf_url":"https://arxiv.org/pdf/2105.08526v2.pdf","comment":"10 pages including 1 page of appendices, 5 figures. Presented at\n IAROR RailBelgrade 2023 and published in Journal of Rail Transport P&M"},{"id":"http://arxiv.org/abs/2301.11442v3","updated":"2023-12-21T01:17:17Z","published":"2023-01-26T22:06:24Z","title":"Communication-Efficient Collaborative Regret Minimization in Multi-Armed\n Bandits","summary":" In this paper, we study the collaborative learning model, which concerns the\ntradeoff between parallelism and communication overhead in multi-agent\nmulti-armed bandits. For regret minimization in multi-armed bandits, we present\nthe first set of tradeoffs between the number of rounds of communication among\nthe agents and the regret of the collaborative learning process.\n","authors":["Nikolai Karpov","Qin Zhang"],"pdf_url":"https://arxiv.org/pdf/2301.11442v3.pdf","comment":"13 pages, 1 figure"},{"id":"http://arxiv.org/abs/2312.13511v1","updated":"2023-12-21T01:12:44Z","published":"2023-12-21T01:12:44Z","title":"Symmetry-enforcing neural networks with applications to constitutive\n modeling","summary":" The use of machine learning techniques to homogenize the effective behavior\nof arbitrary microstructures has been shown to be not only efficient but also\naccurate. In a recent work, we demonstrated how to combine state-of-the-art\nmicromechanical modeling and advanced machine learning techniques to homogenize\ncomplex microstructures exhibiting non-linear and history dependent behaviors.\nThe resulting homogenized model, termed smart constitutive law (SCL), enables\nthe adoption of microstructurally informed constitutive laws into finite\nelement solvers at a fraction of the computational cost required by traditional\nconcurrent multiscale approaches. In this work, the capabilities of SCLs are\nexpanded via the introduction of a novel methodology that enforces material\nsymmetries at the neuron level, applicable across various neural network\narchitectures. This approach utilizes tensor-based features in neural networks,\nfacilitating the concise and accurate representation of symmetry-preserving\noperations, and is general enough to be extend to problems beyond constitutive\nmodeling. Details on the construction of these tensor-based neural networks and\ntheir application in learning constitutive laws are presented for both elastic\nand inelastic materials. The superiority of this approach over traditional\nneural networks is demonstrated in scenarios with limited data and strong\nsymmetries, through comprehensive testing on various materials, including\nisotropic neo-Hookean materials and tensegrity lattice metamaterials. This work\nis concluded by a discussion on the potential of this methodology to discover\nsymmetry bases in materials and by an outline of future research directions.\n","authors":["Kévin Garanger","Julie Kraus","Julian J. Rimoli"],"pdf_url":"https://arxiv.org/pdf/2312.13511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11650v5","updated":"2023-12-21T01:06:56Z","published":"2023-05-19T12:58:25Z","title":"Moment Matching Denoising Gibbs Sampling","summary":" Energy-Based Models (EBMs) offer a versatile framework for modeling complex\ndata distributions. However, training and sampling from EBMs continue to pose\nsignificant challenges. The widely-used Denoising Score Matching (DSM) method\nfor scalable EBM training suffers from inconsistency issues, causing the energy\nmodel to learn a `noisy' data distribution. In this work, we propose an\nefficient sampling framework: (pseudo)-Gibbs sampling with moment matching,\nwhich enables effective sampling from the underlying clean model when given a\n`noisy' model that has been well-trained via DSM. We explore the benefits of\nour approach compared to related methods and demonstrate how to scale the\nmethod to high-dimensional datasets.\n","authors":["Mingtian Zhang","Alex Hawkins-Hooker","Brooks Paige","David Barber"],"pdf_url":"https://arxiv.org/pdf/2305.11650v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13508v1","updated":"2023-12-21T00:55:12Z","published":"2023-12-21T00:55:12Z","title":"Multimodal Federated Learning with Missing Modality via Prototype Mask\n and Contrast","summary":" In real-world scenarios, multimodal federated learning often faces the\npractical challenge of intricate modality missing, which poses constraints on\nbuilding federated frameworks and significantly degrades model inference\naccuracy. Existing solutions for addressing missing modalities generally\ninvolve developing modality-specific encoders on clients and training modality\nfusion modules on servers. However, these methods are primarily constrained to\nspecific scenarios with either unimodal clients or complete multimodal clients,\nstruggling to generalize effectively in the intricate modality missing\nscenarios. In this paper, we introduce a prototype library into the\nFedAvg-based Federated Learning framework, thereby empowering the framework\nwith the capability to alleviate the global model performance degradation\nresulting from modality missing during both training and testing. The proposed\nmethod utilizes prototypes as masks representing missing modalities to\nformulate a task-calibrated training loss and a model-agnostic uni-modality\ninference strategy. In addition, a proximal term based on prototypes is\nconstructed to enhance local training. Experimental results demonstrate the\nstate-of-the-art performance of our approach. Compared to the baselines, our\nmethod improved inference accuracy by 3.7\\% with 50\\% modality missing during\ntraining and by 23.8\\% during uni-modality inference. Code is available at\nhttps://github.com/BaoGuangYin/PmcmFL.\n","authors":["Guangyin Bao","Qi Zhang","Duoqian Miao","Zixuan Gong","Liang Hu"],"pdf_url":"https://arxiv.org/pdf/2312.13508v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2312.12337v2","updated":"2023-12-21T00:26:03Z","published":"2023-12-19T17:03:50Z","title":"pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable\n Generalizable 3D Reconstruction","summary":" We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D\nradiance fields parameterized by 3D Gaussian primitives from pairs of images.\nOur model features real-time and memory-efficient rendering for scalable\ntraining as well as fast 3D reconstruction at inference time. To overcome local\nminima inherent to sparse and locally supported representations, we predict a\ndense probability distribution over 3D and sample Gaussian means from that\nprobability distribution. We make this sampling operation differentiable via a\nreparameterization trick, allowing us to back-propagate gradients through the\nGaussian splatting representation. We benchmark our method on wide-baseline\nnovel view synthesis on the real-world RealEstate10k and ACID datasets, where\nwe outperform state-of-the-art light field transformers and accelerate\nrendering by 2.5 orders of magnitude while reconstructing an interpretable and\neditable 3D radiance field.\n","authors":["David Charatan","Sizhe Li","Andrea Tagliasacchi","Vincent Sitzmann"],"pdf_url":"https://arxiv.org/pdf/2312.12337v2.pdf","comment":"Project page: https://dcharatan.github.io/pixelsplat"},{"id":"http://arxiv.org/abs/2304.06762v3","updated":"2023-12-21T00:18:48Z","published":"2023-04-13T18:04:19Z","title":"Shall We Pretrain Autoregressive Language Models with Retrieval? A\n Comprehensive Study","summary":" Large decoder-only language models (LMs) can be largely improved in terms of\nperplexity by retrieval (e.g., RETRO), but its impact on text generation\nquality and downstream task accuracy is unclear. Thus, it is still an open\nquestion: shall we pretrain large autoregressive LMs with retrieval? To answer\nit, we perform a comprehensive study on a scalable pre-trained\nretrieval-augmented LM (i.e., RETRO) compared with standard GPT and\nretrieval-augmented GPT incorporated at fine-tuning or inference stages. We\nfirst provide the recipe to reproduce RETRO up to 9.5B parameters while\nretrieving a text corpus with 330B tokens. Based on that, we have the following\nnovel findings: i) RETRO outperforms GPT on text generation with much less\ndegeneration (i.e., repetition), moderately higher factual accuracy, and\nslightly lower toxicity with a nontoxic retrieval database. ii) On the LM\nEvaluation Harness benchmark, RETRO largely outperforms GPT on\nknowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,\nwe introduce a simple variant of the model, RETRO++, which largely improves\nopen-domain QA results of original RETRO (e.g., EM score +8.6 on Natural\nQuestion) and significantly outperforms retrieval-augmented GPT in both\nfine-tuning and zero-shot evaluation settings. Our findings highlight the\npromising direction of pretraining autoregressive LMs with retrieval as future\nfoundation models. We release our code and model at:\nhttps://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md\n","authors":["Boxin Wang","Wei Ping","Peng Xu","Lawrence McAfee","Zihan Liu","Mohammad Shoeybi","Yi Dong","Oleksii Kuchaiev","Bo Li","Chaowei Xiao","Anima Anandkumar","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2304.06762v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2312.14334v1","updated":"2023-12-21T23:42:00Z","published":"2023-12-21T23:42:00Z","title":"DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias\n Correction)","summary":" The Adam optimizer is a popular choice in contemporary deep learning, due to\nits strong empirical performance. However we observe that in privacy sensitive\nscenarios, the traditional use of Differential Privacy (DP) with the Adam\noptimizer leads to sub-optimal performance on several tasks. We find that this\nperformance degradation is due to a DP bias in Adam's second moment estimator,\nintroduced by the addition of independent noise in the gradient computation to\nenforce DP guarantees. This DP bias leads to a different scaling for low\nvariance parameter updates, that is inconsistent with the behavior of\nnon-private Adam. We propose DP-AdamBC, an optimization algorithm which removes\nthe bias in the second moment estimation and retrieves the expected behaviour\nof Adam. Empirically, DP-AdamBC significantly improves the optimization\nperformance of DP-Adam by up to 3.5% in final accuracy in image, text, and\ngraph node classification tasks.\n","authors":["Qiaoyue Tang","Frederick Shpilevskiy","Mathias Lécuyer"],"pdf_url":"https://arxiv.org/pdf/2312.14334v1.pdf","comment":"Published as a conference paper at the 38th Annual AAAI Conference on\n Artificial Intelligence, Vancouver, 2024"},{"id":"http://arxiv.org/abs/2312.14333v1","updated":"2023-12-21T23:34:08Z","published":"2023-12-21T23:34:08Z","title":"Behaviour Modelling of Social Animals via Causal Structure Discovery and\n Graph Neural Networks","summary":" Better understanding the natural world is a crucial task with a wide range of\napplications. In environments with close proximity between humans and animals,\nsuch as zoos, it is essential to better understand the causes behind animal\nbehaviour and what interventions are responsible for changes in their\nbehaviours. This can help to predict unusual behaviours, mitigate detrimental\neffects and increase the well-being of animals. There has been work on\nmodelling the dynamics behind swarms of birds and insects but the complex\nsocial behaviours of mammalian groups remain less explored. In this work, we\npropose a method to build behavioural models using causal structure discovery\nand graph neural networks for time series. We apply this method to a mob of\nmeerkats in a zoo environment and study its ability to predict future actions\nand model the behaviour distribution at an individual-level and at a group\nlevel. We show that our method can match and outperform standard deep learning\narchitectures and generate more realistic data, while using fewer parameters\nand providing increased interpretability.\n","authors":["Gaël Gendron","Yang Chen","Mitchell Rogers","Yiping Liu","Mihailo Azhar","Shahrokh Heidari","David Arturo Soriano Valdez","Kobe Knowles","Padriac O'Leary","Simon Eyre","Michael Witbrock","Gillian Dobbie","Jiamou Liu","Patrice Delmas"],"pdf_url":"https://arxiv.org/pdf/2312.14333v1.pdf","comment":"9 pages, 7 figures, accepted as an extended abstract and poster at\n AAMAS 2024"},{"id":"http://arxiv.org/abs/2312.06914v3","updated":"2023-12-21T23:32:07Z","published":"2023-12-12T00:54:39Z","title":"Exploring Novel Object Recognition and Spontaneous Location Recognition\n Machine Learning Analysis Techniques in Alzheimer's Mice","summary":" Understanding object recognition patterns in mice is crucial for advancing\nbehavioral neuroscience and has significant implications for human health,\nparticularly in the realm of Alzheimer's research. This study is centered on\nthe development, application, and evaluation of a state-of-the-art\ncomputational pipeline designed to analyze such behaviors, specifically\nfocusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition\n(SLR) tasks. The pipeline integrates three advanced computational models:\nAny-Maze for initial data collection, DeepLabCut for detailed pose estimation,\nand Convolutional Neural Networks (CNNs) for nuanced behavioral classification.\nEmployed across four distinct mouse groups, this pipeline demonstrated high\nlevels of accuracy and robustness. Despite certain challenges like video\nquality limitations and the need for manual calculations, the results affirm\nthe pipeline's efficacy and potential for scalability. The study serves as a\nproof of concept for a multidimensional computational approach to behavioral\nneuroscience, emphasizing the pipeline's versatility and readiness for future,\nmore complex analyses.\n","authors":["Soham Bafana"],"pdf_url":"https://arxiv.org/pdf/2312.06914v3.pdf","comment":"Aspects of the paper contain errors, and data in the pipeline must be\n vetted one more time. More testing is necessary"},{"id":"http://arxiv.org/abs/2312.14331v1","updated":"2023-12-21T23:31:35Z","published":"2023-12-21T23:31:35Z","title":"Maximum entropy GFlowNets with soft Q-learning","summary":" Generative Flow Networks (GFNs) have emerged as a powerful tool for sampling\ndiscrete objects from unnormalized distributions, offering a scalable\nalternative to Markov Chain Monte Carlo (MCMC) methods. While GFNs draw\ninspiration from maximum entropy reinforcement learning (RL), the connection\nbetween the two has largely been unclear and seemingly applicable only in\nspecific cases. This paper addresses the connection by constructing an\nappropriate reward function, thereby establishing an exact relationship between\nGFNs and maximum entropy RL. This construction allows us to introduce maximum\nentropy GFNs, which, in contrast to GFNs with uniform backward policy, achieve\nthe maximum entropy attainable by GFNs without constraints on the state space.\n","authors":["Sobhan Mohammadpour","Emmanuel Bengio","Emma Frejinger","Pierre-Luc Bacon"],"pdf_url":"https://arxiv.org/pdf/2312.14331v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14329v1","updated":"2023-12-21T23:20:47Z","published":"2023-12-21T23:20:47Z","title":"Invariant Anomaly Detection under Distribution Shifts: A Causal\n Perspective","summary":" Anomaly detection (AD) is the machine learning task of identifying highly\ndiscrepant abnormal samples by solely relying on the consistency of the normal\ntraining samples. Under the constraints of a distribution shift, the assumption\nthat training samples and test samples are drawn from the same distribution\nbreaks down. In this work, by leveraging tools from causal inference we attempt\nto increase the resilience of anomaly detection models to different kinds of\ndistribution shifts. We begin by elucidating a simple yet necessary statistical\nproperty that ensures invariant representations, which is critical for robust\nAD under both domain and covariate shifts. From this property, we derive a\nregularization term which, when minimized, leads to partial distribution\ninvariance across environments. Through extensive experimental evaluation on\nboth synthetic and real-world tasks, covering a range of six different AD\nmethods, we demonstrated significant improvements in out-of-distribution\nperformance. Under both covariate and domain shift, models regularized with our\nproposed term showed marked increased robustness. Code is available at:\nhttps://github.com/JoaoCarv/invariant-anomaly-detection.\n","authors":["João B. S. Carvalho","Mengtao Zhang","Robin Geyer","Carlos Cotrini","Joachim M. Buhmann"],"pdf_url":"https://arxiv.org/pdf/2312.14329v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01479v4","updated":"2023-12-21T22:56:45Z","published":"2023-12-03T18:41:54Z","title":"OpenVoice: Versatile Instant Voice Cloning","summary":" We introduce OpenVoice, a versatile voice cloning approach that requires only\na short audio clip from the reference speaker to replicate their voice and\ngenerate speech in multiple languages. OpenVoice represents a significant\nadvancement in addressing the following open challenges in the field: 1)\nFlexible Voice Style Control. OpenVoice enables granular control over voice\nstyles, including emotion, accent, rhythm, pauses, and intonation, in addition\nto replicating the tone color of the reference speaker. The voice styles are\nnot directly copied from and constrained by the style of the reference speaker.\nPrevious approaches lacked the ability to flexibly manipulate voice styles\nafter cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves\nzero-shot cross-lingual voice cloning for languages not included in the\nmassive-speaker training set. Unlike previous approaches, which typically\nrequire extensive massive-speaker multi-lingual (MSML) dataset for all\nlanguages, OpenVoice can clone voices into a new language without any\nmassive-speaker training data for that language. OpenVoice is also\ncomputationally efficient, costing tens of times less than commercially\navailable APIs that offer even inferior performance. To foster further research\nin the field, we have made the source code and trained model publicly\naccessible. We also provide qualitative results in our demo website. Prior to\nits public release, our internal version of OpenVoice was used tens of millions\nof times by users worldwide between May and October 2023, serving as the\nbackend of MyShell.\n","authors":["Zengyi Qin","Wenliang Zhao","Xumin Yu","Xin Sun"],"pdf_url":"https://arxiv.org/pdf/2312.01479v4.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2306.03638v3","updated":"2023-12-21T22:37:27Z","published":"2023-06-04T11:31:41Z","title":"Provable convergence guarantees for black-box variational inference","summary":" Black-box variational inference is widely used in situations where there is\nno proof that its stochastic optimization succeeds. We suggest this is due to a\ntheoretical gap in existing stochastic optimization proofs: namely the\nchallenge of gradient estimators with unusual noise bounds, and a composite\nnon-smooth objective. For dense Gaussian variational families, we observe that\nexisting gradient estimators based on reparameterization satisfy a quadratic\nnoise bound and give novel convergence guarantees for proximal and projected\nstochastic gradient descent using this bound. This provides rigorous guarantees\nthat methods similar to those used in practice converge on realistic inference\nproblems.\n","authors":["Justin Domke","Guillaume Garrigos","Robert Gower"],"pdf_url":"https://arxiv.org/pdf/2306.03638v3.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14322v1","updated":"2023-12-21T22:27:32Z","published":"2023-12-21T22:27:32Z","title":"Data Needs and Challenges of Quantum Dot Devices Automation: Workshop\n Report","summary":" Gate-defined quantum dots are a promising candidate system to realize\nscalable, coupled qubit systems and serve as a fundamental building block for\nquantum computers. However, present-day quantum dot devices suffer from\nimperfections that must be accounted for, which hinders the characterization,\ntuning, and operation process. Moreover, with an increasing number of quantum\ndot qubits, the relevant parameter space grows sufficiently to make heuristic\ncontrol infeasible. Thus, it is imperative that reliable and scalable\nautonomous tuning approaches are developed. In this report, we outline current\nchallenges in automating quantum dot device tuning and operation with a\nparticular focus on datasets, benchmarking, and standardization. We also\npresent ideas put forward by the quantum dot community on how to overcome them.\n","authors":["Justyna P. Zwolak","Jacob M. Taylor","Reed Andrews","Jared Benson","Garnett Bryant","Donovan Buterakos","Anasua Chatterjee","Sankar Das Sarma","Mark A. Eriksson","Eliška Greplová","Michael J. Gullans","Fabian Hader","Tyler J. Kovach","Pranav S. Mundada","Mick Ramsey","Torbjoern Rasmussen","Brandon Severin","Anthony Sigillito","Brennan Undseth","Brian Weber"],"pdf_url":"https://arxiv.org/pdf/2312.14322v1.pdf","comment":"White paper/overview based on a workshop held at the National\n Institute of Standards and Technology, Gaithersburg, MD. 13 pages"},{"id":"http://arxiv.org/abs/2304.02086v2","updated":"2023-12-21T21:47:03Z","published":"2023-04-04T19:33:00Z","title":"Decentralized and Privacy-Preserving Learning of Approximate Stackelberg\n Solutions in Energy Trading Games with Demand Response Aggregators","summary":" In this work, a novel Stackelberg game theoretic framework is proposed for\ntrading energy bidirectionally between the demand-response (DR) aggregator and\nthe prosumers. This formulation allows for flexible energy arbitrage and\nadditional monetary rewards while ensuring that the prosumers' desired daily\nenergy demand is met. Then, a scalable (linear with the number of prosumers),\ndecentralized, privacy-preserving algorithm is proposed to find approximate\nequilibria with online sampling and learning of the prosumers' cumulative best\nresponse, which finds applications beyond this energy game. Moreover, cost\nbounds are provided on the quality of the approximate equilibrium solution.\nFinally, real data from the California day-ahead market and the UC Davis campus\nbuilding energy demands are utilized to demonstrate the efficacy of the\nproposed framework and algorithm.\n","authors":["Styliani I. Kampezidou","Justin Romberg","Kyriakos G. Vamvoudakis","Dimitri N. Mavris"],"pdf_url":"https://arxiv.org/pdf/2304.02086v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2312.14309v1","updated":"2023-12-21T21:40:47Z","published":"2023-12-21T21:40:47Z","title":"Federated Quantum Long Short-term Memory (FedQLSTM)","summary":" Quantum federated learning (QFL) can facilitate collaborative learning across\nmultiple clients using quantum machine learning (QML) models, while preserving\ndata privacy. Although recent advances in QFL span different tasks like\nclassification while leveraging several data types, no prior work has focused\non developing a QFL framework that utilizes temporal data to approximate\nfunctions useful to analyze the performance of distributed quantum sensing\nnetworks. In this paper, a novel QFL framework that is the first to integrate\nquantum long short-term memory (QLSTM) models with temporal data is proposed.\nThe proposed federated QLSTM (FedQLSTM) framework is exploited for performing\nthe task of function approximation. In this regard, three key use cases are\npresented: Bessel function approximation, sinusoidal delayed quantum feedback\ncontrol function approximation, and Struve function approximation. Simulation\nresults confirm that, for all considered use cases, the proposed FedQLSTM\nframework achieves a faster convergence rate under one local training epoch,\nminimizing the overall computations, and saving 25-33% of the number of\ncommunication rounds needed until convergence compared to an FL framework with\nclassical LSTM models.\n","authors":["Mahdi Chehimi","Samuel Yen-Chi Chen","Walid Saad","Shinjae Yoo"],"pdf_url":"https://arxiv.org/pdf/2312.14309v1.pdf","comment":"20 pages, 9 figures"},{"id":"http://arxiv.org/abs/2306.05745v2","updated":"2023-12-21T21:28:52Z","published":"2023-06-09T08:22:41Z","title":"Two Independent Teachers are Better Role Model","summary":" Recent deep learning models have attracted substantial attention in infant\nbrain analysis. These models have performed state-of-the-art performance, such\nas semi-supervised techniques (e.g., Temporal Ensembling, mean teacher).\nHowever, these models depend on an encoder-decoder structure with stacked local\noperators to gather long-range information, and the local operators limit the\nefficiency and effectiveness. Besides, the $MRI$ data contain different tissue\nproperties ($TPs$) such as $T1$ and $T2$. One major limitation of these models\nis that they use both data as inputs to the segment process, i.e., the models\nare trained on the dataset once, and it requires much computational and memory\nrequirements during inference. In this work, we address the above limitations\nby designing a new deep-learning model, called 3D-DenseUNet, which works as\nadaptable global aggregation blocks in down-sampling to solve the issue of\nspatial information loss. The self-attention module connects the down-sampling\nblocks to up-sampling blocks, and integrates the feature maps in three\ndimensions of spatial and channel, effectively improving the representation\npotential and discriminating ability of the model. Additionally, we propose a\nnew method called Two Independent Teachers ($2IT$), that summarizes the model\nweights instead of label predictions. Each teacher model is trained on\ndifferent types of brain data, $T1$ and $T2$, respectively. Then, a fuse model\nis added to improve test accuracy and enable training with fewer parameters and\nlabels compared to the Temporal Ensembling method without modifying the network\narchitecture. Empirical results demonstrate the effectiveness of the proposed\nmethod. The code is available at\nhttps://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.\n","authors":["Afifa Khaled","Ahmed A. Mubarak","Kun He"],"pdf_url":"https://arxiv.org/pdf/2306.05745v2.pdf","comment":"This manuscript contains 14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2312.14303v1","updated":"2023-12-21T21:26:09Z","published":"2023-12-21T21:26:09Z","title":"Geo2SigMap: High-Fidelity RF Signal Mapping Using Geographic Databases","summary":" Radio frequency (RF) signal mapping, which is the process of analyzing and\npredicting the RF signal strength and distribution across specific areas, is\ncrucial for cellular network planning and deployment. Traditional approaches to\nRF signal mapping rely on statistical models constructed based on measurement\ndata, which offer low complexity but often lack accuracy, or ray tracing tools,\nwhich provide enhanced precision for the target area but suffer from increased\ncomputational complexity. Recently, machine learning (ML) has emerged as a\ndata-driven method for modeling RF signal propagation, which leverages models\ntrained on synthetic datasets to perform RF signal mapping in \"unseen\" areas.\n In this paper, we present Geo2SigMap, an ML-based framework for efficient and\nhigh-fidelity RF signal mapping using geographic databases. First, we develop\nan automated framework that seamlessly integrates three open-source tools:\nOpenStreetMap (geographic databases), Blender (computer graphics), and Sionna\n(ray tracing), enabling the efficient generation of large-scale 3D building\nmaps and ray tracing models. Second, we propose a cascaded U-Net model, which\nis pre-trained on synthetic datasets and employed to generate detailed RF\nsignal maps, leveraging environmental information and sparse measurement data.\nFinally, we evaluate the performance of Geo2SigMap via a real-world measurement\ncampaign, where three types of user equipment (UE) collect over 45,000 data\npoints related to cellular information from six LTE cells operating in the\ncitizens broadband radio service (CBRS) band. Our results show that Geo2SigMap\nachieves an average root-mean-square-error (RMSE) of 6.04 dB for predicting the\nreference signal received power (RSRP) at the UE, representing an average RMSE\nimprovement of 3.59 dB compared to existing methods.\n","authors":["Yiming Li","Zeyu Li","Zhihui Gao","Tingjun Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14303v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14302v1","updated":"2023-12-21T21:22:41Z","published":"2023-12-21T21:22:41Z","title":"Exploiting Novel GPT-4 APIs","summary":" Language model attacks typically assume one of two extreme threat models:\nfull white-box access to model weights, or black-box access limited to a text\ngeneration API. However, real-world APIs are often more flexible than just text\ngeneration: these APIs expose ``gray-box'' access leading to new threat\nvectors. To explore this, we red-team three new functionalities exposed in the\nGPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that\nfine-tuning a model on as few as 15 harmful examples or 100 benign examples can\nremove core safeguards from GPT-4, enabling a range of harmful outputs.\nFurthermore, we find that GPT-4 Assistants readily divulge the function call\nschema and can be made to execute arbitrary function calls. Finally, we find\nthat knowledge retrieval can be hijacked by injecting instructions into\nretrieval documents. These vulnerabilities highlight that any additions to the\nfunctionality exposed by an API can create new vulnerabilities.\n","authors":["Kellin Pelrine","Mohammad Taufeeque","Michał Zając","Euan McLean","Adam Gleave"],"pdf_url":"https://arxiv.org/pdf/2312.14302v1.pdf","comment":"10 pages, 1 figure, 4 tables"},{"id":"http://arxiv.org/abs/1812.02207v3","updated":"2023-12-21T21:16:41Z","published":"2018-12-05T19:59:20Z","title":"Better Trees: An empirical study on hyperparameter tuning of\n classification decision tree induction algorithms","summary":" Machine learning algorithms often contain many hyperparameters (HPs) whose\nvalues affect the predictive performance of the induced models in intricate\nways. Due to the high number of possibilities for these HP configurations and\ntheir complex interactions, it is common to use optimization techniques to find\nsettings that lead to high predictive performance. However, insights into\nefficiently exploring this vast space of configurations and dealing with the\ntrade-off between predictive and runtime performance remain challenging.\nFurthermore, there are cases where the default HPs fit the suitable\nconfiguration. Additionally, for many reasons, including model validation and\nattendance to new legislation, there is an increasing interest in interpretable\nmodels, such as those created by the Decision Tree (DT) induction algorithms.\nThis paper provides a comprehensive approach for investigating the effects of\nhyperparameter tuning for the two DT induction algorithms most often used, CART\nand C4.5. DT induction algorithms present high predictive performance and\ninterpretable classification models, though many HPs need to be adjusted.\nExperiments were carried out with different tuning strategies to induce models\nand to evaluate HPs' relevance using 94 classification datasets from OpenML.\nThe experimental results point out that different HP profiles for the tuning of\neach algorithm provide statistically significant improvements in most of the\ndatasets for CART, but only in one-third for C4.5. Although different\nalgorithms may present different tuning scenarios, the tuning techniques\ngenerally required few evaluations to find accurate solutions. Furthermore, the\nbest technique for all the algorithms was the IRACE. Finally, we found out that\ntuning a specific small subset of HPs is a good alternative for achieving\noptimal predictive performance.\n","authors":["Rafael Gomes Mantovani","Tomáš Horváth","André L. D. Rossi","Ricardo Cerri","Sylvio Barbon Junior","Joaquin Vanschoren","André Carlos Ponce de Leon Ferreira de Carvalho"],"pdf_url":"https://arxiv.org/pdf/1812.02207v3.pdf","comment":"60 pages, 16 figures"},{"id":"http://arxiv.org/abs/2312.14299v1","updated":"2023-12-21T21:12:39Z","published":"2023-12-21T21:12:39Z","title":"Fairness in Submodular Maximization over a Matroid Constraint","summary":" Submodular maximization over a matroid constraint is a fundamental problem\nwith various applications in machine learning. Some of these applications\ninvolve decision-making over datapoints with sensitive attributes such as\ngender or race. In such settings, it is crucial to guarantee that the selected\nsolution is fairly distributed with respect to this attribute. Recently,\nfairness has been investigated in submodular maximization under a cardinality\nconstraint in both the streaming and offline settings, however the more general\nproblem with matroid constraint has only been considered in the streaming\nsetting and only for monotone objectives. This work fills this gap. We propose\nvarious algorithms and impossibility results offering different trade-offs\nbetween quality, fairness, and generality.\n","authors":["Marwa El Halabi","Jakub Tarnawski","Ashkan Norouzi-Fard","Thuy-Duong Vuong"],"pdf_url":"https://arxiv.org/pdf/2312.14299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14292v1","updated":"2023-12-21T20:48:15Z","published":"2023-12-21T20:48:15Z","title":"Benchmarking Multi-Agent Preference-based Reinforcement Learning for\n Human-AI Teaming","summary":" Preference-based Reinforcement Learning (PbRL) is an active area of research,\nand has made significant strides in single-agent actor and in observer\nhuman-in-the-loop scenarios. However, its application within the co-operative\nmulti-agent RL frameworks, where humans actively participate and express\npreferences for agent behavior, remains largely uncharted. We consider a\ntwo-agent (Human-AI) cooperative setup where both the agents are rewarded\naccording to human's reward function for the team. However, the agent does not\nhave access to it, and instead, utilizes preference-based queries to elicit its\nobjectives and human's preferences for the robot in the human-robot team. We\nintroduce the notion of Human-Flexibility, i.e. whether the human partner is\namenable to multiple team strategies, with a special case being Specified\nOrchestration where the human has a single team policy in mind (most\nconstrained case). We propose a suite of domains to study PbRL for Human-AI\ncooperative setup which explicitly require forced cooperation. Adapting\nstate-of-the-art single-agent PbRL algorithms to our two-agent setting, we\nconduct a comprehensive benchmarking study across our domain suite. Our\nfindings highlight the challenges associated with high degree of\nHuman-Flexibility and the limited access to the human's envisioned policy in\nPbRL for Human-AI cooperation. Notably, we observe that PbRL algorithms exhibit\neffective performance exclusively in the case of Specified Orchestration which\ncan be seen as an upper bound PbRL performance for future research.\n","authors":["Siddhant Bhambri","Mudit Verma","Anil Murthy","Subbarao Kambhampati"],"pdf_url":"https://arxiv.org/pdf/2312.14292v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14285v1","updated":"2023-12-21T20:40:51Z","published":"2023-12-21T20:40:51Z","title":"Probing Biological and Artificial Neural Networks with Task-dependent\n Neural Manifolds","summary":" Recently, growth in our understanding of the computations performed in both\nbiological and artificial neural networks has largely been driven by either\nlow-level mechanistic studies or global normative approaches. However, concrete\nmethodologies for bridging the gap between these levels of abstraction remain\nelusive. In this work, we investigate the internal mechanisms of neural\nnetworks through the lens of neural population geometry, aiming to provide\nunderstanding at an intermediate level of abstraction, as a way to bridge that\ngap. Utilizing manifold capacity theory (MCT) from statistical physics and\nmanifold alignment analysis (MAA) from high-dimensional statistics, we probe\nthe underlying organization of task-dependent manifolds in deep neural networks\nand macaque neural recordings. Specifically, we quantitatively characterize how\ndifferent learning objectives lead to differences in the organizational\nstrategies of these models and demonstrate how these geometric analyses are\nconnected to the decodability of task-relevant information. These analyses\npresent a strong direction for bridging mechanistic and normative theories in\nneural networks through neural population geometry, potentially opening up many\nfuture research avenues in both machine learning and neuroscience.\n","authors":["Michael Kuoch","Chi-Ning Chou","Nikhil Parthasarathy","Joel Dapello","James J. DiCarlo","Haim Sompolinsky","SueYeon Chung"],"pdf_url":"https://arxiv.org/pdf/2312.14285v1.pdf","comment":"To appear in the proceedings of the Conference on Parsimony and\n Learning (CPAL) 2024"},{"id":"http://arxiv.org/abs/2312.03824v2","updated":"2023-12-21T20:37:57Z","published":"2023-12-06T19:00:00Z","title":"nbi: the Astronomer's Package for Neural Posterior Estimation","summary":" Despite the promise of Neural Posterior Estimation (NPE) methods in\nastronomy, the adaptation of NPE into the routine inference workflow has been\nslow. We identify three critical issues: the need for custom featurizer\nnetworks tailored to the observed data, the inference inexactness, and the\nunder-specification of physical forward models. To address the first two\nissues, we introduce a new framework and open-source software nbi (Neural\nBayesian Inference), which supports both amortized and sequential NPE. First,\nnbi provides built-in \"featurizer\" networks with demonstrated efficacy on\nsequential data, such as light curve and spectra, thus obviating the need for\nthis customization on the user end. Second, we introduce a modified algorithm\nSNPE-IS, which facilities asymptotically exact inference by using the surrogate\nposterior under NPE only as a proposal distribution for importance sampling.\nThese features allow nbi to be applied off-the-shelf to astronomical inference\nproblems involving light curves and spectra. We discuss how nbi may serve as an\neffective alternative to existing methods such as Nested Sampling. Our package\nis at https://github.com/kmzzhang/nbi.\n","authors":["Keming Zhang","Joshua S. Bloom","Stéfan van der Walt","Nina Hernitschek"],"pdf_url":"https://arxiv.org/pdf/2312.03824v2.pdf","comment":"Update references. Accepted to NeurIPS 2023 Workshop on Deep Learning\n and Inverse Problems. Initially appeared at ICML 2023 Workshop on Machine\n Learning for Astrophysics. Code at https://github.com/kmzzhang/nbi"},{"id":"http://arxiv.org/abs/2312.14280v1","updated":"2023-12-21T20:25:16Z","published":"2023-12-21T20:25:16Z","title":"Fine-grained Forecasting Models Via Gaussian Process Blurring Effect","summary":" Time series forecasting is a challenging task due to the existence of complex\nand dynamic temporal dependencies. This can lead to incorrect predictions by\neven the best forecasting models. Using more training data is one way to\nimprove the accuracy, but this source is often limited. In contrast, we are\nbuilding on successful denoising approaches for image generation by advocating\nfor an end-to-end forecasting and denoising paradigm.\n We propose an end-to-end forecast-blur-denoise forecasting framework by\nencouraging a division of labors between the forecasting and the denoising\nmodels. The initial forecasting model is directed to focus on accurately\npredicting the coarse-grained behavior, while the denoiser model focuses on\ncapturing the fine-grained behavior that is locally blurred by integrating a\nGaussian Process model. All three parts are interacting for the best end-to-end\nperformance. Our extensive experiments demonstrate that our proposed approach\nis able to improve the forecasting accuracy of several state-of-the-art\nforecasting models as well as several other denoising approaches.\n","authors":["Sepideh Koohfar","Laura Dietz"],"pdf_url":"https://arxiv.org/pdf/2312.14280v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2312.14279v1","updated":"2023-12-21T20:17:01Z","published":"2023-12-21T20:17:01Z","title":"Characterizing and Classifying Developer Forum Posts with their\n Intentions","summary":" With the rapid growth of the developer community, the amount of posts on\nonline technical forums has been growing rapidly, which poses difficulties for\nusers to filter useful posts and find important information. Tags provide a\nconcise feature dimension for users to locate their interested posts and for\nsearch engines to index the most relevant posts according to the queries.\nHowever, most tags are only focused on the technical perspective (e.g., program\nlanguage, platform, tool). In most cases, forum posts in online developer\ncommunities reveal the author's intentions to solve a problem, ask for advice,\nshare information, etc. The modeling of the intentions of posts can provide an\nextra dimension to the current tag taxonomy. By referencing previous studies\nand learning from industrial perspectives, we create a refined taxonomy for the\nintentions of technical forum posts. Through manual labeling and analysis on a\nsampled post dataset extracted from online forums, we understand the relevance\nbetween the constitution of posts (code, error messages) and their intentions.\nFurthermore, inspired by our manual study, we design a pre-trained\ntransformer-based model to automatically predict post intentions. The best\nvariant of our intention prediction framework, which achieves a Micro F1-score\nof 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787,\noutperforms the state-of-the-art baseline approach. Our characterization and\nautomated classification of forum posts regarding their intentions may help\nforum maintainers or third-party tool developers improve the organization and\nretrieval of posts on technical forums. We have released our annotated dataset\nand codes in our supplementary material package.\n","authors":["Xingfang Wu","Eric Laufer","Heng Li","Foutse Khomh","Santhosh Srinivasan","Jayden Luo"],"pdf_url":"https://arxiv.org/pdf/2312.14279v1.pdf","comment":"39 pages"},{"id":"http://arxiv.org/abs/2312.14276v1","updated":"2023-12-21T19:57:29Z","published":"2023-12-21T19:57:29Z","title":"Deep Neural Networks and Finite Elements of Any Order on Arbitrary\n Dimensions","summary":" In this study, we establish that deep neural networks employing ReLU and\nReLU$^2$ activation functions are capable of representing Lagrange finite\nelement functions of any order on simplicial meshes across arbitrary\ndimensions. We introduce a novel global formulation of the basis functions for\nLagrange elements, grounded in a geometric decomposition of these elements and\nleveraging two essential properties of high-dimensional simplicial meshes and\nbarycentric coordinate functions. This representation theory facilitates a\nnatural approximation result for such deep neural networks. Our findings\npresent the first demonstration of how deep neural networks can systematically\ngenerate general continuous piecewise polynomial functions.\n","authors":["Juncai He","Jinchao Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14276v1.pdf","comment":"23 pages, 2 figures"},{"id":"http://arxiv.org/abs/2302.00845v5","updated":"2023-12-21T19:41:57Z","published":"2023-02-02T03:15:29Z","title":"Coordinating Distributed Example Orders for Provably Accelerated\n Training","summary":" Recent research on online Gradient Balancing (GraB) has revealed that there\nexist permutation-based example orderings for SGD that are guaranteed to\noutperform random reshuffling (RR). Whereas RR arbitrarily permutes training\nexamples, GraB leverages stale gradients from prior epochs to order examples --\nachieving a provably faster convergence rate than RR. However, GraB is limited\nby design: while it demonstrates an impressive ability to scale-up training on\ncentralized data, it does not naturally extend to modern distributed ML\nworkloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which\nuses insights from prior work on kernel thinning to translate the benefits of\nprovably faster permutation-based example ordering to distributed settings.\nWith negligible overhead, CD-GraB exhibits a linear speedup in convergence rate\nover centralized GraB and outperforms distributed RR on a variety of benchmark\ntasks.\n","authors":["A. Feder Cooper","Wentao Guo","Khiem Pham","Tiancheng Yuan","Charlie F. Ruan","Yucheng Lu","Christopher De Sa"],"pdf_url":"https://arxiv.org/pdf/2302.00845v5.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2210.11413v3","updated":"2023-12-21T19:41:37Z","published":"2022-10-16T11:53:42Z","title":"Minimizing low-rank models of high-order tensors: Hardness, span, tight\n relaxation, and applications","summary":" We consider the problem of finding the smallest or largest entry of a tensor\nof order N that is specified via its rank decomposition. Stated in a different\nway, we are given N sets of R-dimensional vectors and we wish to select one\nvector from each set such that the sum of the Hadamard product of the selected\nvectors is minimized or maximized. We show that this fundamental tensor problem\nis NP-hard for any tensor rank higher than one, and polynomial-time solvable in\nthe rank-one case. We also propose a continuous relaxation and prove that it is\ntight for any rank. For low-enough ranks, the proposed continuous reformulation\nis amenable to low-complexity gradient-based optimization, and we propose a\nsuite of gradient-based optimization algorithms drawing from projected gradient\ndescent, Frank-Wolfe, or explicit parametrization of the relaxed constraints.\nWe also show that our core results remain valid no matter what kind of polyadic\ntensor model is used to represent the tensor of interest, including Tucker,\nHOSVD/MLSVD, tensor train, or tensor ring. Next, we consider the class of\nproblems that can be posed as special instances of the problem of interest. We\nshow that this class includes the partition problem (and thus all NP-complete\nproblems via polynomial-time transformation), integer least squares, integer\nlinear programming, integer quadratic programming, sign retrieval (a special\nkind of mixed integer programming / restricted version of phase retrieval), and\nmaximum likelihood decoding of parity check codes. We demonstrate promising\nexperimental results on a number of hard problems, including state-of-art\nperformance in decoding low density parity check codes and general parity check\ncodes.\n","authors":["Nicholas D. Sidiropoulos","Paris Karakasis","Aritra Konar"],"pdf_url":"https://arxiv.org/pdf/2210.11413v3.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2312.14260v1","updated":"2023-12-21T19:21:36Z","published":"2023-12-21T19:21:36Z","title":"Elevating Defenses: Bridging Adversarial Training and Watermarking for\n Model Resilience","summary":" Machine learning models are being used in an increasing number of critical\napplications; thus, securing their integrity and ownership is critical. Recent\nstudies observed that adversarial training and watermarking have a conflicting\ninteraction. This work introduces a novel framework to integrate adversarial\ntraining with watermarking techniques to fortify against evasion attacks and\nprovide confident model verification in case of intellectual property theft. We\nuse adversarial training together with adversarial watermarks to train a robust\nwatermarked model. The key intuition is to use a higher perturbation budget to\ngenerate adversarial watermarks compared to the budget used for adversarial\ntraining, thus avoiding conflict. We use the MNIST and Fashion-MNIST datasets\nto evaluate our proposed technique on various model stealing attacks. The\nresults obtained consistently outperform the existing baseline in terms of\nrobustness performance and further prove the resilience of this defense against\npruning and fine-tuning removal attacks.\n","authors":["Janvi Thakkar","Giulio Zizzo","Sergio Maffeis"],"pdf_url":"https://arxiv.org/pdf/2312.14260v1.pdf","comment":"Accepted at DAI Workshop, AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14259v1","updated":"2023-12-21T19:21:19Z","published":"2023-12-21T19:21:19Z","title":"Multi-Agent Bandit Learning through Heterogeneous Action Erasure\n Channels","summary":" Multi-Armed Bandit (MAB) systems are witnessing an upswing in applications\nwithin multi-agent distributed environments, leading to the advancement of\ncollaborative MAB algorithms. In such settings, communication between agents\nexecuting actions and the primary learner making decisions can hinder the\nlearning process. A prevalent challenge in distributed learning is action\nerasure, often induced by communication delays and/or channel noise. This\nresults in agents possibly not receiving the intended action from the learner,\nsubsequently leading to misguided feedback. In this paper, we introduce novel\nalgorithms that enable learners to interact concurrently with distributed\nagents across heterogeneous action erasure channels with different action\nerasure probabilities. We illustrate that, in contrast to existing bandit\nalgorithms, which experience linear regret, our algorithms assure sub-linear\nregret guarantees. Our proposed solutions are founded on a meticulously crafted\nrepetition protocol and scheduling of learning across heterogeneous channels.\nTo our knowledge, these are the first algorithms capable of effectively\nlearning through heterogeneous action erasure channels. We substantiate the\nsuperior performance of our algorithm through numerical experiments,\nemphasizing their practical significance in addressing issues related to\ncommunication constraints and delays in multi-agent environments.\n","authors":["Osama A. Hanna","Merve Karakas","Lin F. Yang","Christina Fragouli"],"pdf_url":"https://arxiv.org/pdf/2312.14259v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14254v1","updated":"2023-12-21T19:12:59Z","published":"2023-12-21T19:12:59Z","title":"Contextual Feature Selection with Conditional Stochastic Gates","summary":" We study the problem of contextual feature selection, where the goal is to\nlearn a predictive function while identifying subsets of informative features\nconditioned on specific contexts. Towards this goal, we generalize the recently\nproposed stochastic gates (STG) Yamada et al. [2020] by modeling the\nprobabilistic gates as conditional Bernoulli variables whose parameters are\npredicted based on the contextual variables. Our new scheme, termed\nconditional-STG (c-STG), comprises two networks: a hypernetwork that\nestablishes the mapping between contextual variables and probabilistic feature\nselection parameters and a prediction network that maps the selected feature to\nthe response variable. Training the two networks simultaneously ensures the\ncomprehensive incorporation of context and feature selection within a unified\nmodel. We provide a theoretical analysis to examine several properties of the\nproposed framework. Importantly, our model leads to improved flexibility and\nadaptability of feature selection and, therefore, can better capture the\nnuances and variations in the data. We apply c-STG to simulated and real-world\ndatasets, including healthcare, housing, and neuroscience, and demonstrate that\nit effectively selects contextually meaningful features, thereby enhancing\npredictive performance and interpretability.\n","authors":["Ram Dyuthi Sristi","Ofir Lindenbaum","Maria Lavzin","Jackie Schiller","Gal Mishne","Hadas Benisty"],"pdf_url":"https://arxiv.org/pdf/2312.14254v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14249v1","updated":"2023-12-21T19:06:34Z","published":"2023-12-21T19:06:34Z","title":"GenoCraft: A Comprehensive, User-Friendly Web-Based Platform for\n High-Throughput Omics Data Analysis and Visualization","summary":" The surge in high-throughput omics data has reshaped the landscape of\nbiological research, underlining the need for powerful, user-friendly data\nanalysis and interpretation tools. This paper presents GenoCraft, a web-based\ncomprehensive software solution designed to handle the entire pipeline of omics\ndata processing. GenoCraft offers a unified platform featuring advanced\nbioinformatics tools, covering all aspects of omics data analysis. It\nencompasses a range of functionalities, such as normalization, quality control,\ndifferential analysis, network analysis, pathway analysis, and diverse\nvisualization techniques. This software makes state-of-the-art omics data\nanalysis more accessible to a wider range of users. With GenoCraft, researchers\nand data scientists have access to an array of cutting-edge bioinformatics\ntools under a user-friendly interface, making it a valuable resource for\nmanaging and analyzing large-scale omics data. The API with an interactive web\ninterface is publicly available at https://genocraft.stanford. edu/. We also\nrelease all the codes in https://github.com/futianfan/GenoCraft.\n","authors":["Yingzhou Lu","Minjie Shen","Yue Zhao","Chenhao Li","Fan Meng","Xiao Wang","David Herrington","Yue Wang","Tim Fu","Capucine Van Rechem"],"pdf_url":"https://arxiv.org/pdf/2312.14249v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14247v1","updated":"2023-12-21T19:02:27Z","published":"2023-12-21T19:02:27Z","title":"Deep Reinforcement Learning Based Placement for Integrated Access\n Backhauling in UAV-Assisted Wireless Networks","summary":" The advent of fifth generation (5G) networks has opened new avenues for\nenhancing connectivity, particularly in challenging environments like remote\nareas or disaster-struck regions. Unmanned aerial vehicles (UAVs) have been\nidentified as a versatile tool in this context, particularly for improving\nnetwork performance through the Integrated access and backhaul (IAB) feature of\n5G. However, existing approaches to UAV-assisted network enhancement face\nlimitations in dynamically adapting to varying user locations and network\ndemands. This paper introduces a novel approach leveraging deep reinforcement\nlearning (DRL) to optimize UAV placement in real-time, dynamically adjusting to\nchanging network conditions and user requirements. Our method focuses on the\nintricate balance between fronthaul and backhaul links, a critical aspect often\noverlooked in current solutions. The unique contribution of this work lies in\nits ability to autonomously position UAVs in a way that not only ensures robust\nconnectivity to ground users but also maintains seamless integration with\ncentral network infrastructure. Through various simulated scenarios, we\ndemonstrate how our approach effectively addresses these challenges, enhancing\ncoverage and network performance in critical areas. This research fills a\nsignificant gap in UAV-assisted 5G networks, providing a scalable and adaptive\nsolution for future mobile networks.\n","authors":["Yuhui Wang","Junaid Farooq"],"pdf_url":"https://arxiv.org/pdf/2312.14247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.11267v2","updated":"2023-12-21T19:00:00Z","published":"2022-06-22T18:00:00Z","title":"Neural Implicit Manifold Learning for Topology-Aware Density Estimation","summary":" Natural data observed in $\\mathbb{R}^n$ is often constrained to an\n$m$-dimensional manifold $\\mathcal{M}$, where $m < n$. This work focuses on the\ntask of building theoretically principled generative models for such data.\nCurrent generative models learn $\\mathcal{M}$ by mapping an $m$-dimensional\nlatent variable through a neural network $f_\\theta: \\mathbb{R}^m \\to\n\\mathbb{R}^n$. These procedures, which we call pushforward models, incur a\nstraightforward limitation: manifolds cannot in general be represented with a\nsingle parameterization, meaning that attempts to do so will incur either\ncomputational instability or the inability to learn probability densities\nwithin the manifold. To remedy this problem, we propose to model $\\mathcal{M}$\nas a neural implicit manifold: the set of zeros of a neural network. We then\nlearn the probability density within $\\mathcal{M}$ with a constrained\nenergy-based model, which employs a constrained variant of Langevin dynamics to\ntrain and sample from the learned manifold. In experiments on synthetic and\nnatural data, we show that our model can learn manifold-supported distributions\nwith complex topologies more accurately than pushforward models.\n","authors":["Brendan Leigh Ross","Gabriel Loaiza-Ganem","Anthony L. Caterini","Jesse C. Cresswell"],"pdf_url":"https://arxiv.org/pdf/2206.11267v2.pdf","comment":"Accepted to TMLR in 2023. Code:\n https://github.com/layer6ai-labs/implicit-manifolds"},{"id":"http://arxiv.org/abs/2312.14237v1","updated":"2023-12-21T18:58:41Z","published":"2023-12-21T18:58:41Z","title":"AI-Lorenz: A physics-data-driven framework for black-box and gray-box\n identification of chaotic systems with symbolic regression","summary":" Discovering mathematical models that characterize the observed behavior of\ndynamical systems remains a major challenge, especially for systems in a\nchaotic regime. The challenge is even greater when the physics underlying such\nsystems is not yet understood, and scientific inquiry must solely rely on\nempirical data. Driven by the need to fill this gap, we develop a framework\nthat learns mathematical expressions modeling complex dynamical behaviors by\nidentifying differential equations from noisy and sparse observable data. We\ntrain a small neural network to learn the dynamics of a system, its rate of\nchange in time, and missing model terms, which are used as input for a symbolic\nregression algorithm to autonomously distill the explicit mathematical terms.\nThis, in turn, enables us to predict the future evolution of the dynamical\nbehavior. The performance of this framework is validated by recovering the\nright-hand sides and unknown terms of certain complex, chaotic systems such as\nthe well-known Lorenz system, a six-dimensional hyperchaotic system, and the\nnon-autonomous Sprott chaotic system, and comparing them with their known\nanalytical expressions.\n","authors":["Mario De Florio","Ioannis G. Kevrekidis","George Em Karniadakis"],"pdf_url":"https://arxiv.org/pdf/2312.14237v1.pdf","comment":"28 pages, 15 figures, 9 tables"}],"Multimedia":[{"id":"http://arxiv.org/abs/2312.13567v1","updated":"2023-12-21T04:31:18Z","published":"2023-12-21T04:31:18Z","title":"Fine-grained Disentangled Representation Learning for Multimodal Emotion\n Recognition","summary":" Multimodal emotion recognition (MMER) is an active research field that aims\nto accurately recognize human emotions by fusing multiple perceptual\nmodalities. However, inherent heterogeneity across modalities introduces\ndistribution gaps and information redundancy, posing significant challenges for\nMMER. In this paper, we propose a novel fine-grained disentangled\nrepresentation learning (FDRL) framework to address these challenges.\nSpecifically, we design modality-shared and modality-private encoders to\nproject each modality into modality-shared and modality-private subspaces,\nrespectively. In the shared subspace, we introduce a fine-grained alignment\ncomponent to learn modality-shared representations, thus capturing modal\nconsistency. Subsequently, we tailor a fine-grained disparity component to\nconstrain the private subspaces, thereby learning modality-private\nrepresentations and enhancing their diversity. Lastly, we introduce a\nfine-grained predictor component to ensure that the labels of the output\nrepresentations from the encoders remain unchanged. Experimental results on the\nIEMOCAP dataset show that FDRL outperforms the state-of-the-art methods,\nachieving 78.34% and 79.44% on WAR and UAR, respectively.\n","authors":["Haoqin Sun","Shiwan Zhao","Xuechen Wang","Wenjia Zeng","Yong Chen","Yong Qin"],"pdf_url":"https://arxiv.org/pdf/2312.13567v1.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.07661v2","updated":"2023-12-21T12:08:55Z","published":"2023-12-12T19:00:04Z","title":"CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor","summary":" Existing open-vocabulary image segmentation methods require a fine-tuning\nstep on mask annotations and/or image-text datasets. Mask labels are\nlabor-intensive, which limits the number of categories in segmentation\ndatasets. As a result, the open-vocabulary capacity of pre-trained VLMs is\nseverely reduced after fine-tuning. However, without fine-tuning, VLMs trained\nunder weak image-text supervision tend to make suboptimal mask predictions when\nthere are text queries referring to non-existing concepts in the image. To\nalleviate these issues, we introduce a novel recurrent framework that\nprogressively filters out irrelevant texts and enhances mask quality without\ntraining efforts. The recurrent unit is a two-stage segmenter built upon a VLM\nwith frozen weights. Thus, our model retains the VLM's broad vocabulary space\nand strengthens its segmentation capability. Experimental results show that our\nmethod outperforms not only the training-free counterparts, but also those\nfine-tuned with millions of additional data samples, and sets new\nstate-of-the-art records for both zero-shot semantic and referring image\nsegmentation tasks. Specifically, we improve the current record by 28.8, 16.0,\nand 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.\n","authors":["Shuyang Sun","Runjia Li","Philip Torr","Xiuye Gu","Siyang Li"],"pdf_url":"https://arxiv.org/pdf/2312.07661v2.pdf","comment":"Project page: https://torrvision.com/clip_as_rnn/"}]},"2023-12-22T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2312.14890v1","updated":"2023-12-22T18:07:44Z","published":"2023-12-22T18:07:44Z","title":"NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language\n Models via Complexity Classes","summary":" Complex reasoning ability is one of the most important features of current\nLLMs, which has also been leveraged to play an integral role in complex\ndecision-making tasks. Therefore, the investigation into the reasoning\ncapabilities of Large Language Models (LLMs) is critical: numerous benchmarks\nhave been established to assess the reasoning abilities of LLMs. However,\ncurrent benchmarks are inadequate in offering a rigorous evaluation of the full\nextent of reasoning abilities that LLMs are capable of achieving. They are also\nprone to the risk of overfitting, as these benchmarks, being publicly\naccessible and static, allow models to potentially tailor their responses to\nspecific benchmark metrics, thereby inflating their performance. Addressing\nthese limitations, our research introduces a new benchmark, named NPHardEval.\nThis benchmark is designed to evaluate the reasoning abilities of LLMs across a\nbroad spectrum of 900 algorithmic questions, extending up to the NP-Hard\ncomplexity class. These questions are meticulously chosen to represent a wide\nrange of complexity class below the NP-hard complexity class, offering a\nrigorous measure of the reasoning ability of LLMs. Through this study, we shed\nlight on the current state of reasoning in LLMs, providing an objective and\nrigorous perspective through the comparison of LLMs' performance across complex\nclasses. Moreover, this benchmark is designed with a dynamic update mechanism,\nwhere the datapoints are refreshed on a monthly basis. Such regular updates\nplay a crucial role in mitigating the risk of LLMs overfitting to the\nbenchmark, promoting a more accurate and reliable assessment of their reasoning\ncapabilities. The benchmark dataset and code of NPHardEval are available at\nhttps://github.com/casmlab/NPHardEval.\n","authors":["Lizhou Fan","Wenyue Hua","Lingyao Li","Haoyang Ling","Yongfeng Zhang","Libby Hemphill"],"pdf_url":"https://arxiv.org/pdf/2312.14890v1.pdf","comment":"22 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2312.14877v1","updated":"2023-12-22T17:57:29Z","published":"2023-12-22T17:57:29Z","title":"Robust Knowledge Extraction from Large Language Models using Social\n Choice Theory","summary":" Large-language models (LLMs) have the potential to support a wide range of\napplications like conversational agents, creative writing, text improvement,\nand general query answering. However, they are ill-suited for query answering\nin high-stake domains like medicine because they generate answers at random and\ntheir answers are typically not robust - even the same query can result in\ndifferent answers when prompted multiple times. In order to improve the\nrobustness of LLM queries, we propose using ranking queries repeatedly and to\naggregate the queries using methods from social choice theory. We study ranking\nqueries in diagnostic settings like medical and fault diagnosis and discuss how\nthe Partial Borda Choice function from the literature can be applied to merge\nmultiple query results. We discuss some additional interesting properties in\nour setting and evaluate the robustness of our approach empirically.\n","authors":["Nico Potyka","Yuqicheng Zhu","Yunjie He","Evgeny Kharlamov","Steffen Staab"],"pdf_url":"https://arxiv.org/pdf/2312.14877v1.pdf","comment":"Accepted by AAMAS 2024 as a full paper"},{"id":"http://arxiv.org/abs/2306.15774v2","updated":"2023-12-22T17:53:02Z","published":"2023-06-27T19:54:30Z","title":"Next Steps for Human-Centered Generative AI: A Technical Perspective","summary":" Through iterative, cross-disciplinary discussions, we define and propose\nnext-steps for Human-centered Generative AI (HGAI). We contribute a\ncomprehensive research agenda that lays out future directions of Generative AI\nspanning three levels: aligning with human values; assimilating human intents;\nand augmenting human abilities. By identifying these next-steps, we intend to\ndraw interdisciplinary research teams to pursue a coherent set of emergent\nideas in HGAI, focusing on their interested topics while maintaining a coherent\nbig picture of the future work landscape.\n","authors":["Xiang 'Anthony' Chen","Jeff Burke","Ruofei Du","Matthew K. Hong","Jennifer Jacobs","Philippe Laban","Dingzeyu Li","Nanyun Peng","Karl D. D. Willis","Chien-Sheng Wu","Bolei Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.15774v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14870v1","updated":"2023-12-22T17:46:36Z","published":"2023-12-22T17:46:36Z","title":"Numerical Reasoning for Financial Reports","summary":" Financial reports offer critical insights into a company's operations, yet\ntheir extensive length typically spanning 30 40 pages poses challenges for\nswift decision making in dynamic markets. To address this, we leveraged\nfinetuned Large Language Models (LLMs) to distill key indicators and\noperational metrics from these reports basis questions from the user. We\ndevised a method to locate critical data, and leverage the FinQA dataset to\nfine-tune both Llama-2 7B and T5 models for customized question answering. We\nachieved results comparable to baseline on the final numerical answer, a\ncompetitive accuracy in numerical reasoning and calculation.\n","authors":["Abhinav Arun","Ashish Dhiman","Mehul Soni","Yibei Hu"],"pdf_url":"https://arxiv.org/pdf/2312.14870v1.pdf","comment":"10 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2312.14867v1","updated":"2023-12-22T17:45:19Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIESCORE,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIESCORE leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image\ntasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of\n0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in\nevaluating synthetic images. (3) VIESCORE achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIESCORE shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14862v1","updated":"2023-12-22T17:34:47Z","published":"2023-12-22T17:34:47Z","title":"YAYI 2: Multilingual Open-Source Large Language Models","summary":" As the latest advancements in natural language processing, large language\nmodels (LLMs) have achieved human-level language understanding and generation\nabilities in many real-world tasks, and even have been regarded as a potential\npath to the artificial general intelligence. To better facilitate research on\nLLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been\nproposed and gained comparable performances to proprietary models. However,\nthese models are primarily designed for English scenarios and exhibit poor\nperformances in Chinese contexts. In this technical report, we propose YAYI 2,\nincluding both base and chat models, with 30 billion parameters. YAYI 2 is\npre-trained from scratch on a multilingual corpus which contains 2.65 trillion\ntokens filtered by our pre-training data processing pipeline. The base model is\naligned with human values through supervised fine-tuning with millions of\ninstructions and reinforcement learning from human feedback. Extensive\nexperiments on multiple benchmarks, such as MMLU and CMMLU, consistently\ndemonstrate that the proposed YAYI 2 outperforms other similar sized\nopen-source models.\n","authors":["Yin Luo","Qingchao Kong","Nan Xu","Jia Cao","Bao Hao","Baoyu Qu","Bo Chen","Chao Zhu","Chenyang Zhao","Donglei Zhang","Fan Feng","Feifei Zhao","Hailong Sun","Hanxuan Yang","Haojun Pan","Hongyu Liu","Jianbin Guo","Jiangtao Du","Jingyi Wang","Junfeng Li","Lei Sun","Liduo Liu","Lifeng Dong","Lili Liu","Lin Wang","Liwen Zhang","Minzheng Wang","Pin Wang","Ping Yu","Qingxiao Li","Rui Yan","Rui Zou","Ruiqun Li","Taiwen Huang","Xiaodong Wang","Xiaofei Wu","Xin Peng","Xina Zhang","Xing Fang","Xinglin Xiao","Yanni Hao","Yao Dong","Yigang Wang","Ying Liu","Yongyu Jiang","Yungan Wang","Yuqi Wang","Zhangsheng Wang","Zhaoxin Yu","Zhen Luo","Wenji Mao","Lei Wang","Dajun Zeng"],"pdf_url":"https://arxiv.org/pdf/2312.14862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14845v1","updated":"2023-12-22T17:19:33Z","published":"2023-12-22T17:19:33Z","title":"On the Use of Metaphor Translation in Psychiatry","summary":" Providing mental healthcare to individuals with limited English proficiency\n(LEP) remains a pressing problem within psychiatry. Because the majority of\nindividuals trained in providing psychiatric care are English speakers, the\nquality of mental healthcare given to LEP patients is significantly lower than\nthat provided for English speakers. The provision of mental healthcare is\ncontingent on communication and understanding between the patient and\nhealthcare provider, much more so than in the realm of physical healthcare, and\nEnglish speakers are often unable to comprehend figurative language such as\nmetaphors used by LEPs. Hence, Figurative Language Translation is invaluable to\nproviding equitable psychiatric care. Now, metaphor has been shown to be\nparamount in both identifying individuals struggling with mental problems and\nhelping those individuals understand and communicate their experiences.\nTherefore, this paper aims to survey the potential of Machine Translation for\nproviding equitable psychiatric healthcare and highlights the need for further\nresearch on the transferability of existing machine and metaphor translation\nresearch in the domain of psychiatry.\n","authors":["Lois Wong"],"pdf_url":"https://arxiv.org/pdf/2312.14845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14798v1","updated":"2023-12-22T16:16:15Z","published":"2023-12-22T16:16:15Z","title":"Semantic Parsing for Complex Data Retrieval: Targeting Query Plans vs.\n SQL for No-Code Access to Relational Databases","summary":" Large Language Models (LLMs) have spurred progress in text-to-SQL, the task\nof generating SQL queries from natural language questions based on a given\ndatabase schema. Despite the declarative nature of SQL, it continues to be a\ncomplex programming language. In this paper, we investigate the potential of an\nalternative query language with simpler syntax and modular specification of\ncomplex queries. The purpose is to create a query language that can be learned\nmore easily by modern neural semantic parsing architectures while also enabling\nnon-programmers to better assess the validity of the query plans produced by an\ninteractive query plan assistant.\n The proposed alternative query language is called Query Plan Language (QPL).\nIt is designed to be modular and can be translated into a restricted form of\nSQL Common Table Expressions (CTEs). The aim of QPL is to make complex data\nretrieval accessible to non-programmers by allowing users to express their\nquestions in natural language while also providing an easier-to-verify target\nlanguage. The paper demonstrates how neural LLMs can benefit from QPL's\nmodularity to generate complex query plans in a compositional manner. This\ninvolves a question decomposition strategy and a planning stage.\n We conduct experiments on a version of the Spider text-to-SQL dataset that\nhas been converted to QPL. The hierarchical structure of QPL programs enables\nus to measure query complexity naturally. Based on this assessment, we identify\nthe low accuracy of existing text-to-SQL systems on complex compositional\nqueries. We present ways to address the challenge of complex queries in an\niterative, user-controlled manner, using fine-tuned LLMs and a variety of\nprompting strategies in a compositional manner.\n","authors":["Ben Eyal","Amir Bachar","Ophir Haroche","Michael Elhadad"],"pdf_url":"https://arxiv.org/pdf/2312.14798v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2310.13575"},{"id":"http://arxiv.org/abs/2312.14769v1","updated":"2023-12-22T15:38:13Z","published":"2023-12-22T15:38:13Z","title":"Large Language Model (LLM) Bias Index -- LLMBI","summary":" The Large Language Model Bias Index (LLMBI) is a pioneering approach designed\nto quantify and address biases inherent in large language models (LLMs), such\nas GPT-4. We recognise the increasing prevalence and impact of LLMs across\ndiverse sectors. This research introduces a novel metric, LLMBI, to\nsystematically measure and mitigate biases potentially skewing model responses.\nWe formulated LLMBI using a composite scoring system incorporating multiple\ndimensions of bias, including but not limited to age, gender, and racial\nbiases.\n To operationalise this metric, we engaged in a multi-step process involving\ncollecting and annotating LLM responses, applying sophisticated Natural\nLanguage Processing (NLP) techniques for bias detection, and computing the\nLLMBI score through a specially crafted mathematical formula. The formula\nintegrates weighted averages of various bias dimensions, a penalty for dataset\ndiversity deficiencies, and a correction for sentiment biases. Our empirical\nanalysis, conducted using responses from OpenAI's API, employs advanced\nsentiment analysis as a representative method for bias detection.\n The research reveals LLMs, whilst demonstrating impressive capabilities in\ntext generation, exhibit varying degrees of bias across different dimensions.\nLLMBI provides a quantifiable measure to compare biases across models and over\ntime, offering a vital tool for systems engineers, researchers and regulators\nin enhancing the fairness and reliability of LLMs. It highlights the potential\nof LLMs in mimicking unbiased human-like responses. Additionally, it\nunderscores the necessity of continuously monitoring and recalibrating such\nmodels to align with evolving societal norms and ethical standards.\n","authors":["Abiodun Finbarrs Oketunji","Muhammad Anas","Deepthi Saina"],"pdf_url":"https://arxiv.org/pdf/2312.14769v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12794v2","updated":"2023-12-22T15:00:26Z","published":"2023-10-19T14:50:51Z","title":"Are Structural Concepts Universal in Transformer Language Models?\n Towards Interpretable Cross-Lingual Generalization","summary":" Large language models (LLMs) have exhibited considerable cross-lingual\ngeneralization abilities, whereby they implicitly transfer knowledge across\nlanguages. However, the transfer is not equally successful for all languages,\nespecially for low-resource ones, which poses an ongoing challenge. It is\nunclear whether we have reached the limits of implicit cross-lingual\ngeneralization and if explicit knowledge transfer is viable. In this paper, we\ninvestigate the potential for explicitly aligning conceptual correspondence\nbetween languages to enhance cross-lingual generalization. Using the syntactic\naspect of language as a testbed, our analyses of 43 languages reveal a high\ndegree of alignability among the spaces of structural concepts within each\nlanguage for both encoder-only and decoder-only LLMs. We then propose a\nmeta-learning-based method to learn to align conceptual spaces of different\nlanguages, which facilitates zero-shot and few-shot generalization in concept\nclassification and also offers insights into the cross-lingual in-context\nlearning phenomenon. Experiments on syntactic analysis tasks show that our\napproach achieves competitive results with state-of-the-art methods and narrows\nthe performance gap between languages, particularly benefiting those with\nlimited resources.\n","authors":["Ningyu Xu","Qi Zhang","Jingting Ye","Menghan Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.12794v2.pdf","comment":"Findings of EMNLP 2023 (Camera-Ready)"},{"id":"http://arxiv.org/abs/2305.19228v2","updated":"2023-12-22T14:49:34Z","published":"2023-05-30T17:20:25Z","title":"Unsupervised Melody-to-Lyric Generation","summary":" Automatic melody-to-lyric generation is a task in which song lyrics are\ngenerated to go with a given melody. It is of significant practical interest\nand more challenging than unconstrained lyric generation as the music imposes\nadditional constraints onto the lyrics. The training data is limited as most\nsongs are copyrighted, resulting in models that underfit the complicated\ncross-modal relationship between melody and lyrics. In this work, we propose a\nmethod for generating high-quality lyrics without training on any aligned\nmelody-lyric data. Specifically, we design a hierarchical lyric generation\nframework that first generates a song outline and second the complete lyrics.\nThe framework enables disentanglement of training (based purely on text) from\ninference (melody-guided text generation) to circumvent the shortage of\nparallel data.\n We leverage the segmentation and rhythm alignment between melody and lyrics\nto compile the given melody into decoding constraints as guidance during\ninference. The two-step hierarchical design also enables content control via\nthe lyric outline, a much-desired feature for democratizing collaborative song\ncreation. Experimental results show that our model can generate high-quality\nlyrics that are more on-topic, singable, intelligible, and coherent than strong\nbaselines, for example SongMASS, a SOTA model trained on a parallel dataset,\nwith a 24% relative overall quality improvement based on human ratings.\n","authors":["Yufei Tian","Anjali Narayan-Chen","Shereen Oraby","Alessandra Cervone","Gunnar Sigurdsson","Chenyang Tao","Wenbo Zhao","Yiwen Chen","Tagyoung Chung","Jing Huang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2305.19228v2.pdf","comment":"ACL 2023. arXiv admin note: substantial text overlap with\n arXiv:2305.07760"},{"id":"http://arxiv.org/abs/2312.14737v1","updated":"2023-12-22T14:46:02Z","published":"2023-12-22T14:46:02Z","title":"Computational Semantics and Evaluation Benchmark for Interrogative\n Sentences via Combinatory Categorial Grammar","summary":" We present a compositional semantics for various types of polar questions and\nwh-questions within the framework of Combinatory Categorial Grammar (CCG). To\nassess the explanatory power of our proposed analysis, we introduce a\nquestion-answering dataset QSEM specifically designed to evaluate the semantics\nof interrogative sentences. We implement our analysis using existing CCG\nparsers and conduct evaluations using the dataset. Through the evaluation, we\nhave obtained annotated data with CCG trees and semantic representations for\nabout half of the samples included in QSEM. Furthermore, we discuss the\ndiscrepancy between the theoretical capacity of CCG and the capabilities of\nexisting CCG parsers.\n","authors":["Hayate Funakura","Koji Mineshima"],"pdf_url":"https://arxiv.org/pdf/2312.14737v1.pdf","comment":"11 pages, to appear in the Proceedings of PACLIC37"},{"id":"http://arxiv.org/abs/2311.12420v3","updated":"2023-12-22T14:07:16Z","published":"2023-11-21T08:20:39Z","title":"How Far Have We Gone in Vulnerability Detection Using Large Language\n Models","summary":" As software becomes increasingly complex and prone to vulnerabilities,\nautomated vulnerability detection is critically important, yet challenging.\nGiven the significant successes of large language models (LLMs) in various\ntasks, there is growing anticipation of their efficacy in vulnerability\ndetection. However, a quantitative understanding of their potential in\nvulnerability detection is still missing. To bridge this gap, we introduce a\ncomprehensive vulnerability benchmark VulBench. This benchmark aggregates\nhigh-quality data from a wide range of CTF (Capture-the-Flag) challenges and\nreal-world applications, with annotations for each vulnerable function\ndetailing the vulnerability type and its root cause. Through our experiments\nencompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models\nand static analyzers, we find that several LLMs outperform traditional deep\nlearning approaches in vulnerability detection, revealing an untapped potential\nin LLMs. This work contributes to the understanding and utilization of LLMs for\nenhanced software security.\n","authors":["Zeyu Gao","Hao Wang","Yuchen Zhou","Wenyu Zhu","Chao Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.12420v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14708v1","updated":"2023-12-22T14:06:54Z","published":"2023-12-22T14:06:54Z","title":"Balancing the Style-Content Trade-Off in Sentiment Transfer Using\n Polarity-Aware Denoising","summary":" Text sentiment transfer aims to flip the sentiment polarity of a sentence\n(positive to negative or vice versa) while preserving its sentiment-independent\ncontent. Although current models show good results at changing the sentiment,\ncontent preservation in transferred sentences is insufficient. In this paper,\nwe present a sentiment transfer model based on polarity-aware denoising, which\naccurately controls the sentiment attributes in generated text, preserving the\ncontent to a great extent and helping to balance the style-content trade-off.\nOur proposed model is structured around two key stages in the sentiment\ntransfer process: better representation learning using a shared encoder and\nsentiment-controlled generation using separate sentiment-specific decoders.\nEmpirical results show that our methods outperforms state-of-the-art baselines\nin terms of content preservation while staying competitive in terms of style\ntransfer accuracy and fluency.\n","authors":["Sourabrata Mukherjee","Zdeněk Kasner","Ondřej Dušek"],"pdf_url":"https://arxiv.org/pdf/2312.14708v1.pdf","comment":"Published in 25th International Conference on Text, Speech and\n Dialogue (TSD 2022)"},{"id":"http://arxiv.org/abs/2305.14171v3","updated":"2023-12-22T13:27:11Z","published":"2023-05-23T15:43:04Z","title":"In-Context Probing: Toward Building Robust Classifiers via Probing Large\n Language Models","summary":" Large language models are able to learn new tasks in context, where they are\nprovided with instructions and a few annotated examples. However, the\neffectiveness of in-context learning is dependent on the provided context, and\nthe performance on a downstream task can vary considerably, depending on the\ninstruction. Importantly, such dependency on the context can surface in\nunpredictable ways, e.g., a seemingly more informative instruction might lead\nto a worse performance. In this paper, we propose an alternative approach,\nwhich we term In-Context Probing (ICP). Similar to in-context learning, we\ncontextualize the representation of the input with an instruction, but instead\nof decoding the output prediction, we probe the contextualized representation\nto predict the label. Through a series of experiments on a diverse set of\nclassification tasks, we show that in-context probing is significantly more\nrobust to changes in instructions. We further show that ICP performs\ncompetitive or superior to finetuning and can be particularly helpful to build\nclassifiers on top of smaller models, with less than a hundred training\nexamples.\n","authors":["Afra Amini","Massimiliano Ciaramita"],"pdf_url":"https://arxiv.org/pdf/2305.14171v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05782v2","updated":"2023-12-22T13:04:48Z","published":"2023-10-09T15:15:05Z","title":"Aligning Language Models with Human Preferences via a Bayesian Approach","summary":" In the quest to advance human-centric natural language generation (NLG)\nsystems, ensuring alignment between NLG models and human preferences is\ncrucial. For this alignment, current popular methods leverage a reinforcement\nlearning (RL) approach with a reward model trained on feedback from humans.\nHowever, inherent disagreements due to the subjective nature of human\npreferences pose a significant challenge for training the reward model,\nresulting in a deterioration of the NLG performance. To tackle this issue,\nprevious approaches typically rely on majority voting or averaging to\nconsolidate multiple inconsistent preferences into a merged one. Although\nstraightforward to understand and execute, such methods suffer from an\ninability to capture the nuanced degrees of disaggregation among humans and may\nonly represent a specialized subset of individuals, thereby lacking the ability\nto quantitatively disclose the universality of human preferences. To address\nthis challenge, this paper proposes a novel approach, which employs a Bayesian\nframework to account for the distribution of disagreements among human\npreferences as training a preference model, and names it as d-PM. Besides,\nconsidering the RL strategy's inefficient and complex training process over the\ntraining efficiency, we further propose utilizing the contrastive learning\nstrategy to train the NLG model with the preference scores derived from the\nd-PM model. Extensive experiments on two human-centric NLG tasks, i.e.,\nemotional support conversation and integrity \"Rule-of-Thumb\" generation, show\nthat our method consistently exceeds previous SOTA models in both automatic and\nhuman evaluations.\n","authors":["Jiashuo Wang","Haozhao Wang","Shichao Sun","Wenjie Li"],"pdf_url":"https://arxiv.org/pdf/2310.05782v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14646v1","updated":"2023-12-22T12:28:29Z","published":"2023-12-22T12:28:29Z","title":"Collaborative Synthesis of Patient Records through Multi-Visit Health\n State Inference","summary":" Electronic health records (EHRs) have become the foundation of machine\nlearning applications in healthcare, while the utility of real patient records\nis often limited by privacy and security concerns. Synthetic EHR generation\nprovides an additional perspective to compensate for this limitation. Most\nexisting methods synthesize new records based on real EHR data, without\nconsideration of different types of events in EHR data, which cannot control\nthe event combinations in line with medical common sense. In this paper, we\npropose MSIC, a Multi-visit health Status Inference model for Collaborative EHR\nsynthesis to address these limitations. First, we formulate the synthetic EHR\ngeneration process as a probabilistic graphical model and tightly connect\ndifferent types of events by modeling the latent health states. Then, we derive\na health state inference method tailored for the multi-visit scenario to\neffectively utilize previous records to synthesize current and future records.\nFurthermore, we propose to generate medical reports to add textual descriptions\nfor each medical event, providing broader applications for synthesized EHR\ndata. For generating different paragraphs in each visit, we incorporate a\nmulti-generator deliberation framework to collaborate the message passing of\nmultiple generators and employ a two-phase decoding strategy to generate\nhigh-quality reports. Our extensive experiments on the widely used benchmarks,\nMIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results\non the quality of synthetic data while maintaining low privacy risks.\n","authors":["Hongda Sun","Hongzhan Lin","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2312.14646v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14609v1","updated":"2023-12-22T11:12:45Z","published":"2023-12-22T11:12:45Z","title":"BLSTM-Based Confidence Estimation for End-to-End Speech Recognition","summary":" Confidence estimation, in which we estimate the reliability of each\nrecognized token (e.g., word, sub-word, and character) in automatic speech\nrecognition (ASR) hypotheses and detect incorrectly recognized tokens, is an\nimportant function for developing ASR applications. In this study, we perform\nconfidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR\nsystems show high performance (e.g., around 5% token error rates) for various\nASR tasks. In such situations, confidence estimation becomes difficult since we\nneed to detect infrequent incorrect tokens from mostly correct token sequences.\nTo tackle this imbalanced dataset problem, we employ a bidirectional long\nshort-term memory (BLSTM)-based model as a strong binary-class\n(correct/incorrect) sequence labeler that is trained with a class balancing\nobjective. We experimentally confirmed that, by utilizing several types of ASR\ndecoding scores as its auxiliary features, the model steadily shows high\nconfidence estimation performance under highly imbalanced settings. We also\nconfirmed that the BLSTM-based model outperforms Transformer-based confidence\nestimation models, which greatly underestimate incorrect tokens.\n","authors":["Atsunori Ogawa","Naohiro Tawara","Takatomo Kano","Marc Delcroix"],"pdf_url":"https://arxiv.org/pdf/2312.14609v1.pdf","comment":"Accepted to ICASSP 2021"},{"id":"http://arxiv.org/abs/2312.14591v1","updated":"2023-12-22T10:29:43Z","published":"2023-12-22T10:29:43Z","title":"Reasons to Reject? Aligning Language Models with Judgments","summary":" As humans, we consistently engage in interactions with our peers and receive\nfeedback in the form of natural language. This language feedback allows us to\nreflect on our actions, maintain appropriate behavior, and rectify our errors.\nThe question arises naturally: can we use language feedback to align large\nlanguage models (LLMs)? In contrast to previous research that aligns LLMs with\nreward or preference data, we present the first systematic exploration of\nalignment through the lens of language feedback (i.e., judgment). We commence\nwith an in-depth investigation of potential methods that can be adapted for\naligning LLMs with judgments, revealing that these methods are unable to fully\ncapitalize on the judgments. To facilitate more effective utilization of\njudgments, we propose a novel framework, Contrastive Unlikelihood Training\n(CUT), that allows for fine-grained inappropriate content detection and\ncorrection based on judgments. Our offline alignment results show that, with\nmerely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B\nDaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The\nonline alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b)\nin an iterative fashion using model-specific judgment data, with a steady\nperformance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis\nfurther suggests that judgments exhibit greater potential than rewards for LLM\nalignment and warrant future research.\n","authors":["Weiwen Xu","Deng Cai","Zhisong Zhang","Wai Lam","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2312.14591v1.pdf","comment":"Our source codes and models are publicly available at\n https://github.com/wwxu21/CUT"},{"id":"http://arxiv.org/abs/2312.14590v1","updated":"2023-12-22T10:29:18Z","published":"2023-12-22T10:29:18Z","title":"SIG: Speaker Identification in Literature via Prompt-Based Generation","summary":" Identifying speakers of quotations in narratives is an important task in\nliterary analysis, with challenging scenarios including the out-of-domain\ninference for unseen speakers, and non-explicit cases where there are no\nspeaker mentions in surrounding context. In this work, we propose a simple and\neffective approach SIG, a generation-based method that verbalizes the task and\nquotation input based on designed prompt templates, which also enables easy\nintegration of other auxiliary tasks that further bolster the speaker\nidentification performance. The prediction can either come from direct\ngeneration by the model, or be determined by the highest generation probability\nof each speaker candidate. Based on our approach design, SIG supports\nout-of-domain evaluation, and achieves open-world classification paradigm that\nis able to accept any forms of candidate input. We perform both cross-domain\nevaluation and in-domain evaluation on PDNC, the largest dataset of this task,\nwhere empirical results suggest that SIG outperforms previous baselines of\ncomplicated designs, as well as the zero-shot ChatGPT, especially excelling at\nthose hard non-explicit scenarios by up to 17% improvement. Additional\nexperiments on another dataset WP further corroborate the efficacy of SIG.\n","authors":["Zhenlin Su","Liyan Xu","Jin Xu","Jiangnan Li","Mingdu Huangfu"],"pdf_url":"https://arxiv.org/pdf/2312.14590v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14557v1","updated":"2023-12-22T09:30:41Z","published":"2023-12-22T09:30:41Z","title":"Aurora:Activating Chinese chat capability for Mistral-8x7B sparse\n Mixture-of-Experts through Instruction-Tuning","summary":" Existing research has demonstrated that refining large language models (LLMs)\nthrough the utilization of machine-generated instruction-following data\nempowers these models to exhibit impressive zero-shot capabilities for novel\ntasks, without requiring human-authored instructions. In this paper, we\nsystematically investigate, preprocess, and integrate three Chinese\ninstruction-following datasets with the aim of enhancing the Chinese\nconversational capabilities of Mixtral-8x7B sparse Mixture-of-Experts model.\nThrough instruction fine-tuning on this carefully processed dataset, we\nsuccessfully construct the Mixtral-8x7B sparse Mixture-of-Experts model named\n\"Aurora.\" To assess the performance of Aurora, we utilize three widely\nrecognized benchmark tests: C-Eval, MMLU, and CMMLU. Empirical studies validate\nthe effectiveness of instruction fine-tuning applied to Mixtral-8x7B sparse\nMixture-of-Experts model. This work is pioneering in the execution of\ninstruction fine-tuning on a sparse expert-mixed model, marking a significant\nbreakthrough in enhancing the capabilities of this model architecture. Our\ncode, data and model are publicly available at:\nhttps://github.com/WangRongsheng/Aurora\n","authors":["Rongsheng Wang","Haoming Chen","Ruizhe Zhou","Yaofei Duan","Kunyan Cai","Han Ma","Jiaxi Cui","Jian Li","Patrick Cheong-Iao Pang","Yapeng Wang","Tao Tan"],"pdf_url":"https://arxiv.org/pdf/2312.14557v1.pdf","comment":"10 pages, 2 figures"},{"id":"http://arxiv.org/abs/2312.14542v1","updated":"2023-12-22T09:13:24Z","published":"2023-12-22T09:13:24Z","title":"Automatic Data Retrieval for Cross Lingual Summarization","summary":" Cross-lingual summarization involves the summarization of text written in one\nlanguage to a different one. There is a body of research addressing\ncross-lingual summarization from English to other European languages. In this\nwork, we aim to perform cross-lingual summarization from English to Hindi. We\npropose pairing up the coverage of newsworthy events in textual and video\nformat can prove to be helpful for data acquisition for cross lingual\nsummarization. We analyze the data and propose methods to match articles to\nvideo descriptions that serve as document and summary pairs. We also outline\nfiltering methods over reasonable thresholds to ensure the correctness of the\nsummaries. Further, we make available 28,583 mono and cross-lingual\narticle-summary pairs https://github.com/tingc9/Cross-Sum-News-Aligned. We also\nbuild and analyze multiple baselines on the collected data and report error\nanalysis.\n","authors":["Nikhilesh Bhatnagar","Ashok Urlana","Vandan Mujadia","Pruthwik Mishra","Dipti Misra Sharma"],"pdf_url":"https://arxiv.org/pdf/2312.14542v1.pdf","comment":"6 pages, 6 tables, 2 figures, conference: ICON 2023"},{"id":"http://arxiv.org/abs/2312.14504v1","updated":"2023-12-22T08:08:45Z","published":"2023-12-22T08:08:45Z","title":"Theory of Hallucinations based on Equivariance","summary":" Equivariance is an important feature in machine learning, including language\nmodels. It ensures that any sequences of phrases with the same meanings are\ninterpreted consistently. For example, the sentence 'There is a cat on the\ntable' should be interpreted by language models as it is, regardless of\nvariations in its token-level expression. Building on this insight, I propose a\nnew theory suggesting that insufficient equivariance in language models can\nlead to hallucinations. According to this theory, which is both intuitive and\nnovel, language models trained on relatively small datasets tend to\nmisinterpret input texts and/or generate incorrect texts (i.e.,\nhallucinations). To test this theory, I developed a toy model known as 'dancing\nmen', which is a character-level substitution cipher. Additionally, I propose a\nnovel technique based on the T5 (Text To Text Transfer Transformer) model to\nefficiently decipher these codes without relying on frequency analysis. I have\nfound that this T5 model can almost completely solve the cipher, demonstrating\nits ability to acquire equivariance in this frame. This method could be scaled\nup to word-level and sentence-level substitution ciphers, analogous to large\nlanguage models without tokenizers or dictionaries. This scalability makes it\nsuitable for investigating the proposed link between inadequate equivariance\nacquisition and the emergence of hallucinations.\n","authors":["Hisaichi Shibata"],"pdf_url":"https://arxiv.org/pdf/2312.14504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14488v1","updated":"2023-12-22T07:32:47Z","published":"2023-12-22T07:32:47Z","title":"Language Model is a Branch Predictor for Simultaneous Machine\n Translation","summary":" The primary objective of simultaneous machine translation (SiMT) is to\nminimize latency while preserving the quality of the final translation. Drawing\ninspiration from CPU branch prediction techniques, we propose incorporating\nbranch prediction techniques in SiMT tasks to reduce translation latency.\nSpecifically, we utilize a language model as a branch predictor to predict\npotential branch directions, namely, future source words. Subsequently, we\nutilize the predicted source words to decode the output in advance. When the\nactual source word deviates from the predicted source word, we use the real\nsource word to decode the output again, replacing the predicted output. To\nfurther reduce computational costs, we share the parameters of the encoder and\nthe branch predictor, and utilize a pre-trained language model for\ninitialization. Our proposed method can be seamlessly integrated with any SiMT\nmodel. Extensive experimental results demonstrate that our approach can improve\ntranslation quality and latency at the same time. Our code is available at\nhttps://github.com/YinAoXiong/simt_branch_predictor .\n","authors":["Aoxiong Yin","Tianyun Zhong","Haoyuan Li","Siliang Tang","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.14488v1.pdf","comment":"Accepted by IEEE ICASSP 2024"},{"id":"http://arxiv.org/abs/2312.14480v1","updated":"2023-12-22T07:15:55Z","published":"2023-12-22T07:15:55Z","title":"MetaAID 2.5: A Secure Framework for Developing Metaverse Applications\n via Large Language Models","summary":" Large language models (LLMs) are increasingly being used in Metaverse\nenvironments to generate dynamic and realistic content and to control the\nbehavior of non-player characters (NPCs). However, the cybersecurity concerns\nassociated with LLMs have become increasingly prominent. Previous research has\nprimarily focused on patching system vulnerabilities to enhance cybersecurity,\nbut these approaches are not well-suited to the Metaverse, where the virtual\nspace is more complex, LLMs are vulnerable, and ethical user interaction is\ncritical. Moreover, the scope of cybersecurity in the Metaverse is expected to\nexpand significantly. This paper proposes a method for enhancing cybersecurity\nthrough the simulation of user interaction with LLMs. Our goal is to educate\nusers and strengthen their defense capabilities through exposure to a\ncomprehensive simulation system. This system includes extensive Metaverse\ncybersecurity Q&A and attack simulation scenarios. By engaging with these,\nusers will improve their ability to recognize and withstand risks.\nAdditionally, to address the ethical implications of user input, we propose\nusing LLMs as evaluators to assess user content across five dimensions. We\nfurther adapt the models through vocabulary expansion training to better\nunderstand personalized inputs and emoticons. We conduct experiments on\nmultiple LLMs and find that our approach is effective.\n","authors":["Hongyin Zhu"],"pdf_url":"https://arxiv.org/pdf/2312.14480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.07861v2","updated":"2023-12-22T06:33:04Z","published":"2022-06-16T00:37:55Z","title":"Text normalization for low-resource languages: the case of Ligurian","summary":" Text normalization is a crucial technology for low-resource languages which\nlack rigid spelling conventions or that have undergone multiple spelling\nreforms. Low-resource text normalization has so far relied upon hand-crafted\nrules, which are perceived to be more data efficient than neural methods. In\nthis paper we examine the case of text normalization for Ligurian, an\nendangered Romance language. We collect 4,394 Ligurian sentences paired with\ntheir normalized versions, as well as the first open source monolingual corpus\nfor Ligurian. We show that, in spite of the small amounts of data available, a\ncompact transformer-based model can be trained to achieve very low error rates\nby the use of backtranslation and appropriate tokenization.\n","authors":["Stefano Lusito","Edoardo Ferrante","Jean Maillard"],"pdf_url":"https://arxiv.org/pdf/2206.07861v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11997v2","updated":"2023-12-22T05:49:33Z","published":"2023-01-27T21:31:14Z","title":"Prompt-Based Editing for Text Style Transfer","summary":" Prompting approaches have been recently explored in text style transfer,\nwhere a textual prompt is used to query a pretrained language model to generate\nstyle-transferred texts word by word in an autoregressive manner. However, such\na generation process is less controllable and early prediction errors may\naffect future word predictions. In this paper, we present a prompt-based\nediting approach for text style transfer. Specifically, we prompt a pretrained\nlanguage model for style classification and use the classification probability\nto compute a style score. Then, we perform discrete search with word-level\nediting to maximize a comprehensive scoring function for the style-transfer\ntask. In this way, we transform a prompt-based generation problem into a\nclassification one, which is a training-free process and more controllable than\nthe autoregressive generation of sentences. In our experiments, we performed\nboth automatic and human evaluation on three style-transfer benchmark datasets,\nand show that our approach largely outperforms the state-of-the-art systems\nthat have 20 times more parameters. Additional empirical analyses further\ndemonstrate the effectiveness of our approach.\n","authors":["Guoqing Luo","Yu Tong Han","Lili Mou","Mauajama Firdaus"],"pdf_url":"https://arxiv.org/pdf/2301.11997v2.pdf","comment":"Accepted by EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2303.13001v3","updated":"2023-12-22T05:18:25Z","published":"2023-03-23T02:50:38Z","title":"Is ChatGPT A Good Keyphrase Generator? A Preliminary Study","summary":" The emergence of ChatGPT has recently garnered significant attention from the\ncomputational linguistics community. To demonstrate its capabilities as a\nkeyphrase generator, we conduct a preliminary evaluation of ChatGPT for the\nkeyphrase generation task. We evaluate its performance in various aspects,\nincluding keyphrase generation prompts, keyphrase generation diversity, and\nlong document understanding. Our evaluation is based on six benchmark datasets,\nand we adopt the prompt suggested by OpenAI while extending it to six candidate\nprompts. We find that ChatGPT performs exceptionally well on all six candidate\nprompts, with minor performance differences observed across the datasets. Based\non our findings, we conclude that ChatGPT has great potential for keyphrase\ngeneration. Moreover, we discover that ChatGPT still faces challenges when it\ncomes to generating absent keyphrases. Meanwhile, in the final section, we also\npresent some limitations and future expansions of this report.\n","authors":["Mingyang Song","Haiyun Jiang","Shuming Shi","Songfang Yao","Shilong Lu","Yi Feng","Huafeng Liu","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2303.13001v3.pdf","comment":"Technical Report, 6 pages"},{"id":"http://arxiv.org/abs/2310.05707v2","updated":"2023-12-22T04:31:49Z","published":"2023-10-09T13:29:37Z","title":"Guiding Language Model Reasoning with Planning Tokens","summary":" Large language models (LLMs) have recently attracted considerable interest\nfor their ability to perform complex reasoning tasks, such as chain-of-thought\nreasoning. However, most of the existing approaches to enhance this ability\nrely heavily on data-driven methods, while neglecting the structural aspects of\nthe model's reasoning capacity. We find that while LLMs can manage individual\nreasoning steps well, they struggle with maintaining consistency across an\nentire reasoning chain. To solve this, we introduce 'planning tokens' at the\nstart of each reasoning step, serving as a guide for the model. These token\nembeddings are then fine-tuned along with the rest of the model parameters. Our\napproach requires a negligible increase in trainable parameters (just 0.001%)\nand can be applied through either full fine-tuning or a more\nparameter-efficient scheme. We demonstrate our method's effectiveness by\napplying it to three different LLMs, showing notable accuracy improvements\nacross three math word problem datasets w.r.t. plain chain-of-thought\nfine-tuning baselines.\n","authors":["Xinyi Wang","Lucas Caccia","Oleksiy Ostapenko","Xingdi Yuan","Alessandro Sordoni"],"pdf_url":"https://arxiv.org/pdf/2310.05707v2.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.13545v2","updated":"2023-12-22T04:13:51Z","published":"2023-12-21T03:09:38Z","title":"Developing Interactive Tourism Planning: A Dialogue Robot System Powered\n by a Large Language Model","summary":" In recent years, large language models (LLMs) have rapidly proliferated and\nhave been utilized in various tasks, including research in dialogue systems. We\naimed to construct a system that not only leverages the flexible conversational\nabilities of LLMs but also their advanced planning capabilities to reduce the\nspeaking load on human interlocutors and efficiently plan trips. Furthermore,\nwe propose a method that divides the complex task of a travel agency into\nmultiple subtasks, managing each as a separate phase to effectively accomplish\nthe task. Our proposed system confirmed a certain level of success by achieving\nfourth place in the Dialogue Robot Competition 2023 preliminaries rounds. We\nreport on the challenges identified through the competition.\n","authors":["Katsumasa Yoshikawa","Takato Yamazaki","Masaya Ohagi","Tomoya Mizumoto","Keiya Sato"],"pdf_url":"https://arxiv.org/pdf/2312.13545v2.pdf","comment":"This paper is part of the proceedings of the Dialogue Robot\n Competition 2023"},{"id":"http://arxiv.org/abs/2312.14423v1","updated":"2023-12-22T04:01:30Z","published":"2023-12-22T04:01:30Z","title":"Efficacy of Machine-Generated Instructions","summary":" Large \"instruction-tuned\" language models (i.e., finetuned to respond to\ninstructions) have demonstrated a remarkable ability to generalize zero-shot to\nnew tasks. Nevertheless, they depend heavily on human-written instruction data\nthat is often limited in quantity, diversity, and creativity, therefore\nhindering the generality of the tuned model. We conducted a quantitative study\nto figure out the efficacy of machine-generated annotations, where we compare\nthe results of a fine-tuned BERT model with human v/s machine-generated\nannotations. Applying our methods to the vanilla GPT-3 model, we saw that\nmachine generated annotations were 78.54% correct and the fine-tuned model\nachieved a 96.01% model performance compared to the performance with\nhuman-labelled annotations. This result shows that machine-generated\nannotations are a resource and cost effective way to fine-tune down-stream\nmodels.\n","authors":["Samaksh Gulati","Anshit Verma","Manoj Parmar","Palash Chaudhary"],"pdf_url":"https://arxiv.org/pdf/2312.14423v1.pdf","comment":"8 pages, 2 pages references, 6 Tables, 8 Figures"},{"id":"http://arxiv.org/abs/2209.07662v4","updated":"2023-12-22T03:21:35Z","published":"2022-09-16T00:54:44Z","title":"NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional,\n and Explainable Reasoning","summary":" Our goal is a modern approach to answering questions via systematic reasoning\nwhere answers are supported by human interpretable proof trees grounded in an\nNL corpus of authoritative facts. Such a system would help alleviate the\nchallenges of interpretability and hallucination with modern LMs, and the lack\nof grounding of current explanation methods (e.g., Chain-of-Thought). This\npaper proposes a new take on Prolog-based inference engines, where we replace\nhandcrafted rules with a combination of neural language modeling, guided\ngeneration, and semiparametric dense retrieval. Our implementation, NELLIE, is\nthe first system to demonstrate fully interpretable, end-to-end grounded QA as\nentailment tree proof search, going beyond earlier work explaining\nknown-to-be-true facts from text. In experiments, NELLIE outperforms a\nsimilar-sized state-of-the-art reasoner [Tafjord et al., 2022] while producing\nknowledge-grounded explanations. We also find NELLIE can exploit both\nsemi-structured and NL text corpora to guide reasoning. Together these suggest\na new way to jointly reap the benefits of both modern neural methods and\ntraditional symbolic reasoning.\n","authors":["Nathaniel Weir","Peter Clark","Benjamin Van Durme"],"pdf_url":"https://arxiv.org/pdf/2209.07662v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14346v1","updated":"2023-12-22T00:31:46Z","published":"2023-12-22T00:31:46Z","title":"Don't Believe Everything You Read: Enhancing Summarization\n Interpretability through Automatic Identification of Hallucinations in Large\n Language Models","summary":" Large Language Models (LLMs) are adept at text manipulation -- tasks such as\nmachine translation and text summarization. However, these models can also be\nprone to hallucination, which can be detrimental to the faithfulness of any\nanswers that the model provides. Recent works in combating hallucinations in\nLLMs deal with identifying hallucinated sentences and categorizing the\ndifferent ways in which models hallucinate. This paper takes a deep dive into\nLLM behavior with respect to hallucinations, defines a token-level approach to\nidentifying different kinds of hallucinations, and further utilizes this\ntoken-level tagging to improve the interpretability and faithfulness of LLMs in\ndialogue summarization tasks. Through this, the paper presents a new, enhanced\ndataset and a new training paradigm.\n","authors":["Priyesh Vakharia","Devavrat Joshi","Meenal Chavan","Dhananjay Sonawane","Bhrigu Garg","Parsa Mazaheri","Ian Lane"],"pdf_url":"https://arxiv.org/pdf/2312.14346v1.pdf","comment":"All authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2312.14345v1","updated":"2023-12-22T00:30:10Z","published":"2023-12-22T00:30:10Z","title":"Logic-Scaffolding: Personalized Aspect-Instructed Recommendation\n Explanation Generation using LLMs","summary":" The unique capabilities of Large Language Models (LLMs), such as the natural\nlanguage text generation ability, position them as strong candidates for\nproviding explanation for recommendations. However, despite the size of the\nLLM, most existing models struggle to produce zero-shot explanations reliably.\nTo address this issue, we propose a framework called Logic-Scaffolding, that\ncombines the ideas of aspect-based explanation and chain-of-thought prompting\nto generate explanations through intermediate reasoning steps. In this paper,\nwe share our experience in building the framework and present an interactive\ndemonstration for exploring our results.\n","authors":["Behnam Rahdari","Hao Ding","Ziwei Fan","Yifei Ma","Zhuotong Chen","Anoop Deoras","Branislav Kveton"],"pdf_url":"https://arxiv.org/pdf/2312.14345v1.pdf","comment":"The 17th ACM International Conference on Web Search and Data Mining\n (WSDM 2024)"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2312.14929v1","updated":"2023-12-22T18:59:54Z","published":"2023-12-22T18:59:54Z","title":"MACS: Mass Conditioned 3D Hand and Object Motion Synthesis","summary":" The physical properties of an object, such as mass, significantly affect how\nwe manipulate it with our hands. Surprisingly, this aspect has so far been\nneglected in prior work on 3D motion synthesis. To improve the naturalness of\nthe synthesized 3D hand object motions, this work proposes MACS the first MAss\nConditioned 3D hand and object motion Synthesis approach. Our approach is based\non cascaded diffusion models and generates interactions that plausibly adjust\nbased on the object mass and interaction type. MACS also accepts a manually\ndrawn 3D object trajectory as input and synthesizes the natural 3D hand motions\nconditioned by the object mass. This flexibility enables MACS to be used for\nvarious downstream applications, such as generating synthetic training data for\nML tasks, fast animation of hands for graphics workflows, and generating\ncharacter interactions for computer games. We show experimentally that a\nsmall-scale dataset is sufficient for MACS to reasonably generalize across\ninterpolated and extrapolated object masses unseen during the training.\nFurthermore, MACS shows moderate generalization to unseen objects, thanks to\nthe mass-conditioned contact labels generated by our surface contact synthesis\nmodel ConNet. Our comprehensive user study confirms that the synthesized 3D\nhand-object interactions are highly plausible and realistic.\n","authors":["Soshi Shimada","Franziska Mueller","Jan Bednarik","Bardia Doosti","Bernd Bickel","Danhang Tang","Vladislav Golyanik","Jonathan Taylor","Christian Theobalt","Thabo Beeler"],"pdf_url":"https://arxiv.org/pdf/2312.14929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14924v1","updated":"2023-12-22T18:56:35Z","published":"2023-12-22T18:56:35Z","title":"Training Convolutional Neural Networks with the Forward-Forward\n algorithm","summary":" The recent successes in analyzing images with deep neural networks are almost\nexclusively achieved with Convolutional Neural Networks (CNNs). The training of\nthese CNNs, and in fact of all deep neural network architectures, uses the\nbackpropagation algorithm where the output of the network is compared with the\ndesired result and the difference is then used to tune the weights of the\nnetwork towards the desired outcome. In a 2022 preprint, Geoffrey Hinton\nsuggested an alternative way of training which passes the desired results\ntogether with the images at the input of the network. This so called Forward\nForward (FF) algorithm has up to now only been used in fully connected\nnetworks. In this paper, we show how the FF paradigm can be extended to CNNs.\nOur FF-trained CNN, featuring a novel spatially-extended labeling technique,\nachieves a classification accuracy of 99.0% on the MNIST hand-written digits\ndataset. We show how different hyperparameters affect the performance of the\nproposed algorithm and compare the results with CNN trained with the standard\nbackpropagation approach. Furthermore, we use Class Activation Maps to\ninvestigate which type of features are learnt by the FF algorithm.\n","authors":["Riccardo Scodellaro","Ajinkya Kulkarni","Frauke Alves","Matthias Schröter"],"pdf_url":"https://arxiv.org/pdf/2312.14924v1.pdf","comment":"21 pages, 9 figures"},{"id":"http://arxiv.org/abs/2312.14919v1","updated":"2023-12-22T18:51:50Z","published":"2023-12-22T18:51:50Z","title":"Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using\n transformers","summary":" Combining complementary sensor modalities is crucial to providing robust\nperception for safety-critical robotics applications such as autonomous driving\n(AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on\nmonocular depth estimation which is a notoriously difficult task compared to\nusing depth information from the lidar directly. Here, we find that this\napproach does not leverage depth as expected and show that naively improving\ndepth estimation does not lead to improvements in object detection performance\nand that, strikingly, removing depth estimation altogether does not degrade\nobject detection performance. This suggests that relying on monocular depth\ncould be an unnecessary architectural bottleneck during camera-lidar fusion. In\nthis work, we introduce a novel fusion method that bypasses monocular depth\nestimation altogether and instead selects and fuses camera and lidar features\nin a bird's-eye-view grid using a simple attention mechanism. We show that our\nmodel can modulate its use of camera features based on the availability of\nlidar features and that it yields better 3D object detection on the nuScenes\ndataset than baselines relying on monocular depth estimation.\n","authors":["James Gunn","Zygmunt Lenyk","Anuj Sharma","Andrea Donati","Alexandru Buburuzan","John Redford","Romain Mueller"],"pdf_url":"https://arxiv.org/pdf/2312.14919v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14915v1","updated":"2023-12-22T18:50:15Z","published":"2023-12-22T18:50:15Z","title":"PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF","summary":" This paper proposes an end-to-end framework for generating 3D human pose\ndatasets using Neural Radiance Fields (NeRF). Public datasets generally have\nlimited diversity in terms of human poses and camera viewpoints, largely due to\nthe resource-intensive nature of collecting 3D human pose data. As a result,\npose estimators trained on public datasets significantly underperform when\napplied to unseen out-of-distribution samples. Previous works proposed\naugmenting public datasets by generating 2D-3D pose pairs or rendering a large\namount of random data. Such approaches either overlook image rendering or\nresult in suboptimal datasets for pre-trained models. Here we propose PoseGen,\nwhich learns to generate a dataset (human 3D poses and images) with a feedback\nloss from a given pre-trained pose estimator. In contrast to prior art, our\ngenerated data is optimized to improve the robustness of the pre-trained model.\nThe objective of PoseGen is to learn a distribution of data that maximizes the\nprediction error of a given pre-trained model. As the learned data distribution\ncontains OOD samples of the pre-trained model, sampling data from such a\ndistribution for further fine-tuning a pre-trained model improves the\ngeneralizability of the model. This is the first work that proposes NeRFs for\n3D human data generation. NeRFs are data-driven and do not require 3D scans of\nhumans. Therefore, using NeRF for data generation is a new direction for\nconvenient user-specific data generation. Our extensive experiments show that\nthe proposed PoseGen improves two baseline models (SPIN and HybrIK) on four\ndatasets with an average 6% relative improvement.\n","authors":["Mohsen Gholami","Rabab Ward","Z. Jane Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17349v2","updated":"2023-12-22T18:26:13Z","published":"2023-05-27T03:05:07Z","title":"Condition-Invariant Semantic Segmentation","summary":" Adaptation of semantic segmentation networks to different visual conditions\nis vital for robust perception in autonomous cars and robots. However, previous\nwork has shown that most feature-level adaptation methods, which employ\nadversarial training and are validated on synthetic-to-real adaptation, provide\nmarginal gains in condition-level adaptation, being outperformed by simple\npixel-level adaptation via stylization. Motivated by these findings, we propose\nto leverage stylization in performing feature-level adaptation by aligning the\ninternal network features extracted by the encoder of the network from the\noriginal and the stylized view of each input image with a novel feature\ninvariance loss. In this way, we encourage the encoder to extract features that\nare already invariant to the style of the input, allowing the decoder to focus\non parsing these features and not on further abstracting from the specific\nstyle of the input. We implement our method, named Condition-Invariant Semantic\nSegmentation (CISS), on the current state-of-the-art domain adaptation\narchitecture and achieve outstanding results on condition-level adaptation. In\nparticular, CISS sets the new state of the art in the popular\ndaytime-to-nighttime Cityscapes$\\to$Dark Zurich benchmark. Furthermore, our\nmethod achieves the second-best performance on the normal-to-adverse\nCityscapes$\\to$ACDC benchmark. CISS is shown to generalize well to domains\nunseen during training, such as BDD100K-night. Code is publicly available at\nhttps://github.com/SysCV/CISS .\n","authors":["Christos Sakaridis","David Bruggemann","Fisher Yu","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2305.17349v2.pdf","comment":"Submitted for review to IEEE T-PAMI"},{"id":"http://arxiv.org/abs/2312.14891v1","updated":"2023-12-22T18:09:20Z","published":"2023-12-22T18:09:20Z","title":"DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus\n Images","summary":" Diabetic retinopathy (DR) is a prevalent complication of diabetes associated\nwith a significant risk of vision loss. Timely identification is critical to\ncurb vision impairment. Algorithms for DR staging from digital fundus images\n(DFIs) have been recently proposed. However, models often fail to generalize\ndue to distribution shifts between the source domain on which the model was\ntrained and the target domain where it is deployed. A common and particularly\nchallenging shift is often encountered when the source- and target-domain\nsupports do not fully overlap. In this research, we introduce DRStageNet, a\ndeep learning model designed to mitigate this challenge. We used seven publicly\navailable datasets, comprising a total of 93,534 DFIs that cover a variety of\npatient demographics, ethnicities, geographic origins and comorbidities. We\nfine-tune DINOv2, a pretrained model of self-supervised vision transformer, and\nimplement a multi-source domain fine-tuning strategy to enhance generalization\nperformance. We benchmark and demonstrate the superiority of our method to two\nstate-of-the-art benchmarks, including a recently published foundation model.\nWe adapted the grad-rollout method to our regression task in order to provide\nhigh-resolution explainability heatmaps. The error analysis showed that 59\\% of\nthe main errors had incorrect reference labels. DRStageNet is accessible at URL\n[upon acceptance of the manuscript].\n","authors":["Yevgeniy Men","Jonathan Fhima","Leo Anthony Celi","Lucas Zago Ribeiro","Luis Filipe Nakayama","Joachim A. Behar"],"pdf_url":"https://arxiv.org/pdf/2312.14891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.16184v2","updated":"2023-12-22T18:07:41Z","published":"2023-07-30T09:48:36Z","title":"UnIVAL: Unified Model for Image, Video, Audio and Language Tasks","summary":" Large Language Models (LLMs) have made the ambitious quest for generalist\nagents significantly far from being a fantasy. A key hurdle for building such\ngeneral models is the diversity and heterogeneity of tasks and modalities. A\npromising solution is unification, allowing the support of a myriad of tasks\nand modalities within one unified framework. While few large models (e.g.,\nFlamingo (Alayrac et al., 2022), trained on massive datasets, can support more\nthan two modalities, current small to mid-scale unified models are still\nlimited to 2 modalities, usually image-text or video-text. The question that we\nask is: is it possible to build efficiently a unified model that can support\nall modalities? To answer this, we propose UnIVAL, a step further towards this\nambitious goal. Without relying on fancy datasets sizes or models with billions\nof parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities\nand unifies text, images, video, and audio into a single model. Our model is\nefficiently pretrained on many tasks, based on task balancing and multimodal\ncurriculum learning. UnIVAL shows competitive performance to existing\nstate-of-the-art approaches, across image and video-text tasks. The feature\nrepresentations learned from image and video-text modalities, allows the model\nto achieve competitive performance when finetuned on audio-text tasks, despite\nnot being pretrained on audio. Thanks to the unified model, we propose a novel\nstudy on multimodal model merging via weight interpolation of models trained on\ndifferent multimodal tasks, showing their benefits in particular for\nout-of-distribution generalization. Finally, we motivate unification by showing\nthe synergy between tasks. The model weights and code are released here:\nhttps://github.com/mshukor/UnIVAL.\n","authors":["Mustafa Shukor","Corentin Dancette","Alexandre Rame","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2307.16184v2.pdf","comment":"Accepted at TMLR 2023. 40 pages. Project page:\n https://unival-model.github.io/"},{"id":"http://arxiv.org/abs/2306.15774v2","updated":"2023-12-22T17:53:02Z","published":"2023-06-27T19:54:30Z","title":"Next Steps for Human-Centered Generative AI: A Technical Perspective","summary":" Through iterative, cross-disciplinary discussions, we define and propose\nnext-steps for Human-centered Generative AI (HGAI). We contribute a\ncomprehensive research agenda that lays out future directions of Generative AI\nspanning three levels: aligning with human values; assimilating human intents;\nand augmenting human abilities. By identifying these next-steps, we intend to\ndraw interdisciplinary research teams to pursue a coherent set of emergent\nideas in HGAI, focusing on their interested topics while maintaining a coherent\nbig picture of the future work landscape.\n","authors":["Xiang 'Anthony' Chen","Jeff Burke","Ruofei Du","Matthew K. Hong","Jennifer Jacobs","Philippe Laban","Dingzeyu Li","Nanyun Peng","Karl D. D. Willis","Chien-Sheng Wu","Bolei Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.15774v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14871v1","updated":"2023-12-22T17:49:11Z","published":"2023-12-22T17:49:11Z","title":"BrainVis: Exploring the Bridge between Brain and Visual Signals via\n Image Reconstruction","summary":" Analyzing and reconstructing visual stimuli from brain signals effectively\nadvances understanding of the human visual system. However, the EEG signals are\ncomplex and contain a amount of noise. This leads to substantial limitations in\nexisting works of visual stimuli reconstruction from EEG, such as difficulties\nin aligning EEG embeddings with the fine-grained semantic information and a\nheavy reliance on additional large self-collected dataset for training. To\naddress these challenges, we propose a novel approach called BrainVis. Firstly,\nwe divide the EEG signals into various units and apply a self-supervised\napproach on them to obtain EEG time-domain features, in an attempt to ease the\ntraining difficulty. Additionally, we also propose to utilize the\nfrequency-domain features to enhance the EEG representations. Then, we\nsimultaneously align EEG time-frequency embeddings with the interpolation of\nthe coarse and fine-grained semantics in the CLIP space, to highlight the\nprimary visual components and reduce the cross-modal alignment difficulty.\nFinally, we adopt the cascaded diffusion models to reconstruct images. Our\nproposed BrainVis outperforms state of the arts in both semantic fidelity\nreconstruction and generation quality. Notably, we reduce the training data\nscale to 10% of the previous work.\n","authors":["Honghao Fu","Zhiqi Shen","Jing Jih Chin","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14871v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14867v1","updated":"2023-12-22T17:45:19Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIESCORE,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIESCORE leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image\ntasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of\n0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in\nevaluating synthetic images. (3) VIESCORE achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIESCORE shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14834v1","updated":"2023-12-22T17:08:14Z","published":"2023-12-22T17:08:14Z","title":"Prototype-Guided Text-based Person Search based on Rich Chinese\n Descriptions","summary":" Text-based person search aims to simultaneously localize and identify the\ntarget person based on query text from uncropped scene images, which can be\nregarded as the unified task of person detection and text-based person\nretrieval task. In this work, we propose a large-scale benchmark dataset named\nPRW-TPS-CN based on the widely used person search dataset PRW. Our dataset\ncontains 47,102 sentences, which means there is quite more information than\nexisting dataset. These texts precisely describe the person images from top to\nbottom, which in line with the natural description order. We also provide both\nChinese and English descriptions in our dataset for more comprehensive\nevaluation. These characteristics make our dataset more applicable. To\nalleviate the inconsistency between person detection and text-based person\nretrieval, we take advantage of the rich texts in PRW-TPS-CN dataset. We\npropose to aggregate multiple texts as text prototypes to maintain the\nprominent text features of a person, which can better reflect the whole\ncharacter of a person. The overall prototypes lead to generating the image\nattention map to eliminate the detection misalignment causing the decrease of\ntext-based person retrieval. Thus, the inconsistency between person detection\nand text-based person retrieval is largely alleviated. We conduct extensive\nexperiments on the PRW-TPS-CN dataset. The experimental results show the\nPRW-TPS-CN dataset's effectiveness and the state-of-the-art performance of our\napproach.\n","authors":["Ziqiang Wu","Bingpeng Ma"],"pdf_url":"https://arxiv.org/pdf/2312.14834v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.14830v1","updated":"2023-12-22T17:06:08Z","published":"2023-12-22T17:06:08Z","title":"Dreaming of Electrical Waves: Generative Modeling of Cardiac Excitation\n Waves using Diffusion Models","summary":" Electrical waves in the heart form rotating spiral or scroll waves during\nlife-threatening arrhythmias such as atrial or ventricular fibrillation. The\nwave dynamics are typically modeled using coupled partial differential\nequations, which describe reaction-diffusion dynamics in excitable media. More\nrecently, data-driven generative modeling has emerged as an alternative to\ngenerate spatio-temporal patterns in physical and biological systems. Here, we\nexplore denoising diffusion probabilistic models for the generative modeling of\nelectrical wave patterns in cardiac tissue. We trained diffusion models with\nsimulated electrical wave patterns to be able to generate such wave patterns in\nunconditional and conditional generation tasks. For instance, we explored\ninpainting tasks, such as reconstructing three-dimensional wave dynamics from\nsuperficial two-dimensional measurements, and evolving and generating\nparameter-specific dynamics. We characterized and compared the\ndiffusion-generated solutions to solutions obtained with biophysical models and\nfound that diffusion models learn to replicate spiral and scroll waves dynamics\nso well that they could serve as an alternative data-driven approach for the\nmodeling of excitation waves in cardiac tissue. For instance, we found that it\nis possible to initiate ventricular fibrillation (VF) dynamics instantaneously\nwithout having to apply pacing protocols in order to induce wavebreak. The VF\ndynamics can be created in arbitrary ventricular geometries and can be evolved\nover time. However, we also found that diffusion models `hallucinate' wave\npatterns when given insufficient constraints. Regardless of these limitations,\ndiffusion models are an interesting and powerful tool with many potential\napplications in cardiac arrhythmia research and diagnostics.\n","authors":["Tanish Baranwal","Jan Lebert","Jan Christoph"],"pdf_url":"https://arxiv.org/pdf/2312.14830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14828v1","updated":"2023-12-22T17:02:45Z","published":"2023-12-22T17:02:45Z","title":"Plan, Posture and Go: Towards Open-World Text-to-Motion Generation","summary":" Conventional text-to-motion generation methods are usually trained on limited\ntext-motion pairs, making them hard to generalize to open-world scenarios. Some\nworks use the CLIP model to align the motion space and the text space, aiming\nto enable motion generation from natural language motion descriptions. However,\nthey are still constrained to generate limited and unrealistic in-place\nmotions. To address these issues, we present a divide-and-conquer framework\nnamed PRO-Motion, which consists of three modules as motion planner,\nposture-diffuser and go-diffuser. The motion planner instructs Large Language\nModels (LLMs) to generate a sequence of scripts describing the key postures in\nthe target motion. Differing from natural languages, the scripts can describe\nall possible postures following very simple text templates. This significantly\nreduces the complexity of posture-diffuser, which transforms a script to a\nposture, paving the way for open-world generation. Finally, go-diffuser,\nimplemented as another diffusion model, estimates whole-body translations and\nrotations for all postures, resulting in realistic motions. Experimental\nresults have shown the superiority of our method with other counterparts, and\ndemonstrated its capability of generating diverse and realistic motions from\ncomplex open-world prompts such as \"Experiencing a profound sense of joy\". The\nproject page is available at https://moonsliu.github.io/Pro-Motion.\n","authors":["Jinpeng Liu","Wenxun Dai","Chunyu Wang","Yiji Cheng","Yansong Tang","Xin Tong"],"pdf_url":"https://arxiv.org/pdf/2312.14828v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14812v1","updated":"2023-12-22T16:33:45Z","published":"2023-12-22T16:33:45Z","title":"PARDINUS: Weakly supervised discarding of photo-trapping empty images\n based on autoencoders","summary":" Photo-trapping cameras are widely employed for wildlife monitoring. Those\ncameras take photographs when motion is detected to capture images where\nanimals appear. A significant portion of these images are empty - no wildlife\nappears in the image. Filtering out those images is not a trivial task since it\nrequires hours of manual work from biologists. Therefore, there is a notable\ninterest in automating this task. Automatic discarding of empty photo-trapping\nimages is still an open field in the area of Machine Learning. Existing\nsolutions often rely on state-of-the-art supervised convolutional neural\nnetworks that require the annotation of the images in the training phase.\nPARDINUS (Weakly suPervised discARDINg of photo-trapping empty images based on\naUtoencoderS) is constructed on the foundation of weakly supervised learning\nand proves that this approach equals or even surpasses other fully supervised\nmethods that require further labeling work.\n","authors":["David de la Rosa","Antonio J Rivera","María J del Jesus","Francisco Charte"],"pdf_url":"https://arxiv.org/pdf/2312.14812v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14792v1","updated":"2023-12-22T16:06:43Z","published":"2023-12-22T16:06:43Z","title":"The Rate-Distortion-Perception-Classification Tradeoff: Joint Source\n Coding and Modulation via Inverse-Domain GANs","summary":" The joint source coding and modulation (JSCM) framework was enabled by recent\ndevelopments in deep learning, which allows to automatically learn from data,\nand in an end-to-end fashion, the best compression codes and modulation\nschemes. In this paper, we show the existence of a strict tradeoff between\nchannel rate, distortion, perception, and classification accuracy in a JSCM\nscenario. We then propose two image compression methods to navigate that\ntradeoff: an inverse-domain generative adversarial network (ID-GAN), which\nachieves extreme compression, and a simpler, heuristic method that reveals\ninsights about the performance of ID-GAN. Experiment results not only\ncorroborate the theoretical findings, but also demonstrate that the proposed\nID-GAN algorithm significantly improves system performance compared to\ntraditional separation-based methods and recent deep JSCM architectures.\n","authors":["Junli Fang","João F. C. Mota","Baoshan Lu","Weicheng Zhang","Xuemin Hong"],"pdf_url":"https://arxiv.org/pdf/2312.14792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13016v3","updated":"2023-12-22T15:56:46Z","published":"2023-12-20T13:31:11Z","title":"DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View\n Synthesis","summary":" We present DiffPortrait3D, a conditional diffusion model that is capable of\nsynthesizing 3D-consistent photo-realistic novel views from as few as a single\nin-the-wild portrait. Specifically, given a single RGB input, we aim to\nsynthesize plausible but consistent facial details rendered from novel camera\nviews with retained both identity and facial expression. In lieu of\ntime-consuming optimization and fine-tuning, our zero-shot method generalizes\nwell to arbitrary face portraits with unposed camera views, extreme facial\nexpressions, and diverse artistic depictions. At its core, we leverage the\ngenerative prior of 2D diffusion models pre-trained on large-scale image\ndatasets as our rendering backbone, while the denoising is guided with\ndisentangled attentive control of appearance and camera pose. To achieve this,\nwe first inject the appearance context from the reference image into the\nself-attention layers of the frozen UNets. The rendering view is then\nmanipulated with a novel conditional control module that interprets the camera\npose by watching a condition image of a crossed subject from the same view.\nFurthermore, we insert a trainable cross-view attention module to enhance view\nconsistency, which is further strengthened with a novel 3D-aware noise\ngeneration process during inference. We demonstrate state-of-the-art results\nboth qualitatively and quantitatively on our challenging in-the-wild and\nmulti-view benchmarks.\n","authors":["Yuming Gu","You Xie","Hongyi Xu","Guoxian Song","Yichun Shi","Di Chang","Jing Yang","Linjie Luo"],"pdf_url":"https://arxiv.org/pdf/2312.13016v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11146v2","updated":"2023-12-22T15:44:10Z","published":"2023-12-18T12:39:48Z","title":"OsmLocator: locating overlapping scatter marks with a non-training\n generative perspective","summary":" Automated mark localization in scatter images, greatly helpful for\ndiscovering knowledge and understanding enormous document images and reasoning\nin visual question answering AI systems, is a highly challenging problem\nbecause of the ubiquity of overlapping marks. Locating overlapping marks faces\nmany difficulties such as no texture, less contextual information, hallow shape\nand tiny size. Here, we formulate it as a combinatorial optimization problem on\nclustering-based re-visualization from a non-training generative perspective,\nto locate scatter marks by finding the status of multi-variables when an\nobjective function reaches a minimum. The objective function is constructed on\ndifference between binarized scatter images and corresponding generated\nre-visualization based on their clustering. Fundamentally, re-visualization\ntries to generate a new scatter graph only taking a rasterized scatter image as\nan input, and clustering is employed to provide the information for such\nre-visualization. This method could stably locate severely-overlapping,\nvariable-size and variable-shape marks in scatter images without dependence of\nany training dataset or reference. Meanwhile, we propose an adaptive variant of\nsimulated annealing which can works on various connected regions. In addition,\nwe especially built a dataset named SML2023 containing hundreds of scatter\nimages with different markers and various levels of overlapping severity, and\ntested the proposed method and compared it to existing methods. The results\nshow that it can accurately locate most marks in scatter images with different\noverlapping severity and marker types, with about 0.3 absolute increase on an\nassignment-cost-based metric in comparison with state-of-the-art methods. This\nwork is of value to data mining on massive web pages and literatures, and\nshedding new light on image measurement such as bubble counting.\n","authors":["Yuming Qiu","Aleksandra Pizurica","Qi Ming","Nicolas Nadisic"],"pdf_url":"https://arxiv.org/pdf/2312.11146v2.pdf","comment":"22pages"},{"id":"http://arxiv.org/abs/2312.14776v1","updated":"2023-12-22T15:43:12Z","published":"2023-12-22T15:43:12Z","title":"Compressing Image-to-Image Translation GANs Using Local Density\n Structures on Their Learned Manifold","summary":" Generative Adversarial Networks (GANs) have shown remarkable success in\nmodeling complex data distributions for image-to-image translation. Still,\ntheir high computational demands prohibit their deployment in practical\nscenarios like edge devices. Existing GAN compression methods mainly rely on\nknowledge distillation or convolutional classifiers' pruning techniques. Thus,\nthey neglect the critical characteristic of GANs: their local density structure\nover their learned manifold. Accordingly, we approach GAN compression from a\nnew perspective by explicitly encouraging the pruned model to preserve the\ndensity structure of the original parameter-heavy model on its learned\nmanifold. We facilitate this objective for the pruned model by partitioning the\nlearned manifold of the original generator into local neighborhoods around its\ngenerated samples. Then, we propose a novel pruning objective to regularize the\npruned model to preserve the local density structure over each neighborhood,\nresembling the kernel density estimation method. Also, we develop a\ncollaborative pruning scheme in which the discriminator and generator are\npruned by two pruning agents. We design the agents to capture interactions\nbetween the generator and discriminator by exchanging their peer's feedback\nwhen determining corresponding models' architectures. Thanks to such a design,\nour pruning method can efficiently find performant sub-networks and can\nmaintain the balance between the generator and discriminator more effectively\ncompared to baselines during pruning, thereby showing more stable pruning\ndynamics. Our experiments on image translation GAN models, Pix2Pix and\nCycleGAN, with various benchmark datasets and architectures demonstrate our\nmethod's effectiveness.\n","authors":["Alireza Ganjdanesh","Shangqian Gao","Hirad Alipanah","Heng Huang"],"pdf_url":"https://arxiv.org/pdf/2312.14776v1.pdf","comment":"The 38th Annual AAAI Conference on Artificial Intelligence, AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14773v1","updated":"2023-12-22T15:39:37Z","published":"2023-12-22T15:39:37Z","title":"Cross-Age and Cross-Site Domain Shift Impacts on Deep Learning-Based\n White Matter Fiber Estimation in Newborn and Baby Brains","summary":" Deep learning models have shown great promise in estimating tissue\nmicrostructure from limited diffusion magnetic resonance imaging data. However,\nthese models face domain shift challenges when test and train data are from\ndifferent scanners and protocols, or when the models are applied to data with\ninherent variations such as the developing brains of infants and children\nscanned at various ages. Several techniques have been proposed to address some\nof these challenges, such as data harmonization or domain adaptation in the\nadult brain. However, those techniques remain unexplored for the estimation of\nfiber orientation distribution functions in the rapidly developing brains of\ninfants. In this work, we extensively investigate the age effect and domain\nshift within and across two different cohorts of 201 newborns and 165 babies\nusing the Method of Moments and fine-tuning strategies. Our results show that\nreduced variations in the microstructural development of babies in comparison\nto newborns directly impact the deep learning models' cross-age performance. We\nalso demonstrate that a small number of target domain samples can significantly\nmitigate domain shift problems.\n","authors":["Rizhong Lin","Ali Gholipour","Jean-Philippe Thiran","Davood Karimi","Hamza Kebiri","Meritxell Bach Cuadra"],"pdf_url":"https://arxiv.org/pdf/2312.14773v1.pdf","comment":"5 pages, 5 figures, submitted to ISBI 2024"},{"id":"http://arxiv.org/abs/2312.14733v1","updated":"2023-12-22T14:40:55Z","published":"2023-12-22T14:40:55Z","title":"Harnessing Diffusion Models for Visual Perception with Meta Prompts","summary":" The issue of generative pretraining for vision models has persisted as a\nlong-standing conundrum. At present, the text-to-image (T2I) diffusion model\ndemonstrates remarkable proficiency in generating high-definition images\nmatching textual inputs, a feat made possible through its pre-training on\nlarge-scale image-text pairs. This leads to a natural inquiry: can diffusion\nmodels be utilized to tackle visual perception tasks? In this paper, we propose\na simple yet effective scheme to harness a diffusion model for visual\nperception tasks. Our key insight is to introduce learnable embeddings (meta\nprompts) to the pre-trained diffusion models to extract proper features for\nperception. The effect of meta prompts are two-fold. First, as a direct\nreplacement of the text embeddings in the T2I models, it can activate\ntask-relevant features during feature extraction. Second, it will be used to\nre-arrange the extracted features to ensures that the model focuses on the most\npertinent features for the task on hand. Additionally, we design a recurrent\nrefinement training strategy that fully leverages the property of diffusion\nmodels, thereby yielding stronger visual features. Extensive experiments across\nvarious benchmarks validate the effectiveness of our approach. Our approach\nachieves new performance records in depth estimation tasks on NYU depth V2 and\nKITTI, and in semantic segmentation task on CityScapes. Concurrently, the\nproposed method attains results comparable to the current state-of-the-art in\nsemantic segmentation on ADE20K and pose estimation on COCO datasets, further\nexemplifying its robustness and versatility.\n","authors":["Qiang Wan","Zilong Huang","Bingyi Kang","Jiashi Feng","Li Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.14733v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14724v1","updated":"2023-12-22T14:33:54Z","published":"2023-12-22T14:33:54Z","title":"Images in Discrete Choice Modeling: Addressing Data Isomorphism in\n Multi-Modality Inputs","summary":" This paper explores the intersection of Discrete Choice Modeling (DCM) and\nmachine learning, focusing on the integration of image data into DCM's utility\nfunctions and its impact on model interpretability. We investigate the\nconsequences of embedding high-dimensional image data that shares isomorphic\ninformation with traditional tabular inputs within a DCM framework. Our study\nreveals that neural network (NN) components learn and replicate tabular\nvariable representations from images when co-occurrences exist, thereby\ncompromising the interpretability of DCM parameters. We propose and benchmark\ntwo methodologies to address this challenge: architectural design adjustments\nto segregate redundant information, and isomorphic information mitigation\nthrough source information masking and inpainting. Our experiments, conducted\non a semi-synthetic dataset, demonstrate that while architectural modifications\nprove inconclusive, direct mitigation at the data source shows to be a more\neffective strategy in maintaining the integrity of DCM's interpretable\nparameters. The paper concludes with insights into the applicability of our\nfindings in real-world settings and discusses the implications for future\nresearch in hybrid modeling that combines complex data modalities. Full control\nof tabular and image data congruence is attained by using the MIT moral machine\ndataset, and both inputs are merged into a choice model by deploying the\nLearning Multinomial Logit (L-MNL) framework.\n","authors":["Brian Sifringer","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2312.14724v1.pdf","comment":"17 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2309.06978v4","updated":"2023-12-22T14:16:59Z","published":"2023-09-13T14:13:08Z","title":"Differentiable JPEG: The Devil is in the Details","summary":" JPEG remains one of the most widespread lossy image coding methods. However,\nthe non-differentiable nature of JPEG restricts the application in deep\nlearning pipelines. Several differentiable approximations of JPEG have recently\nbeen proposed to address this issue. This paper conducts a comprehensive review\nof existing diff. JPEG approaches and identifies critical details that have\nbeen missed by previous methods. To this end, we propose a novel diff. JPEG\napproach, overcoming previous limitations. Our approach is differentiable\nw.r.t. the input image, the JPEG quality, the quantization tables, and the\ncolor conversion parameters. We evaluate the forward and backward performance\nof our diff. JPEG approach against existing methods. Additionally, extensive\nablations are performed to evaluate crucial design choices. Our proposed diff.\nJPEG resembles the (non-diff.) reference implementation best, significantly\nsurpassing the recent-best diff. approach by $3.47$dB (PSNR) on average. For\nstrong compression rates, we can even improve PSNR by $9.51$dB. Strong\nadversarial attack results are yielded by our diff. JPEG, demonstrating the\neffective gradient approximation. Our code is available at\nhttps://github.com/necla-ml/Diff-JPEG.\n","authors":["Christoph Reich","Biplob Debnath","Deep Patel","Srimat Chakradhar"],"pdf_url":"https://arxiv.org/pdf/2309.06978v4.pdf","comment":"Accepted at WACV 2024. Project page:\n https://christophreich1996.github.io/differentiable_jpeg/ WACV paper:\n https://openaccess.thecvf.com/content/WACV2024/html/Reich_Differentiable_JPEG_The_Devil_Is_in_the_Details_WACV_2024_paper.html"},{"id":"http://arxiv.org/abs/2312.09854v2","updated":"2023-12-22T14:11:38Z","published":"2023-12-15T15:01:41Z","title":"Q-Segment: Segmenting Images In-Sensor for Vessel-Based Medical\n Diagnosis","summary":" This paper addresses the growing interest in deploying deep learning models\ndirectly in-sensor. We present \"Q-Segment\", a quantized real-time segmentation\nalgorithm, and conduct a comprehensive evaluation on a low-power edge vision\nplatform with an in-sensors processor, the Sony IMX500. One of the main goals\nof the model is to achieve end-to-end image segmentation for vessel-based\nmedical diagnosis. Deployed on the IMX500 platform, Q-Segment achieves\nultra-low inference time in-sensor only 0.23 ms and power consumption of only\n72mW. We compare the proposed network with state-of-the-art models, both float\nand quantized, demonstrating that the proposed solution outperforms existing\nnetworks on various platforms in computing efficiency, e.g., by a factor of 75x\ncompared to ERFNet. The network employs an encoder-decoder structure with skip\nconnections, and results in a binary accuracy of 97.25% and an Area Under the\nReceiver Operating Characteristic Curve (AUC) of 96.97% on the CHASE dataset.\nWe also present a comparison of the IMX500 processing core with the Sony\nSpresense, a low-power multi-core ARM Cortex-M microcontroller, and a\nsingle-core ARM Cortex-M4 showing that it can achieve in-sensor processing with\nend-to-end low latency (17 ms) and power concumption (254mW). This research\ncontributes valuable insights into edge-based image segmentation, laying the\nfoundation for efficient algorithms tailored to low-power environments.\n","authors":["Pietro Bonazzi","Julian Moosmann","Yawei Li","Sizhen Bian","Michele Magno"],"pdf_url":"https://arxiv.org/pdf/2312.09854v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14706v1","updated":"2023-12-22T14:06:44Z","published":"2023-12-22T14:06:44Z","title":"BonnBeetClouds3D: A Dataset Towards Point Cloud-based Organ-level\n Phenotyping of Sugar Beet Plants under Field Conditions","summary":" Agricultural production is facing severe challenges in the next decades\ninduced by climate change and the need for sustainability, reducing its impact\non the environment. Advancements in field management through non-chemical\nweeding by robots in combination with monitoring of crops by autonomous\nunmanned aerial vehicles (UAVs) and breeding of novel and more resilient crop\nvarieties are helpful to address these challenges. The analysis of plant\ntraits, called phenotyping, is an essential activity in plant breeding, it\nhowever involves a great amount of manual labor. With this paper, we address\nthe problem of automatic fine-grained organ-level geometric analysis needed for\nprecision phenotyping. As the availability of real-world data in this domain is\nrelatively scarce, we propose a novel dataset that was acquired using UAVs\ncapturing high-resolution images of a real breeding trial containing 48 plant\nvarieties and therefore covering great morphological and appearance diversity.\nThis enables the development of approaches for autonomous phenotyping that\ngeneralize well to different varieties. Based on overlapping high-resolution\nimages from multiple viewing angles, we compute photogrammetric dense point\nclouds and provide detailed and accurate point-wise labels for plants, leaves,\nand salient points as the tip and the base. Additionally, we include\nmeasurements of phenotypic traits performed by experts from the German Federal\nPlant Variety Office on the real plants, allowing the evaluation of new\napproaches not only on segmentation and keypoint detection but also directly on\nthe downstream tasks. The provided labeled point clouds enable fine-grained\nplant analysis and support further progress in the development of automatic\nphenotyping approaches, but also enable further research in surface\nreconstruction, point cloud completion, and semantic interpretation of point\nclouds.\n","authors":["Elias Marks","Jonas Bömer","Federico Magistri","Anurag Sah","Jens Behley","Cyrill Stachniss"],"pdf_url":"https://arxiv.org/pdf/2312.14706v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14705v1","updated":"2023-12-22T14:06:03Z","published":"2023-12-22T14:06:03Z","title":"SCUNet++: Assessment of Pulmonary Embolism CT Image Segmentation\n Leveraging Swin-UNet and CNN Bottleneck Hybrid Architecture with Multi-Fusion\n Dense Skip Connection","summary":" Pulmonary embolism (PE) is a prevalent lung disease that can lead to right\nventricular hypertrophy and failure in severe cases, ranking second in severity\nonly to myocardial infarction and sudden death. Pulmonary artery CT angiography\n(CTPA) is a widely used diagnostic method for PE. However, PE detection\npresents challenges in clinical practice due to limitations in imaging\ntechnology. CTPA can produce noises similar to PE, making confirmation of its\npresence time-consuming and prone to overdiagnosis. Nevertheless, the\ntraditional segmentation method of PE can not fully consider the hierarchical\nstructure of features, local and global spatial features of PE CT images. In\nthis paper, we propose an automatic PE segmentation method called SCUNet++\n(Swin Conv UNet++). This method incorporates multiple fusion dense skip\nconnections between the encoder and decoder, utilizing the Swin Transformer as\nthe encoder. And fuses features of different scales in the decoder subnetwork\nto compensate for spatial information loss caused by the inevitable\ndownsampling in Swin-UNet or other state-of-the-art methods, effectively\nsolving the above problem. We provide a theoretical analysis of this method in\ndetail and validate it on publicly available PE CT image datasets FUMPE and\nCAD-PE. The experimental results indicate that our proposed method achieved a\nDice similarity coefficient (DSC) of 83.47% and a Hausdorff distance 95th\npercentile (HD95) of 3.83 on the FUMPE dataset, as well as a DSC of 83.42% and\nan HD95 of 5.10 on the CAD-PE dataset. These findings demonstrate that our\nmethod exhibits strong performance in PE segmentation tasks, potentially\nenhancing the accuracy of automatic segmentation of PE and providing a powerful\ndiagnostic tool for clinical physicians. Our source code and new FUMPE dataset\nare available at https://github.com/JustlfC03/SCUNet-plusplus.\n","authors":["Yifei Chen","Binfeng Zou","Zhaoxin Guo","Yiyu Huang","Yifan Huang","Feiwei Qin","Qinhai Li","Changmiao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14705v1.pdf","comment":"10 pages, 7 figures, accept wacv2024"},{"id":"http://arxiv.org/abs/2312.14697v1","updated":"2023-12-22T13:56:53Z","published":"2023-12-22T13:56:53Z","title":"Pola4All: survey of polarimetric applications and an open-source toolkit\n to analyze polarization","summary":" Polarization information of the light can provide rich cues for computer\nvision and scene understanding tasks, such as the type of material, pose, and\nshape of the objects. With the advent of new and cheap polarimetric sensors,\nthis imaging modality is becoming accessible to a wider public for solving\nproblems such as pose estimation, 3D reconstruction, underwater navigation, and\ndepth estimation. However, we observe several limitations regarding the usage\nof this sensorial modality, as well as a lack of standards and publicly\navailable tools to analyze polarization images. Furthermore, although\npolarization camera manufacturers usually provide acquisition tools to\ninterface with their cameras, they rarely include processing algorithms that\nmake use of the polarization information. In this paper, we review recent\nadvances in applications that involve polarization imaging, including a\ncomprehensive survey of recent advances on polarization for vision and robotics\nperception tasks. We also introduce a complete software toolkit that provides\ncommon standards to communicate with and process information from most of the\nexisting micro-grid polarization cameras on the market. The toolkit also\nimplements several image processing algorithms for this modality, and it is\npublicly available on GitHub: https://github.com/vibot-lab/Pola4all_JEI_2023.\n","authors":["Joaquin Rodriguez","Lew-Fock-Chong Lew-Yan-Voon","Renato Martins","Olivier Morel"],"pdf_url":"https://arxiv.org/pdf/2312.14697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11241v2","updated":"2023-12-22T13:55:53Z","published":"2023-04-21T20:22:17Z","title":"AutoNeRF: Training Implicit Scene Representations with Autonomous Agents","summary":" Implicit representations such as Neural Radiance Fields (NeRF) have been\nshown to be very effective at novel view synthesis. However, these models\ntypically require manual and careful human data collection for training. In\nthis paper, we present AutoNeRF, a method to collect data required to train\nNeRFs using autonomous embodied agents. Our method allows an agent to explore\nan unseen environment efficiently and use the experience to build an implicit\nmap representation autonomously. We compare the impact of different exploration\nstrategies including handcrafted frontier-based exploration, end-to-end and\nmodular approaches composed of trained high-level planners and classical\nlow-level path followers. We train these models with different reward functions\ntailored to this problem and evaluate the quality of the learned\nrepresentations on four different downstream tasks: classical viewpoint\nrendering, map reconstruction, planning, and pose refinement. Empirical results\nshow that NeRFs can be trained on actively collected data using just a single\nepisode of experience in an unseen environment, and can be used for several\ndownstream robotic tasks, and that modular trained exploration models\noutperform other classical and end-to-end baselines. Finally, we show that\nAutoNeRF can reconstruct large-scale scenes, and is thus a useful tool to\nperform scene-specific adaptation as the produced 3D environment models can be\nloaded into a simulator to fine-tune a policy of interest.\n","authors":["Pierre Marza","Laetitia Matignon","Olivier Simonin","Dhruv Batra","Christian Wolf","Devendra Singh Chaplot"],"pdf_url":"https://arxiv.org/pdf/2304.11241v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.17093v2","updated":"2023-12-22T13:37:21Z","published":"2023-09-29T09:41:19Z","title":"Prototype-based Aleatoric Uncertainty Quantification for Cross-modal\n Retrieval","summary":" Cross-modal Retrieval methods build similarity relations between vision and\nlanguage modalities by jointly learning a common representation space. However,\nthe predictions are often unreliable due to the Aleatoric uncertainty, which is\ninduced by low-quality data, e.g., corrupt images, fast-paced videos, and\nnon-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric\nUncertainty Quantification (PAU) framework to provide trustworthy predictions\nby quantifying the uncertainty arisen from the inherent data ambiguity.\nConcretely, we first construct a set of various learnable prototypes for each\nmodality to represent the entire semantics subspace. Then Dempster-Shafer\nTheory and Subjective Logic Theory are utilized to build an evidential\ntheoretical framework by associating evidence with Dirichlet Distribution\nparameters. The PAU model induces accurate uncertainty and reliable predictions\nfor cross-modal retrieval. Extensive experiments are performed on four major\nbenchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the\neffectiveness of our method. The code is accessible at\nhttps://github.com/leolee99/PAU.\n","authors":["Hao Li","Jingkuan Song","Lianli Gao","Xiaosu Zhu","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2309.17093v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14664v1","updated":"2023-12-22T13:01:21Z","published":"2023-12-22T13:01:21Z","title":"Density Uncertainty Quantification with NeRF-Ensembles: Impact of Data\n and Scene Constraints","summary":" In the fields of computer graphics, computer vision and photogrammetry,\nNeural Radiance Fields (NeRFs) are a major topic driving current research and\ndevelopment. However, the quality of NeRF-generated 3D scene reconstructions\nand subsequent surface reconstructions, heavily relies on the network output,\nparticularly the density. Regarding this critical aspect, we propose to utilize\nNeRF-Ensembles that provide a density uncertainty estimate alongside the mean\ndensity. We demonstrate that data constraints such as low-quality images and\nposes lead to a degradation of the training process, increased density\nuncertainty and decreased predicted density. Even with high-quality input data,\nthe density uncertainty varies based on scene constraints such as acquisition\nconstellations, occlusions and material properties. NeRF-Ensembles not only\nprovide a tool for quantifying the uncertainty but exhibit two promising\nadvantages: Enhanced robustness and artifact removal. Through the utilization\nof NeRF-Ensembles instead of single NeRFs, small outliers are removed, yielding\na smoother output with improved completeness of structures. Furthermore,\napplying percentile-based thresholds on density uncertainty outliers proves to\nbe effective for the removal of large (foggy) artifacts in post-processing. We\nconduct our methodology on 3 different datasets: (i) synthetic benchmark\ndataset, (ii) real benchmark dataset, (iii) real data under realistic recording\nconditions and sensors.\n","authors":["Miriam Jäger","Steven Landgraf","Boris Jutzi"],"pdf_url":"https://arxiv.org/pdf/2312.14664v1.pdf","comment":"21 pages, 12 figures, 5 tables"},{"id":"http://arxiv.org/abs/2312.06275v2","updated":"2023-12-22T13:01:13Z","published":"2023-12-11T10:26:21Z","title":"DG-TTA: Out-of-domain medical image segmentation through Domain\n Generalization and Test-Time Adaptation","summary":" Applying pre-trained medical segmentation models on out-of-domain images\noften yields predictions of insufficient quality. Several strategies have been\nproposed to maintain model performance, such as finetuning or unsupervised- and\nsource-free domain adaptation. These strategies set restrictive requirements\nfor data availability. In this study, we propose to combine domain\ngeneralization and test-time adaptation to create a highly effective approach\nfor reusing pre-trained models in unseen target domains. Domain-generalized\npre-training on source data is used to obtain the best initial performance in\nthe target domain. We introduce the MIND descriptor previously used in image\nregistration tasks as a further technique to achieve generalization and present\nsuperior performance for small-scale datasets compared to existing approaches.\nAt test-time, high-quality segmentation for every single unseen scan is ensured\nby optimizing the model weights for consistency given different image\naugmentations. That way, our method enables separate use of source and target\ndata and thus removes current data availability barriers. Moreover, the\npresented method is highly modular as it does not require specific model\narchitectures or prior knowledge of involved domains and labels. We demonstrate\nthis by integrating it into the nnUNet, which is currently the most popular and\naccurate framework for medical image segmentation. We employ multiple datasets\ncovering abdominal, cardiac, and lumbar spine scans and compose several\nout-of-domain scenarios in this study. We demonstrate that our method, combined\nwith pre-trained whole-body CT models, can effectively segment MR images with\nhigh accuracy in all of the aforementioned scenarios. Open-source code can be\nfound here: https://github.com/multimodallearning/DG-TTA\n","authors":["Christian Weihsbach","Christian N. Kruse","Alexander Bigalke","Mattias P. Heinrich"],"pdf_url":"https://arxiv.org/pdf/2312.06275v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2312.14650v1","updated":"2023-12-22T12:34:58Z","published":"2023-12-22T12:34:58Z","title":"Global Occlusion-Aware Transformer for Robust Stereo Matching","summary":" Despite the remarkable progress facilitated by learning-based stereo-matching\nalgorithms, the performance in the ill-conditioned regions, such as the\noccluded regions, remains a bottleneck. Due to the limited receptive field,\nexisting CNN-based methods struggle to handle these ill-conditioned regions\neffectively. To address this issue, this paper introduces a novel\nattention-based stereo-matching network called Global Occlusion-Aware\nTransformer (GOAT) to exploit long-range dependency and occlusion-awareness\nglobal context for disparity estimation. In the GOAT architecture, a parallel\ndisparity and occlusion estimation module PDO is proposed to estimate the\ninitial disparity map and the occlusion mask using a parallel attention\nmechanism. To further enhance the disparity estimates in the occluded regions,\nan occlusion-aware global aggregation module (OGA) is proposed. This module\naims to refine the disparity in the occluded regions by leveraging restricted\nglobal correlation within the focus scope of the occluded areas. Extensive\nexperiments were conducted on several public benchmark datasets including\nSceneFlow, KITTI 2015, and Middlebury. The results show that the proposed GOAT\ndemonstrates outstanding performance among all benchmarks, particularly in the\noccluded regions.\n","authors":["Zihua Liu","Yizhou Li","Masatoshi Okutomi"],"pdf_url":"https://arxiv.org/pdf/2312.14650v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14635v1","updated":"2023-12-22T12:13:19Z","published":"2023-12-22T12:13:19Z","title":"Fluid Simulation on Neural Flow Maps","summary":" We introduce Neural Flow Maps, a novel simulation method bridging the\nemerging paradigm of implicit neural representations with fluid simulation\nbased on the theory of flow maps, to achieve state-of-the-art simulation of\ninviscid fluid phenomena. We devise a novel hybrid neural field representation,\nSpatially Sparse Neural Fields (SSNF), which fuses small neural networks with a\npyramid of overlapping, multi-resolution, and spatially sparse grids, to\ncompactly represent long-term spatiotemporal velocity fields at high accuracy.\nWith this neural velocity buffer in hand, we compute long-term, bidirectional\nflow maps and their Jacobians in a mechanistically symmetric manner, to\nfacilitate drastic accuracy improvement over existing solutions. These\nlong-range, bidirectional flow maps enable high advection accuracy with low\ndissipation, which in turn facilitates high-fidelity incompressible flow\nsimulations that manifest intricate vortical structures. We demonstrate the\nefficacy of our neural fluid simulation in a variety of challenging simulation\nscenarios, including leapfrogging vortices, colliding vortices, vortex\nreconnections, as well as vortex generation from moving obstacles and density\ndifferences. Our examples show increased performance over existing methods in\nterms of energy conservation, visual complexity, adherence to experimental\nobservations, and preservation of detailed vortical structures.\n","authors":["Yitong Deng","Hong-Xing Yu","Diyang Zhang","Jiajun Wu","Bo Zhu"],"pdf_url":"https://arxiv.org/pdf/2312.14635v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14630v1","updated":"2023-12-22T12:08:08Z","published":"2023-12-22T12:08:08Z","title":"A Language-based solution to enable Metaverse Retrieval","summary":" Recently, the Metaverse is becoming increasingly attractive, with millions of\nusers accessing the many available virtual worlds. However, how do users find\nthe one Metaverse which best fits their current interests? So far, the search\nprocess is mostly done by word of mouth, or by advertisement on\ntechnology-oriented websites. However, the lack of search engines similar to\nthose available for other multimedia formats (e.g., YouTube for videos) is\nshowing its limitations, since it is often cumbersome to find a Metaverse based\non some specific interests using the available methods, while also making it\ndifficult to discover user-created ones which lack strong advertisement. To\naddress this limitation, we propose to use language to naturally describe the\ndesired contents of the Metaverse a user wishes to find. Second, we highlight\nthat, differently from more conventional 3D scenes, Metaverse scenarios\nrepresent a more complex data format since they often contain one or more types\nof multimedia which influence the relevance of the scenario itself to a user\nquery. Therefore, in this work, we create a novel task, called\nText-to-Metaverse retrieval, which aims at modeling these aspects while also\ntaking the cross-modal relations with the textual data into account. Since we\nare the first ones to tackle this problem, we also collect a dataset of 33000\nMetaverses, each of which consists of a 3D scene enriched with multimedia\ncontent. Finally, we design and implement a deep learning framework based on\ncontrastive learning, resulting in a thorough experimental setup.\n","authors":["Ali Abdari","Alex Falcon","Giuseppe Serra"],"pdf_url":"https://arxiv.org/pdf/2312.14630v1.pdf","comment":"Accepted at 30th International Conference on Multimedia Modeling-\n MMM2024"},{"id":"http://arxiv.org/abs/2306.05832v2","updated":"2023-12-22T12:07:26Z","published":"2023-06-09T12:04:13Z","title":"Sketch Beautification: Learning Part Beautification and Structure\n Refinement for Sketches of Man-made Objects","summary":" We present a novel freehand sketch beautification method, which takes as\ninput a freely drawn sketch of a man-made object and automatically beautifies\nit both geometrically and structurally. Beautifying a sketch is challenging\nbecause of its highly abstract and heavily diverse drawing manner. Existing\nmethods are usually confined to the distribution of their limited training\nsamples and thus cannot beautify freely drawn sketches with rich variations. To\naddress this challenge, we adopt a divide-and-combine strategy. Specifically,\nwe first parse an input sketch into semantic components, beautify individual\ncomponents by a learned part beautification module based on part-level implicit\nmanifolds, and then reassemble the beautified components through a structure\nbeautification module. With this strategy, our method can go beyond the\ntraining samples and handle novel freehand sketches. We demonstrate the\neffectiveness of our system with extensive experiments and a perceptive study.\n","authors":["Deng Yu","Manfred Lau","Lin Gao","Hongbo Fu"],"pdf_url":"https://arxiv.org/pdf/2306.05832v2.pdf","comment":"Accepted by IEEE Transactions on Visualization and Computer Graphics"},{"id":"http://arxiv.org/abs/2309.02139v2","updated":"2023-12-22T11:56:53Z","published":"2023-09-05T11:29:30Z","title":"Self-Supervised Pre-Training Boosts Semantic Scene Segmentation on LiDAR\n Data","summary":" Airborne LiDAR systems have the capability to capture the Earth's surface by\ngenerating extensive point cloud data comprised of points mainly defined by 3D\ncoordinates. However, labeling such points for supervised learning tasks is\ntime-consuming. As a result, there is a need to investigate techniques that can\nlearn from unlabeled data to significantly reduce the number of annotated\nsamples. In this work, we propose to train a self-supervised encoder with\nBarlow Twins and use it as a pre-trained network in the task of semantic scene\nsegmentation. The experimental results demonstrate that our unsupervised\npre-training boosts performance once fine-tuned on the supervised task,\nespecially for under-represented categories.\n","authors":["Mariona Carós","Ariadna Just","Santi Seguí","Jordi Vitrià"],"pdf_url":"https://arxiv.org/pdf/2309.02139v2.pdf","comment":"International conference Machine Vision Applications 2023"},{"id":"http://arxiv.org/abs/2312.14626v1","updated":"2023-12-22T11:51:20Z","published":"2023-12-22T11:51:20Z","title":"DSAP: Analyzing Bias Through Demographic Comparison of Datasets","summary":" In the last few years, Artificial Intelligence systems have become\nincreasingly widespread. Unfortunately, these systems can share many biases\nwith human decision-making, including demographic biases. Often, these biases\ncan be traced back to the data used for training, where large uncurated\ndatasets have become the norm. Despite our knowledge of these biases, we still\nlack general tools to detect and quantify them, as well as to compare the\nbiases in different datasets. Thus, in this work, we propose DSAP (Demographic\nSimilarity from Auxiliary Profiles), a two-step methodology for comparing the\ndemographic composition of two datasets. DSAP can be deployed in three key\napplications: to detect and characterize demographic blind spots and bias\nissues across datasets, to measure dataset demographic bias in single datasets,\nand to measure dataset demographic shift in deployment scenarios. An essential\nfeature of DSAP is its ability to robustly analyze datasets without explicit\ndemographic labels, offering simplicity and interpretability for a wide range\nof situations. To show the usefulness of the proposed methodology, we consider\nthe Facial Expression Recognition task, where demographic bias has previously\nbeen found. The three applications are studied over a set of twenty datasets\nwith varying properties. The code is available at\nhttps://github.com/irisdominguez/DSAP.\n","authors":["Iris Dominguez-Catena","Daniel Paternain","Mikel Galar"],"pdf_url":"https://arxiv.org/pdf/2312.14626v1.pdf","comment":"18 pages, 11 figures"},{"id":"http://arxiv.org/abs/2305.05400v3","updated":"2023-12-22T11:30:28Z","published":"2023-05-09T12:45:43Z","title":"Investigating the Corruption Robustness of Image Classifiers with Random\n Lp-norm Corruptions","summary":" Robustness is a fundamental property of machine learning classifiers required\nto achieve safety and reliability. In the field of adversarial robustness of\nimage classifiers, robustness is commonly defined as the stability of a model\nto all input changes within a p-norm distance. However, in the field of random\ncorruption robustness, variations observed in the real world are used, while\np-norm corruptions are rarely considered. This study investigates the use of\nrandom p-norm corruptions to augment the training and test data of image\nclassifiers. We evaluate the model robustness against imperceptible random\np-norm corruptions and propose a novel robustness metric. We empirically\ninvestigate whether robustness transfers across different p-norms and derive\nconclusions on which p-norm corruptions a model should be trained and\nevaluated. We find that training data augmentation with a combination of p-norm\ncorruptions significantly improves corruption robustness, even on top of\nstate-of-the-art data augmentation schemes.\n","authors":["Georg Siedel","Weijia Shao","Silvia Vock","Andrey Morozov"],"pdf_url":"https://arxiv.org/pdf/2305.05400v3.pdf","comment":"Camera-ready version submitted to VISAPP 2024"},{"id":"http://arxiv.org/abs/2312.14619v1","updated":"2023-12-22T11:26:51Z","published":"2023-12-22T11:26:51Z","title":"Towards Loose-Fitting Garment Animation via Generative Model of\n Deformation Decomposition","summary":" Existing data-driven methods for garment animation, usually driven by linear\nskinning, although effective on tight garments, do not handle loose-fitting\ngarments with complex deformations well. To address these limitations, we\ndevelop a garment generative model based on deformation decomposition to\nefficiently simulate loose garment deformation without directly using linear\nskinning. Specifically, we learn a garment generative space with the proposed\ngenerative model, where we decouple the latent representation into unposed\ndeformed garments and dynamic offsets during the decoding stage. With explicit\ngarment deformations decomposition, our generative model is able to generate\ncomplex pose-driven deformations on canonical garment shapes. Furthermore, we\nlearn to transfer the body motions and previous state of the garment to the\nlatent space to regenerate dynamic results. In addition, we introduce a detail\nenhancement module in an adversarial training setup to learn high-frequency\nwrinkles. We demonstrate our method outperforms state-of-the-art data-driven\nalternatives through extensive experiments and show qualitative and\nquantitative analysis of results.\n","authors":["Yifu Liu","Xiaoxia Li","Zhiling Luo","Wei Zhou"],"pdf_url":"https://arxiv.org/pdf/2312.14619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14611v1","updated":"2023-12-22T11:13:22Z","published":"2023-12-22T11:13:22Z","title":"Tuning-Free Inversion-Enhanced Control for Consistent Image Editing","summary":" Consistent editing of real images is a challenging task, as it requires\nperforming non-rigid edits (e.g., changing postures) to the main objects in the\ninput image without changing their identity or attributes. To guarantee\nconsistent attributes, some existing methods fine-tune the entire model or the\ntextual embedding for structural consistency, but they are time-consuming and\nfail to perform non-rigid edits. Other works are tuning-free, but their\nperformances are weakened by the quality of Denoising Diffusion Implicit Model\n(DDIM) reconstruction, which often fails in real-world scenarios. In this\npaper, we present a novel approach called Tuning-free Inversion-enhanced\nControl (TIC), which directly correlates features from the inversion process\nwith those from the sampling process to mitigate the inconsistency in DDIM\nreconstruction. Specifically, our method effectively obtains inversion features\nfrom the key and value features in the self-attention layers, and enhances the\nsampling process by these inversion features, thus achieving accurate\nreconstruction and content-consistent editing. To extend the applicability of\nour method to general editing scenarios, we also propose a mask-guided\nattention concatenation strategy that combines contents from both the inversion\nand the naive DDIM editing processes. Experiments show that the proposed method\noutperforms previous works in reconstruction and consistent editing, and\nproduces impressive results in various settings.\n","authors":["Xiaoyue Duan","Shuhao Cui","Guoliang Kang","Baochang Zhang","Zhengcong Fei","Mingyuan Fan","Junshi Huang"],"pdf_url":"https://arxiv.org/pdf/2312.14611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14606v1","updated":"2023-12-22T11:03:12Z","published":"2023-12-22T11:03:12Z","title":"Explainable Multi-Camera 3D Object Detection with Transformer-Based\n Saliency Maps","summary":" Vision Transformers (ViTs) have achieved state-of-the-art results on various\ncomputer vision tasks, including 3D object detection. However, their end-to-end\nimplementation also makes ViTs less explainable, which can be a challenge for\ndeploying them in safety-critical applications, such as autonomous driving,\nwhere it is important for authorities, developers, and users to understand the\nmodel's reasoning behind its predictions. In this paper, we propose a novel\nmethod for generating saliency maps for a DetR-like ViT with multiple camera\ninputs used for 3D object detection. Our method is based on the raw attention\nand is more efficient than gradient-based methods. We evaluate the proposed\nmethod on the nuScenes dataset using extensive perturbation tests and show that\nit outperforms other explainability methods in terms of visual quality and\nquantitative metrics. We also demonstrate the importance of aggregating\nattention across different layers of the transformer. Our work contributes to\nthe development of explainable AI for ViTs, which can help increase trust in AI\napplications by establishing more transparency regarding the inner workings of\nAI models.\n","authors":["Till Beemelmanns","Wassim Zahr","Lutz Eckstein"],"pdf_url":"https://arxiv.org/pdf/2312.14606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17602v2","updated":"2023-12-22T10:56:14Z","published":"2023-06-30T12:22:41Z","title":"S.T.A.R.-Track: Latent Motion Models for End-to-End 3D Object Tracking\n with Adaptive Spatio-Temporal Appearance Representations","summary":" Following the tracking-by-attention paradigm, this paper introduces an\nobject-centric, transformer-based framework for tracking in 3D. Traditional\nmodel-based tracking approaches incorporate the geometric effect of object- and\nego motion between frames with a geometric motion model. Inspired by this, we\npropose S.T.A.R.-Track, which uses a novel latent motion model (LMM) to\nadditionally adjust object queries to account for changes in viewing direction\nand lighting conditions directly in the latent space, while still modeling the\ngeometric motion explicitly. Combined with a novel learnable track embedding\nthat aids in modeling the existence probability of tracks, this results in a\ngeneric tracking framework that can be integrated with any query-based\ndetector. Extensive experiments on the nuScenes benchmark demonstrate the\nbenefits of our approach, showing \\ac{sota} performance for DETR3D-based\ntrackers while drastically reducing the number of identity switches of tracks\nat the same time.\n","authors":["Simon Doll","Niklas Hanselmann","Lukas Schneider","Richard Schulz","Markus Enzweiler","Hendrik P. A. Lensch"],"pdf_url":"https://arxiv.org/pdf/2306.17602v2.pdf","comment":"\\c{opyright} 2023 IEEE. Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2312.14579v1","updated":"2023-12-22T10:15:15Z","published":"2023-12-22T10:15:15Z","title":"Environment-Specific People","summary":" Despite significant progress in generative image synthesis and full-body\ngeneration in particular, state-of-the-art methods are either\ncontext-independent, overly reliant to text prompts, or bound to the curated\ntraining datasets, such as fashion images with monotonous backgrounds. Here,\nour goal is to generate people in clothing that is semantically appropriate for\na given scene. To this end, we present ESP, a novel method for context-aware\nfull-body generation, that enables photo-realistic inpainting of people into\nexisting \"in-the-wild\" photographs. ESP is conditioned on a 2D pose and\ncontextual cues that are extracted from the environment photograph and\nintegrated into the generation process. Our models are trained on a dataset\ncontaining a set of in-the-wild photographs of people covering a wide range of\ndifferent environments. The method is analyzed quantitatively and\nqualitatively, and we show that ESP outperforms state-of-the-art on the task of\ncontextual full-body generation.\n","authors":["Mirela Ostrek","Soubhik Sanyal","Carol O'Sullivan","Michael J. Black","Justus Thies"],"pdf_url":"https://arxiv.org/pdf/2312.14579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14577v1","updated":"2023-12-22T10:13:10Z","published":"2023-12-22T10:13:10Z","title":"PoseViNet: Distracted Driver Action Recognition Framework Using\n Multi-View Pose Estimation and Vision Transformer","summary":" Driver distraction is a principal cause of traffic accidents. In a study\nconducted by the National Highway Traffic Safety Administration, engaging in\nactivities such as interacting with in-car menus, consuming food or beverages,\nor engaging in telephonic conversations while operating a vehicle can be\nsignificant sources of driver distraction. From this viewpoint, this paper\nintroduces a novel method for detection of driver distraction using multi-view\ndriver action images. The proposed method is a vision transformer-based\nframework with pose estimation and action inference, namely PoseViNet. The\nmotivation for adding posture information is to enable the transformer to focus\nmore on key features. As a result, the framework is more adept at identifying\ncritical actions. The proposed framework is compared with various\nstate-of-the-art models using SFD3 dataset representing 10 behaviors of\ndrivers. It is found from the comparison that the PoseViNet outperforms these\nmodels. The proposed framework is also evaluated with the SynDD1 dataset\nrepresenting 16 behaviors of driver. As a result, the PoseViNet achieves 97.55%\nvalidation accuracy and 90.92% testing accuracy with the challenging dataset.\n","authors":["Neha Sengar","Indra Kumari","Jihui Lee","Dongsoo Har"],"pdf_url":"https://arxiv.org/pdf/2312.14577v1.pdf","comment":"This is revised draft submitted to IEEE Sensors Journal"},{"id":"http://arxiv.org/abs/2312.14574v1","updated":"2023-12-22T10:10:50Z","published":"2023-12-22T10:10:50Z","title":"MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning","summary":" Prompt learning has demonstrated impressive efficacy in the fine-tuning of\nmultimodal large models to a wide range of downstream tasks. Nonetheless,\napplying existing prompt learning methods for the diagnosis of neurological\ndisorder still suffers from two issues: (i) existing methods typically treat\nall patches equally, despite the fact that only a small number of patches in\nneuroimaging are relevant to the disease, and (ii) they ignore the structural\ninformation inherent in the brain connection network which is crucial for\nunderstanding and diagnosing neurological disorders. To tackle these issues, we\nintroduce a novel prompt learning model by learning graph prompts during the\nfine-tuning process of multimodal large models for diagnosing neurological\ndisorders. Specifically, we first leverage GPT-4 to obtain relevant disease\nconcepts and compute semantic similarity between these concepts and all\npatches. Secondly, we reduce the weight of irrelevant patches according to the\nsemantic similarity between each patch and disease-related concepts. Moreover,\nwe construct a graph among tokens based on these concepts and employ a graph\nconvolutional network layer to extract the structural information of the graph,\nwhich is used to prompt the pre-trained multimodal large models for diagnosing\nneurological disorders. Extensive experiments demonstrate that our method\nachieves superior performance for neurological disorder diagnosis compared with\nstate-of-the-art methods and validated by clinicians.\n","authors":["Liang Peng","Songyue Cai","Zongqian Wu","Huifang Shang","Xiaofeng Zhu","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2312.14574v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15216v4","updated":"2023-12-22T10:06:48Z","published":"2023-08-29T11:12:53Z","title":"On-the-Fly Guidance Training for Medical Image Registration","summary":" This research explores a novel approach in the realm of learning-based image\nregistration, addressing the limitations inherent in weakly-supervised and\nunsupervised methods. Weakly-supervised techniques depend heavily on scarce\nlabeled data, while unsupervised strategies rely on indirect measures of\naccuracy through image similarity. Notably, traditional supervised learning is\nnot utilized due to the lack of precise deformation ground-truth in medical\nimaging. Our study introduces a unique training framework with On-the-Fly\nGuidance (OFG) to enhance existing models. This framework, during training,\ngenerates pseudo-ground truth a few steps ahead by refining the current\ndeformation prediction with our custom optimizer. This pseudo-ground truth then\nserves to directly supervise the model in a supervised learning context. The\nprocess involves optimizing the predicted deformation with a limited number of\nsteps, ensuring training efficiency and setting achievable goals for each\ntraining phase. OFG notably boosts the precision of existing image registration\ntechniques while maintaining the speed of learning-based methods. We assessed\nour approach using various pseudo-ground truth generation strategies, including\npredictions and optimized outputs from established registration models. Our\nexperiments spanned three benchmark datasets and three cutting-edge models,\nwith OFG demonstrating significant and consistent enhancements, surpassing\nprevious state-of-the-arts in the field. OFG offers an easily integrable\nplug-and-play solution to enhance the training effectiveness of learning-based\nimage registration models. Code at\nhttps://github.com/miraclefactory/on-the-fly-guidance.\n","authors":["Yicheng Chen","Shengxiang Ji","Yuelin Xin","Kun Han","Xiaohui Xie"],"pdf_url":"https://arxiv.org/pdf/2308.15216v4.pdf","comment":"12 pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2311.06000v3","updated":"2023-12-22T10:04:48Z","published":"2023-11-10T11:23:28Z","title":"Keystroke Verification Challenge (KVC): Biometric and Fairness Benchmark\n Evaluation","summary":" Analyzing keystroke dynamics (KD) for biometric verification has several\nadvantages: it is among the most discriminative behavioral traits; keyboards\nare among the most common human-computer interfaces, being the primary means\nfor users to enter textual data; its acquisition does not require additional\nhardware, and its processing is relatively lightweight; and it allows for\ntransparently recognizing subjects. However, the heterogeneity of experimental\nprotocols and metrics, and the limited size of the databases adopted in the\nliterature impede direct comparisons between different systems, thus\nrepresenting an obstacle in the advancement of keystroke biometrics. To\nalleviate this aspect, we present a new experimental framework to benchmark\nKD-based biometric verification performance and fairness based on tweet-long\nsequences of variable transcript text from over 185,000 subjects, acquired\nthrough desktop and mobile keyboards, extracted from the Aalto Keystroke\nDatabases. The framework runs on CodaLab in the form of the Keystroke\nVerification Challenge (KVC). Moreover, we also introduce a novel fairness\nmetric, the Skewed Impostor Ratio (SIR), to capture inter- and\nintra-demographic group bias patterns in the verification scores. We\ndemonstrate the usefulness of the proposed framework by employing two\nstate-of-the-art keystroke verification systems, TypeNet and TypeFormer, to\ncompare different sets of input features, achieving a less privacy-invasive\nsystem, by discarding the analysis of text content (ASCII codes of the keys\npressed) in favor of extended features in the time domain. Our experiments show\nthat this approach allows to maintain satisfactory performance.\n","authors":["Giuseppe Stragapede","Ruben Vera-Rodriguez","Ruben Tolosana","Aythami Morales","Naser Damer","Julian Fierrez","Javier Ortega-Garcia"],"pdf_url":"https://arxiv.org/pdf/2311.06000v3.pdf","comment":"13 pages, 4 figure, 5 pages"},{"id":"http://arxiv.org/abs/2312.14570v1","updated":"2023-12-22T10:00:32Z","published":"2023-12-22T10:00:32Z","title":"BSS-Bench: Towards Reproducible and Effective Band Selection Search","summary":" The key technology to overcome the drawbacks of hyperspectral imaging\n(expensive, high capture delay, and low spatial resolution) and make it widely\napplicable is to select only a few representative bands from hundreds of bands.\nHowever, current band selection (BS) methods face challenges in fair\ncomparisons due to inconsistent train/validation settings, including the number\nof bands, dataset splits, and retraining settings. To make BS methods easy and\nreproducible, this paper presents the first band selection search benchmark\n(BSS-Bench) containing 52k training and evaluation records of numerous band\ncombinations (BC) with different backbones for various hyperspectral analysis\ntasks. The creation of BSS-Bench required a significant computational effort of\n1.26k GPU days. By querying BSS-Bench, BS experiments can be performed easily\nand reproducibly, and the gap between the searched result and the best\nachievable performance can be measured. Based on BSS-Bench, we further discuss\nthe impact of various factors on BS, such as the number of bands, unsupervised\nstatistics, and different backbones. In addition to BSS-Bench, we present an\neffective one-shot BS method called Single Combination One Shot (SCOS), which\nlearns the priority of any BCs through one-time training, eliminating the need\nfor repetitive retraining on different BCs. Furthermore, the search process of\nSCOS is flexible and does not require training, making it efficient and\neffective. Our extensive evaluations demonstrate that SCOS outperforms current\nBS methods on multiple tasks, even with much fewer bands. Our BSS-Bench and\ncodes are available in the supplementary material and will be publicly\navailable.\n","authors":["Wenshuai Xu","Zhenbo Xu"],"pdf_url":"https://arxiv.org/pdf/2312.14570v1.pdf","comment":"11 pages,6 figures"},{"id":"http://arxiv.org/abs/2311.09759v2","updated":"2023-12-22T09:30:39Z","published":"2023-11-16T10:32:18Z","title":"Scene Text Image Super-resolution based on Text-conditional Diffusion\n Models","summary":" Scene Text Image Super-resolution (STISR) has recently achieved great success\nas a preprocessing method for scene text recognition. STISR aims to transform\nblurred and noisy low-resolution (LR) text images in real-world settings into\nclear high-resolution (HR) text images suitable for scene text recognition. In\nthis study, we leverage text-conditional diffusion models (DMs), known for\ntheir impressive text-to-image synthesis capabilities, for STISR tasks. Our\nexperimental results revealed that text-conditional DMs notably surpass\nexisting STISR methods. Especially when texts from LR text images are given as\ninput, the text-conditional DMs are able to produce superior quality\nsuper-resolution text images. Utilizing this capability, we propose a novel\nframework for synthesizing LR-HR paired text image datasets. This framework\nconsists of three specialized text-conditional DMs, each dedicated to text\nimage synthesis, super-resolution, and image degradation. These three modules\nare vital for synthesizing distinct LR and HR paired images, which are more\nsuitable for training STISR methods. Our experiments confirmed that these\nsynthesized image pairs significantly enhance the performance of STISR methods\nin the TextZoom evaluation.\n","authors":["Chihiro Noguchi","Shun Fukuda","Masao Yamanaka"],"pdf_url":"https://arxiv.org/pdf/2311.09759v2.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2312.14556v1","updated":"2023-12-22T09:29:45Z","published":"2023-12-22T09:29:45Z","title":"CaptainCook4D: A dataset for understanding errors in procedural\n activities","summary":" Following step-by-step procedures is an essential component of various\nactivities carried out by individuals in their daily lives. These procedures\nserve as a guiding framework that helps to achieve goals efficiently, whether\nit is assembling furniture or preparing a recipe. However, the complexity and\nduration of procedural activities inherently increase the likelihood of making\nerrors. Understanding such procedural activities from a sequence of frames is a\nchallenging task that demands an accurate interpretation of visual information\nand the ability to reason about the structure of the activity. To this end, we\ncollect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings\n(94.5 hours) of people performing recipes in real kitchen environments. This\ndataset consists of two distinct types of activity: one in which participants\nadhere to the provided recipe instructions and another in which they deviate\nand induce errors. We provide 5.3K step annotations and 10K fine-grained action\nannotations and benchmark the dataset for the following tasks: supervised error\nrecognition, multistep localization, and procedure learning\n","authors":["Rohith Peddi","Shivvrat Arya","Bharath Challa","Likhitha Pallapothula","Akshay Vyas","Jikai Wang","Qifan Zhang","Vasundhara Komaragiri","Eric Ragan","Nicholas Ruozzi","Yu Xiang","Vibhav Gogate"],"pdf_url":"https://arxiv.org/pdf/2312.14556v1.pdf","comment":"Accepted to the 2023 International Conference on Machine\n Learning(ICML) workshop on Data-centric Machine Learning Research(DMLR),\n Project Page: https://captaincook4d.github.io/captain-cook/"},{"id":"http://arxiv.org/abs/2312.07199v2","updated":"2023-12-22T09:27:38Z","published":"2023-12-12T12:07:34Z","title":"SeasFire as a Multivariate Earth System Datacube for Wildfire Dynamics","summary":" The global occurrence, scale, and frequency of wildfires pose significant\nthreats to ecosystem services and human livelihoods. To effectively quantify\nand attribute the antecedent conditions for wildfires, a thorough understanding\nof Earth system dynamics is imperative. In response, we introduce the SeasFire\ndatacube, a meticulously curated spatiotemporal dataset tailored for global\nsub-seasonal to seasonal wildfire modeling via Earth observation. The SeasFire\ndatacube comprises of 59 variables encompassing climate, vegetation, oceanic\nindices, and human factors, has an 8-day temporal resolution and a spatial\nresolution of 0.25$^{\\circ}$, and spans from 2001 to 2021. We showcase the\nversatility of SeasFire for exploring the variability and seasonality of\nwildfire drivers, modeling causal links between ocean-climate teleconnections\nand wildfires, and predicting sub-seasonal wildfire patterns across multiple\ntimescales with a Deep Learning model. We publicly release the SeasFire\ndatacube and appeal to Earth system scientists and Machine Learning\npractitioners to use it for an improved understanding and anticipation of\nwildfires.\n","authors":["Ilektra Karasante","Lazaro Alonso","Ioannis Prapas","Akanksha Ahuja","Nuno Carvalhais","Ioannis Papoutsis"],"pdf_url":"https://arxiv.org/pdf/2312.07199v2.pdf","comment":"20 pages, 9 figures, and 5 tables. Typos corrected"},{"id":"http://arxiv.org/abs/2312.13729v2","updated":"2023-12-22T09:19:03Z","published":"2023-12-21T10:52:59Z","title":"Gaussian Splatting with NeRF-based Color and Opacity","summary":" Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of\nneural networks to capture the intricacies of 3D objects. By encoding the shape\nand color information within neural network weights, NeRFs excel at producing\nstrikingly sharp novel views of 3D objects. Recently, numerous generalizations\nof NeRFs utilizing generative models have emerged, expanding its versatility.\nIn contrast, Gaussian Splatting (GS) offers a similar renders quality with\nfaster training and inference as it does not need neural networks to work. We\nencode information about the 3D objects in the set of Gaussian distributions\nthat can be rendered in 3D similarly to classical meshes. Unfortunately, GS are\ndifficult to condition since they usually require circa hundred thousand\nGaussian components. To mitigate the caveats of both models, we propose a\nhybrid model that uses GS representation of the 3D object's shape and\nNeRF-based encoding of color and opacity. Our model uses Gaussian distributions\nwith trainable positions (i.e. means of Gaussian), shape (i.e. covariance of\nGaussian), color and opacity, and neural network, which takes parameters of\nGaussian and viewing direction to produce changes in color and opacity.\nConsequently, our model better describes shadows, light reflections, and\ntransparency of 3D objects.\n","authors":["Dawid Malarz","Weronika Smolak","Jacek Tabor","Sławomir Tadeja","Przemysław Spurek"],"pdf_url":"https://arxiv.org/pdf/2312.13729v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14544v1","updated":"2023-12-22T09:15:33Z","published":"2023-12-22T09:15:33Z","title":"Inclusive normalization of face images to passport format","summary":" Face recognition has been used more and more in real world applications in\nrecent years. However, when the skin color bias is coupled with intra-personal\nvariations like harsh illumination, the face recognition task is more likely to\nfail, even during human inspection. Face normalization methods try to deal with\nsuch challenges by removing intra-personal variations from an input image while\nkeeping the identity the same. However, most face normalization methods can\nonly remove one or two variations and ignore dataset biases such as skin color\nbias. The outputs of many face normalization methods are also not realistic to\nhuman observers. In this work, a style based face normalization model\n(StyleFNM) is proposed to remove most intra-personal variations including large\nchanges in pose, bad or harsh illumination, low resolution, blur, facial\nexpressions, and accessories like sunglasses among others. The dataset bias is\nalso dealt with in this paper by controlling a pretrained GAN to generate a\nbalanced dataset of passport-like images. The experimental results show that\nStyleFNM can generate more realistic outputs and can improve significantly the\naccuracy and fairness of face recognition systems.\n","authors":["Hongliu Cao","Minh Nhat Do","Alexis Ravanel","Eoin Thomas"],"pdf_url":"https://arxiv.org/pdf/2312.14544v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.05915v3","updated":"2023-12-22T09:00:03Z","published":"2023-03-09T13:52:28Z","title":"Convolutional Cross-View Pose Estimation","summary":" We propose a novel end-to-end method for cross-view pose estimation. Given a\nground-level query image and an aerial image that covers the query's local\nneighborhood, the 3 Degrees-of-Freedom camera pose of the query is estimated by\nmatching its image descriptor to descriptors of local regions within the aerial\nimage. The orientation-aware descriptors are obtained by using a\ntranslationally equivariant convolutional ground image encoder and contrastive\nlearning. The Localization Decoder produces a dense probability distribution in\na coarse-to-fine manner with a novel Localization Matching Upsampling module. A\nsmaller Orientation Decoder produces a vector field to condition the\norientation estimate on the localization. Our method is validated on the VIGOR\nand KITTI datasets, where it surpasses the state-of-the-art baseline by 72% and\n36% in median localization error for comparable orientation estimation\naccuracy. The predicted probability distribution can represent localization\nambiguity, and enables rejecting possible erroneous predictions. Without\nre-training, the model can infer on ground images with different field of views\nand utilize orientation priors if available. On the Oxford RobotCar dataset,\nour method can reliably estimate the ego-vehicle's pose over time, achieving a\nmedian localization error under 1 meter and a median orientation error of\naround 1 degree at 14 FPS.\n","authors":["Zimin Xia","Olaf Booij","Julian F. P. Kooij"],"pdf_url":"https://arxiv.org/pdf/2303.05915v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14518v1","updated":"2023-12-22T08:31:11Z","published":"2023-12-22T08:31:11Z","title":"Joint Learning Neuronal Skeleton and Brain Circuit Topology with\n Permutation Invariant Encoders for Neuron Classification","summary":" Determining the types of neurons within a nervous system plays a significant\nrole in the analysis of brain connectomics and the investigation of\nneurological diseases. However, the efficiency of utilizing anatomical,\nphysiological, or molecular characteristics of neurons is relatively low and\ncostly. With the advancements in electron microscopy imaging and analysis\ntechniques for brain tissue, we are able to obtain whole-brain connectome\nconsisting neuronal high-resolution morphology and connectivity information.\nHowever, few models are built based on such data for automated neuron\nclassification. In this paper, we propose NeuNet, a framework that combines\nmorphological information of neurons obtained from skeleton and topological\ninformation between neurons obtained from neural circuit. Specifically, NeuNet\nconsists of three components, namely Skeleton Encoder, Connectome Encoder, and\nReadout Layer. Skeleton Encoder integrates the local information of neurons in\na bottom-up manner, with a one-dimensional convolution in neural skeleton's\npoint data; Connectome Encoder uses a graph neural network to capture the\ntopological information of neural circuit; finally, Readout Layer fuses the\nabove two information and outputs classification results. We reprocess and\nrelease two new datasets for neuron classification task from volume electron\nmicroscopy(VEM) images of human brain cortex and Drosophila brain. Experiments\non these two datasets demonstrated the effectiveness of our model with accuracy\nof 0.9169 and 0.9363, respectively. Code and data are available at:\nhttps://github.com/WHUminghui/NeuNet.\n","authors":["Minghui Liao","Guojia Wan","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2312.14518v1.pdf","comment":"18 pages,8 figures,"},{"id":"http://arxiv.org/abs/2308.08806v3","updated":"2023-12-22T08:14:14Z","published":"2023-08-17T06:32:57Z","title":"Self-distillation Regularized Connectionist Temporal Classification Loss\n for Text Recognition: A Simple Yet Effective Approach","summary":" Text recognition methods are gaining rapid development. Some advanced\ntechniques, e.g., powerful modules, language models, and un- and\nsemi-supervised learning schemes, consecutively push the performance on public\nbenchmarks forward. However, the problem of how to better optimize a text\nrecognition model from the perspective of loss functions is largely overlooked.\nCTC-based methods, widely used in practice due to their good balance between\nperformance and inference speed, still grapple with accuracy degradation. This\nis because CTC loss emphasizes the optimization of the entire sequence target\nwhile neglecting to learn individual characters. We propose a self-distillation\nscheme for CTC-based model to address this issue. It incorporates a framewise\nregularization term in CTC loss to emphasize individual supervision, and\nleverages the maximizing-a-posteriori of latent alignment to solve the\ninconsistency problem that arises in distillation between CTC-based models. We\nrefer to the regularized CTC loss as Distillation Connectionist Temporal\nClassification (DCTC) loss. DCTC loss is module-free, requiring no extra\nparameters, longer inference lag, or additional training data or phases.\nExtensive experiments on public benchmarks demonstrate that DCTC can boost text\nrecognition model accuracy by up to 2.6%, without any of these drawbacks.\n","authors":["Ziyin Zhang","Ning Lu","Minghui Liao","Yongshuai Huang","Cheng Li","Min Wang","Wei Peng"],"pdf_url":"https://arxiv.org/pdf/2308.08806v3.pdf","comment":"Ziyin Zhang and Ning Lu are co-first authors. Accepted by AAAI2024.\n Repo: https://github.com/zzyhlyoko/DCTC"},{"id":"http://arxiv.org/abs/2312.14502v1","updated":"2023-12-22T08:05:38Z","published":"2023-12-22T08:05:38Z","title":"ViStripformer: A Token-Efficient Transformer for Versatile Video\n Restoration","summary":" Video restoration is a low-level vision task that seeks to restore clean,\nsharp videos from quality-degraded frames. One would use the temporal\ninformation from adjacent frames to make video restoration successful.\nRecently, the success of the Transformer has raised awareness in the\ncomputer-vision community. However, its self-attention mechanism requires much\nmemory, which is unsuitable for high-resolution vision tasks like video\nrestoration. In this paper, we propose ViStripformer (Video Stripformer), which\nutilizes spatio-temporal strip attention to catch long-range data correlations,\nconsisting of intra-frame strip attention (Intra-SA) and inter-frame strip\nattention (Inter-SA) for extracting spatial and temporal information. It\ndecomposes video frames into strip-shaped features in horizontal and vertical\ndirections for Intra-SA and Inter-SA to address degradation patterns with\nvarious orientations and magnitudes. Besides, ViStripformer is an effective and\nefficient transformer architecture with much lower memory usage than the\nvanilla transformer. Extensive experiments show that the proposed model\nachieves superior results with fast inference time on video restoration tasks,\nincluding video deblurring, demoireing, and deraining.\n","authors":["Fu-Jen Tsai","Yan-Tsung Peng","Chen-Yu Chang","Chan-Yu Li","Yen-Yu Lin","Chung-Chi Tsai","Chia-Wen Lin"],"pdf_url":"https://arxiv.org/pdf/2312.14502v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.08655v2","updated":"2023-12-22T07:46:49Z","published":"2023-11-15T02:28:52Z","title":"Review of AlexNet for Medical Image Classification","summary":" In recent years, the rapid development of deep learning has led to a wide\nrange of applications in the field of medical image classification. The\nvariants of neural network models with ever-increasing performance share some\ncommonalities: to try to mitigate overfitting, improve generalization, avoid\ngradient vanishing and exploding, etc. AlexNet first utilizes the dropout\ntechnique to mitigate overfitting and the ReLU activation function to avoid\ngradient vanishing. Therefore, we focus our discussion on AlexNet, which has\ncontributed greatly to the development of CNNs in 2012. After reviewing over 40\npapers, including journal papers and conference papers, we give a narrative on\nthe technical details, advantages, and application areas of AlexNet.\n","authors":["Wenhao Tang","Junding Sun","Shuihua Wang","Yudong Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.08655v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14494v1","updated":"2023-12-22T07:42:00Z","published":"2023-12-22T07:42:00Z","title":"Revisiting Few-Shot Object Detection with Vision-Language Models","summary":" Few-shot object detection (FSOD) benchmarks have advanced techniques for\ndetecting new categories with limited annotations. Existing benchmarks\nrepurpose well-established datasets like COCO by partitioning categories into\nbase and novel classes for pre-training and fine-tuning respectively. However,\nthese benchmarks do not reflect how FSOD is deployed in practice. Rather than\nonly pre-training on a small number of base categories, we argue that it is\nmore practical to fine-tune a foundation model (e.g., a vision-language model\n(VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find\nthat zero-shot inference from VLMs like GroundingDINO significantly outperforms\nthe state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models\ncan still be misaligned to target concepts of interest. For example, trailers\non the web may be different from trailers in the context of autonomous\nvehicles. In this work, we propose Foundational FSOD, a new benchmark protocol\nthat evaluates detectors pre-trained on any external datasets and fine-tuned on\nK-shots per target class. Further, we note that current FSOD benchmarks are\nactually federated datasets containing exhaustive annotations for each category\non a subset of the data. We leverage this insight to propose simple strategies\nfor fine-tuning VLMs with federated losses. We demonstrate the effectiveness of\nour approach on LVIS and nuImages, improving over prior work by 5.9 AP.\n","authors":["Anish Madan","Neehar Peri","Shu Kong","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2312.14494v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14492v1","updated":"2023-12-22T07:40:43Z","published":"2023-12-22T07:40:43Z","title":"Context Enhanced Transformer for Single Image Object Detection","summary":" With the increasing importance of video data in real-world applications,\nthere is a rising need for efficient object detection methods that utilize\ntemporal information. While existing video object detection (VOD) techniques\nemploy various strategies to address this challenge, they typically depend on\nlocally adjacent frames or randomly sampled images within a clip. Although\nrecent Transformer-based VOD methods have shown promising results, their\nreliance on multiple inputs and additional network complexity to incorporate\ntemporal information limits their practical applicability. In this paper, we\npropose a novel approach to single image object detection, called Context\nEnhanced TRansformer (CETR), by incorporating temporal context into DETR using\na newly designed memory module. To efficiently store temporal information, we\nconstruct a class-wise memory that collects contextual information across data.\nAdditionally, we present a classification-based sampling technique to\nselectively utilize the relevant memory for the current image. In the testing,\nWe introduce a test-time memory adaptation method that updates individual\nmemory functions by considering the test distribution. Experiments with CityCam\nand ImageNet VID datasets exhibit the efficiency of the framework on various\nvideo systems. The project page and code will be made available at:\nhttps://ku-cvlab.github.io/CETR.\n","authors":["Seungjun An","Seonghoon Park","Gyeongnyeon Kim","Jeongyeol Baek","Byeongwon Lee","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2312.14492v1.pdf","comment":"The project page and code will be made available at:\n https://ku-cvlab.github.io/CETR"},{"id":"http://arxiv.org/abs/2312.14481v1","updated":"2023-12-22T07:17:51Z","published":"2023-12-22T07:17:51Z","title":"Part to Whole: Collaborative Prompting for Surgical Instrument\n Segmentation","summary":" Foundation models like the Segment Anything Model (SAM) have demonstrated\npromise in generic object segmentation. However, directly applying SAM to\nsurgical instrument segmentation presents key challenges. First, SAM relies on\nper-frame point-or-box prompts which complicate surgeon-computer interaction.\nAlso, SAM yields suboptimal performance on segmenting surgical instruments,\nowing to insufficient surgical data in its pre-training as well as the complex\nstructure and fine-grained details of various surgical instruments. To address\nthese challenges, in this paper, we investigate text promptable surgical\ninstrument segmentation and propose SP-SAM (SurgicalPart-SAM), a novel\nefficient-tuning approach that integrates surgical instrument structure\nknowledge with the generic segmentation knowledge of SAM. Specifically, we\nachieve this by proposing (1) collaborative prompts in the text form \"[part\nname] of [instrument category name]\" that decompose instruments into\nfine-grained parts; (2) a Cross-Modal Prompt Encoder that encodes text prompts\njointly with visual embeddings into discriminative part-level representations;\nand (3) a Part-to-Whole Selective Fusion and a Hierarchical Decoding strategy\nthat selectively assemble the part-level representations into a whole for\naccurate instrument segmentation. Built upon them, SP-SAM acquires a better\ncapability to comprehend surgical instrument structures and distinguish between\nvarious categories. Extensive experiments on both the EndoVis2018 and\nEndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with\nminimal tunable parameters. Code is at\nhttps://github.com/wenxi-yue/SurgicalPart-SAM.\n","authors":["Wenxi Yue","Jing Zhang","Kun Hu","Qiuxia Wu","Zongyuan Ge","Yong Xia","Jiebo Luo","Zhiyong Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14481v1.pdf","comment":"Technical Report. The source code will be released at\n https://github.com/wenxi-yue/SurgicalPart-SAM"},{"id":"http://arxiv.org/abs/2312.13646v2","updated":"2023-12-22T07:12:44Z","published":"2023-12-21T08:16:26Z","title":"Weakly Supervised Semantic Segmentation for Driving Scenes","summary":" State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS)\nusing image-level labels exhibit severe performance degradation on driving\nscene datasets such as Cityscapes. To address this challenge, we develop a new\nWSSS framework tailored to driving scene datasets. Based on extensive analysis\nof dataset characteristics, we employ Contrastive Language-Image Pre-training\n(CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key\nchallenges: (1) pseudo-masks from CLIP lack in representing small object\nclasses, and (2) these masks contain notable noise. We propose solutions for\neach issue as follows. (1) We devise Global-Local View Training that seamlessly\nincorporates small-scale patches during model training, thereby enhancing the\nmodel's capability to handle small-sized yet critical objects in driving scenes\n(e.g., traffic light). (2) We introduce Consistency-Aware Region Balancing\n(CARB), a novel technique that discerns reliable and noisy regions through\nevaluating the consistency between CLIP masks and segmentation predictions. It\nprioritizes reliable pixels over noisy pixels via adaptive loss weighting.\nNotably, the proposed method achieves 51.8\\% mIoU on the Cityscapes test\ndataset, showcasing its potential as a strong WSSS baseline on driving scene\ndatasets. Experimental results on CamVid and WildDash2 demonstrate the\neffectiveness of our method across diverse datasets, even with small-scale\ndatasets or visually challenging conditions. The code is available at\nhttps://github.com/k0u-id/CARB.\n","authors":["Dongseob Kim","Seungho Lee","Junsuk Choe","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2312.13646v2.pdf","comment":"AAAI 2024 accepted. First two authors contributed equally"},{"id":"http://arxiv.org/abs/2312.14474v1","updated":"2023-12-22T06:53:49Z","published":"2023-12-22T06:53:49Z","title":"MonoLSS: Learnable Sample Selection For Monocular 3D Detection","summary":" In the field of autonomous driving, monocular 3D detection is a critical task\nwhich estimates 3D properties (depth, dimension, and orientation) of objects in\na single RGB image. Previous works have used features in a heuristic way to\nlearn 3D properties, without considering that inappropriate features could have\nadverse effects. In this paper, sample selection is introduced that only\nsuitable samples should be trained to regress the 3D properties. To select\nsamples adaptively, we propose a Learnable Sample Selection (LSS) module, which\nis based on Gumbel-Softmax and a relative-distance sample divider. The LSS\nmodule works under a warm-up strategy leading to an improvement in training\nstability. Additionally, since the LSS module dedicated to 3D property sample\nselection relies on object-level features, we further develop a data\naugmentation method named MixUp3D to enrich 3D property samples which conforms\nto imaging principles without introducing ambiguity. As two orthogonal methods,\nthe LSS module and MixUp3D can be utilized independently or in conjunction.\nSufficient experiments have shown that their combined use can lead to\nsynergistic effects, yielding improvements that transcend the mere sum of their\nindividual applications. Leveraging the LSS module and the MixUp3D, without any\nextra data, our method named MonoLSS ranks 1st in all three categories (Car,\nCyclist, and Pedestrian) on KITTI 3D object detection benchmark, and achieves\ncompetitive results on both the Waymo dataset and KITTI-nuScenes cross-dataset\nevaluation. The code is included in the supplementary material and will be\nreleased to facilitate related academic and industrial studies.\n","authors":["Zhenjia Li","Jinrang Jia","Yifeng Shi"],"pdf_url":"https://arxiv.org/pdf/2312.14474v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14471v1","updated":"2023-12-22T06:49:44Z","published":"2023-12-22T06:49:44Z","title":"Prototype-based Cross-Modal Object Tracking","summary":" Cross-modal object tracking is an important research topic in the field of\ninformation fusion, and it aims to address imaging limitations in challenging\nscenarios by integrating switchable visible and near-infrared modalities.\nHowever, existing tracking methods face some difficulties in adapting to\nsignificant target appearance variations in the presence of modality switch.\nFor instance, model update based tracking methods struggle to maintain stable\ntracking results during modality switching, leading to error accumulation and\nmodel drift. Template based tracking methods solely rely on the template\ninformation from first frame and/or last frame, which lacks sufficient\nrepresentation ability and poses challenges in handling significant target\nappearance changes. To address this problem, we propose a prototype-based\ncross-modal object tracker called ProtoTrack, which introduces a novel\nprototype learning scheme to adapt to significant target appearance variations,\nfor cross-modal object tracking. In particular, we design a multi-modal\nprototype to represent target information by multi-kind samples, including a\nfixed sample from the first frame and two representative samples from different\nmodalities. Moreover, we develop a prototype generation algorithm based on two\nnew modules to ensure the prototype representative in different\nchallenges......\n","authors":["Lei Liu","Chenglong Li","Futian Wang","Longfeng Shen","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2312.14471v1.pdf","comment":"In Peer Review"},{"id":"http://arxiv.org/abs/2306.06209v2","updated":"2023-12-22T06:43:18Z","published":"2023-05-11T10:05:57Z","title":"Backdoor Attack with Sparse and Invisible Trigger","summary":" Deep neural networks (DNNs) are vulnerable to backdoor attacks, where the\nadversary manipulates a small portion of training data such that the victim\nmodel predicts normally on the benign samples but classifies the triggered\nsamples as the target class. The backdoor attack is an emerging yet threatening\ntraining-phase threat, leading to serious risks in DNN-based applications. In\nthis paper, we revisit the trigger patterns of existing backdoor attacks. We\nreveal that they are either visible or not sparse and therefore are not\nstealthy enough. More importantly, it is not feasible to simply combine\nexisting methods to design an effective sparse and invisible backdoor attack.\nTo address this problem, we formulate the trigger generation as a bi-level\noptimization problem with sparsity and invisibility constraints and propose an\neffective method to solve it. The proposed method is dubbed sparse and\ninvisible backdoor attack (SIBA). We conduct extensive experiments on benchmark\ndatasets under different settings, which verify the effectiveness of our attack\nand its resistance to existing backdoor defenses. The codes for reproducing\nmain experiments are available at \\url{https://github.com/YinghuaGao/SIBA}.\n","authors":["Yinghua Gao","Yiming Li","Xueluan Gong","Zhifeng Li","Shu-Tao Xia","Qian Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06209v2.pdf","comment":"The first two authors contributed equally to this work. 13 pages"},{"id":"http://arxiv.org/abs/2312.14465v1","updated":"2023-12-22T06:34:23Z","published":"2023-12-22T06:34:23Z","title":"FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for\n Open-Vocabulary 3D Detection","summary":" The superior performances of pre-trained foundation models in various visual\ntasks underscore their potential to enhance the 2D models' open-vocabulary\nability. Existing methods explore analogous applications in the 3D space.\nHowever, most of them only center around knowledge extraction from singular\nfoundation models, which limits the open-vocabulary ability of 3D models. We\nhypothesize that leveraging complementary pre-trained knowledge from various\nfoundation models can improve knowledge transfer from 2D pre-trained visual\nlanguage models to the 3D space. In this work, we propose FM-OV3D, a method of\nFoundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D\nDetection, which improves the open-vocabulary localization and recognition\nabilities of 3D model by blending knowledge from multiple pre-trained\nfoundation models, achieving true open-vocabulary without facing constraints\nfrom original 3D datasets. Specifically, to learn the open-vocabulary 3D\nlocalization ability, we adopt the open-vocabulary localization knowledge of\nthe Grounded-Segment-Anything model. For open-vocabulary 3D recognition\nability, We leverage the knowledge of generative foundation models, including\nGPT-3 and Stable Diffusion models, and cross-modal discriminative models like\nCLIP. The experimental results on two popular benchmarks for open-vocabulary 3D\nobject detection show that our model efficiently learns knowledge from multiple\nfoundation models to enhance the open-vocabulary ability of the 3D model and\nsuccessfully achieves state-of-the-art performance in open-vocabulary 3D object\ndetection tasks. Code is released at\nhttps://github.com/dmzhang0425/FM-OV3D.git.\n","authors":["Dongmei Zhang","Chang Li","Ray Zhang","Shenghao Xie","Wei Xue","Xiaodong Xie","Shanghang Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.14465v1.pdf","comment":"Accepted by AAAI 2024. Code will be released at\n https://github.com/dmzhang0425/FM-OV3D.git"},{"id":"http://arxiv.org/abs/2312.13913v2","updated":"2023-12-22T06:27:43Z","published":"2023-12-21T15:01:47Z","title":"Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models","summary":" This paper presents Paint3D, a novel coarse-to-fine generative framework that\nis capable of producing high-resolution, lighting-less, and diverse 2K UV\ntexture maps for untextured 3D meshes conditioned on text or image inputs. The\nkey challenge addressed is generating high-quality textures without embedded\nillumination information, which allows the textures to be re-lighted or\nre-edited within modern graphics pipelines. To achieve this, our method first\nleverages a pre-trained depth-aware 2D diffusion model to generate\nview-conditional images and perform multi-view texture fusion, producing an\ninitial coarse texture map. However, as 2D models cannot fully represent 3D\nshapes and disable lighting effects, the coarse texture map exhibits incomplete\nareas and illumination artifacts. To resolve this, we train separate UV\nInpainting and UVHD diffusion models specialized for the shape-aware refinement\nof incomplete areas and the removal of illumination artifacts. Through this\ncoarse-to-fine process, Paint3D can produce high-quality 2K UV textures that\nmaintain semantic consistency while being lighting-less, significantly\nadvancing the state-of-the-art in texturing 3D objects.\n","authors":["Xianfang Zeng","Xin Chen","Zhongqi Qi","Wen Liu","Zibo Zhao","Zhibin Wang","Bin Fu","Yong Liu","Gang Yu"],"pdf_url":"https://arxiv.org/pdf/2312.13913v2.pdf","comment":"Project Website: https://github.com/OpenTexture/Paint3D"},{"id":"http://arxiv.org/abs/2312.14457v1","updated":"2023-12-22T06:15:03Z","published":"2023-12-22T06:15:03Z","title":"QUAR-VLA: Vision-Language-Action Model for Quadruped Robots","summary":" The important manifestation of robot intelligence is the ability to naturally\ninteract and autonomously make decisions. Traditional approaches to robot\ncontrol often compartmentalize perception, planning, and decision-making,\nsimplifying system design but limiting the synergy between different\ninformation streams. This compartmentalization poses challenges in achieving\nseamless autonomous reasoning, decision-making, and action execution. To\naddress these limitations, a novel paradigm, named Vision-Language-Action tasks\nfor QUAdruped Robots (QUAR-VLA), has been introduced in this paper. This\napproach tightly integrates visual information and instructions to generate\nexecutable actions, effectively merging perception, planning, and\ndecision-making. The central idea is to elevate the overall intelligence of the\nrobot. Within this framework, a notable challenge lies in aligning fine-grained\ninstructions with visual perception information. This emphasizes the complexity\ninvolved in ensuring that the robot accurately interprets and acts upon\ndetailed instructions in harmony with its visual observations. Consequently, we\npropose QUAdruped Robotic Transformer (QUART), a family of VLA models to\nintegrate visual information and instructions from diverse modalities as input\nand generates executable actions for real-world robots and present QUAdruped\nRobot Dataset (QUARD), a large-scale multi-task dataset including navigation,\ncomplex terrain locomotion, and whole-body manipulation tasks for training\nQUART models. Our extensive evaluation (4000 evaluation trials) shows that our\napproach leads to performant robotic policies and enables QUART to obtain a\nrange of emergent capabilities.\n","authors":["Pengxiang Ding","Han Zhao","Zhitao Wang","Zhenyu Wei","Shangke Lyu","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.05684v2","updated":"2023-12-22T05:42:42Z","published":"2023-04-12T08:12:29Z","title":"InterGen: Diffusion-based Multi-human Motion Generation under Complex\n Interactions","summary":" We have recently seen tremendous progress in diffusion advances for\ngenerating realistic human motions. Yet, they largely disregard the multi-human\ninteractions. In this paper, we present InterGen, an effective diffusion-based\napproach that incorporates human-to-human interactions into the motion\ndiffusion process, which enables layman users to customize high-quality\ntwo-person interaction motions, with only text guidance. We first contribute a\nmultimodal dataset, named InterHuman. It consists of about 107M frames for\ndiverse two-person interactions, with accurate skeletal motions and 23,337\nnatural language descriptions. For the algorithm side, we carefully tailor the\nmotion diffusion model to our two-person interaction setting. To handle the\nsymmetry of human identities during interactions, we propose two cooperative\ntransformer-based denoisers that explicitly share weights, with a mutual\nattention mechanism to further connect the two denoising processes. Then, we\npropose a novel representation for motion input in our interaction diffusion\nmodel, which explicitly formulates the global relations between the two\nperformers in the world frame. We further introduce two novel regularization\nterms to encode spatial relations, equipped with a corresponding damping scheme\nduring the training of our interaction diffusion model. Extensive experiments\nvalidate the effectiveness and generalizability of InterGen. Notably, it can\ngenerate more diverse and compelling two-person motions than previous methods\nand enables various downstream applications for human interactions.\n","authors":["Han Liang","Wenqian Zhang","Wenxuan Li","Jingyi Yu","Lan Xu"],"pdf_url":"https://arxiv.org/pdf/2304.05684v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.02693v3","updated":"2023-12-22T05:39:11Z","published":"2023-05-04T10:09:30Z","title":"Semi-supervised Domain Adaptation via Prototype-based Multi-level\n Learning","summary":" In semi-supervised domain adaptation (SSDA), a few labeled target samples of\neach class help the model to transfer knowledge representation from the fully\nlabeled source domain to the target domain. Many existing methods ignore the\nbenefits of making full use of the labeled target samples from multi-level. To\nmake better use of this additional data, we propose a novel Prototype-based\nMulti-level Learning (ProML) framework to better tap the potential of labeled\ntarget samples. To achieve intra-domain adaptation, we first introduce a\npseudo-label aggregation based on the intra-domain optimal transport to help\nthe model align the feature distribution of unlabeled target samples and the\nprototype. At the inter-domain level, we propose a cross-domain alignment loss\nto help the model use the target prototype for cross-domain knowledge transfer.\nWe further propose a dual consistency based on prototype similarity and linear\nclassifier to promote discriminative learning of compact target feature\nrepresentation at the batch level. Extensive experiments on three datasets,\nincluding DomainNet, VisDA2017, and Office-Home demonstrate that our proposed\nmethod achieves state-of-the-art performance in SSDA.\n","authors":["Xinyang Huang","Chuang Zhu","Wenkai Chen"],"pdf_url":"https://arxiv.org/pdf/2305.02693v3.pdf","comment":"IJCAI 2023. To avoid confusion, update to a more complete version"},{"id":"http://arxiv.org/abs/2312.14446v1","updated":"2023-12-22T05:22:33Z","published":"2023-12-22T05:22:33Z","title":"Cross-Modal Object Tracking via Modality-Aware Fusion Network and A\n Large-Scale Dataset","summary":" Visual tracking often faces challenges such as invalid targets and decreased\nperformance in low-light conditions when relying solely on RGB image sequences.\nWhile incorporating additional modalities like depth and infrared data has\nproven effective, existing multi-modal imaging platforms are complex and lack\nreal-world applicability. In contrast, near-infrared (NIR) imaging, commonly\nused in surveillance cameras, can switch between RGB and NIR based on light\nintensity. However, tracking objects across these heterogeneous modalities\nposes significant challenges, particularly due to the absence of modality\nswitch signals during tracking. To address these challenges, we propose an\nadaptive cross-modal object tracking algorithm called Modality-Aware Fusion\nNetwork (MAFNet). MAFNet efficiently integrates information from both RGB and\nNIR modalities using an adaptive weighting mechanism, effectively bridging the\nappearance gap and enabling a modality-aware target representation. It consists\nof two key components: an adaptive weighting module and a modality-specific\nrepresentation module......\n","authors":["Lei Liu","Mengya Zhang","Cheng Li","Chenglong Li","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2312.14446v1.pdf","comment":"In Peer Review"},{"id":"http://arxiv.org/abs/2312.13977v2","updated":"2023-12-22T04:46:11Z","published":"2023-12-21T16:04:45Z","title":"NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse\n Input Views","summary":" Recently, neural implicit functions have demonstrated remarkable results in\nthe field of multi-view reconstruction. However, most existing methods are\ntailored for dense views and exhibit unsatisfactory performance when dealing\nwith sparse views. Several latest methods have been proposed for generalizing\nimplicit reconstruction to address the sparse view reconstruction task, but\nthey still suffer from high training costs and are merely valid under carefully\nselected perspectives. In this paper, we propose a novel sparse view\nreconstruction framework that leverages on-surface priors to achieve highly\nfaithful surface reconstruction. Specifically, we design several constraints on\nglobal geometry alignment and local geometry refinement for jointly optimizing\ncoarse shapes and fine details. To achieve this, we train a neural network to\nlearn a global implicit field from the on-surface points obtained from SfM and\nthen leverage it as a coarse geometric constraint. To exploit local geometric\nconsistency, we project on-surface points onto seen and unseen views, treating\nthe consistent loss of projected features as a fine geometric constraint. The\nexperimental results with DTU and BlendedMVS datasets in two prevalent sparse\nsettings demonstrate significant improvements over the state-of-the-art\nmethods.\n","authors":["Han Huang","Yulun Wu","Junsheng Zhou","Ge Gao","Ming Gu","Yu-Shen Liu"],"pdf_url":"https://arxiv.org/pdf/2312.13977v2.pdf","comment":"Accepted by AAAI 2024. Project page:\n https://alvin528.github.io/NeuSurf/"},{"id":"http://arxiv.org/abs/2312.14432v1","updated":"2023-12-22T04:41:31Z","published":"2023-12-22T04:41:31Z","title":"Scalable 3D Reconstruction From Single Particle X-Ray Diffraction Images\n Based on Online Machine Learning","summary":" X-ray free-electron lasers (XFELs) offer unique capabilities for measuring\nthe structure and dynamics of biomolecules, helping us understand the basic\nbuilding blocks of life. Notably, high-repetition-rate XFELs enable single\nparticle imaging (X-ray SPI) where individual, weakly scattering biomolecules\nare imaged under near-physiological conditions with the opportunity to access\nfleeting states that cannot be captured in cryogenic or crystallized\nconditions. Existing X-ray SPI reconstruction algorithms, which estimate the\nunknown orientation of a particle in each captured image as well as its shared\n3D structure, are inadequate in handling the massive datasets generated by\nthese emerging XFELs. Here, we introduce X-RAI, an online reconstruction\nframework that estimates the structure of a 3D macromolecule from large X-ray\nSPI datasets. X-RAI consists of a convolutional encoder, which amortizes pose\nestimation over large datasets, as well as a physics-based decoder, which\nemploys an implicit neural representation to enable high-quality 3D\nreconstruction in an end-to-end, self-supervised manner. We demonstrate that\nX-RAI achieves state-of-the-art performance for small-scale datasets in\nsimulation and challenging experimental settings and demonstrate its\nunprecedented ability to process large datasets containing millions of\ndiffraction images in an online fashion. These abilities signify a paradigm\nshift in X-ray SPI towards real-time capture and reconstruction.\n","authors":["Jay Shenoy","Axel Levy","Frédéric Poitevin","Gordon Wetzstein"],"pdf_url":"https://arxiv.org/pdf/2312.14432v1.pdf","comment":"Project page: http://jayshenoy.com/xrai"},{"id":"http://arxiv.org/abs/2312.14427v1","updated":"2023-12-22T04:28:43Z","published":"2023-12-22T04:28:43Z","title":"GROOD: GRadient-aware Out-Of-Distribution detection in interpolated\n manifolds","summary":" Deep neural networks (DNNs) often fail silently with over-confident\npredictions on out-of-distribution (OOD) samples, posing risks in real-world\ndeployments. Existing techniques predominantly emphasize either the feature\nrepresentation space or the gradient norms computed with respect to DNN\nparameters, yet they overlook the intricate gradient distribution and the\ntopology of classification regions. To address this gap, we introduce\nGRadient-aware Out-Of-Distribution detection in interpolated manifolds (GROOD),\na novel framework that relies on the discriminative power of gradient space to\ndistinguish between in-distribution (ID) and OOD samples. To build this space,\nGROOD relies on class prototypes together with a prototype that specifically\ncaptures OOD characteristics. Uniquely, our approach incorporates a targeted\nmix-up operation at an early intermediate layer of the DNN to refine the\nseparation of gradient spaces between ID and OOD samples. We quantify OOD\ndetection efficacy using the distance to the nearest neighbor gradients derived\nfrom the training set, yielding a robust OOD score. Experimental evaluations\nsubstantiate that the introduction of targeted input mix-upamplifies the\nseparation between ID and OOD in the gradient space, yielding impressive\nresults across diverse datasets. Notably, when benchmarked against ImageNet-1k,\nGROOD surpasses the established robustness of state-of-the-art baselines.\nThrough this work, we establish the utility of leveraging gradient spaces and\nclass prototypes for enhanced OOD detection for DNN in image classification.\n","authors":["Mostafa ElAraby","Sabyasachi Sahoo","Yann Pequignot","Paul Novello","Liam Paull"],"pdf_url":"https://arxiv.org/pdf/2312.14427v1.pdf","comment":"11 pages, 5 figures, preprint under review"},{"id":"http://arxiv.org/abs/2312.14410v1","updated":"2023-12-22T03:25:15Z","published":"2023-12-22T03:25:15Z","title":"A Multi-Stage Adaptive Feature Fusion Neural Network for Multimodal Gait\n Recognition","summary":" Gait recognition is a biometric technology that has received extensive\nattention. Most existing gait recognition algorithms are unimodal, and a few\nmultimodal gait recognition algorithms perform multimodal fusion only once.\nNone of these algorithms may fully exploit the complementary advantages of the\nmultiple modalities. In this paper, by considering the temporal and spatial\ncharacteristics of gait data, we propose a multi-stage feature fusion strategy\n(MSFFS), which performs multimodal fusions at different stages in the feature\nextraction process. Also, we propose an adaptive feature fusion module (AFFM)\nthat considers the semantic association between silhouettes and skeletons. The\nfusion process fuses different silhouette areas with their more related\nskeleton joints. Since visual appearance changes and time passage co-occur in a\ngait period, we propose a multiscale spatial-temporal feature extractor\n(MSSTFE) to learn the spatial-temporal linkage features thoroughly.\nSpecifically, MSSTFE extracts and aggregates spatial-temporal linkages\ninformation at different spatial scales. Combining the strategy and modules\nmentioned above, we propose a multi-stage adaptive feature fusion (MSAFF)\nneural network, which shows state-of-the-art performance in many experiments on\nthree datasets. Besides, MSAFF is equipped with feature dimensional pooling (FD\nPooling), which can significantly reduce the dimension of the gait\nrepresentations without hindering the accuracy.\nhttps://github.com/ShinanZou/MSAFF\n","authors":["Shinan Zou","Jianbo Xiong","Chao Fan","Shiqi Yu","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2312.14410v1.pdf","comment":"This paper has been accepted by IJCB2023"},{"id":"http://arxiv.org/abs/2312.14407v1","updated":"2023-12-22T03:18:04Z","published":"2023-12-22T03:18:04Z","title":"AdvCloak: Customized Adversarial Cloak for Privacy Protection","summary":" With extensive face images being shared on social media, there has been a\nnotable escalation in privacy concerns. In this paper, we propose AdvCloak, an\ninnovative framework for privacy protection using generative models. AdvCloak\nis designed to automatically customize class-wise adversarial masks that can\nmaintain superior image-level naturalness while providing enhanced\nfeature-level generalization ability. Specifically, AdvCloak sequentially\noptimizes the generative adversarial networks by employing a two-stage training\nstrategy. This strategy initially focuses on adapting the masks to the unique\nindividual faces via image-specific training and then enhances their\nfeature-level generalization ability to diverse facial variations of\nindividuals via person-specific training. To fully utilize the limited training\ndata, we combine AdvCloak with several general geometric modeling methods, to\nbetter describe the feature subspace of source identities. Extensive\nquantitative and qualitative evaluations on both common and celebrity datasets\ndemonstrate that AdvCloak outperforms existing state-of-the-art methods in\nterms of efficiency and effectiveness.\n","authors":["Xuannan Liu","Yaoyao Zhong","Xing Cui","Yuhang Zhang","Peipei Li","Weihong Deng"],"pdf_url":"https://arxiv.org/pdf/2312.14407v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14404v1","updated":"2023-12-22T03:09:11Z","published":"2023-12-22T03:09:11Z","title":"Cross-Covariate Gait Recognition: A Benchmark","summary":" Gait datasets are essential for gait research. However, this paper observes\nthat present benchmarks, whether conventional constrained or emerging\nreal-world datasets, fall short regarding covariate diversity. To bridge this\ngap, we undertake an arduous 20-month effort to collect a cross-covariate gait\nrecognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6\nmillion sequences; almost every subject has 33 views and 53 different\ncovariates. Compared to existing datasets, CCGR has both population and\nindividual-level diversity. In addition, the views and covariates are well\nlabeled, enabling the analysis of the effects of different factors. CCGR\nprovides multiple types of gait data, including RGB, parsing, silhouette, and\npose, offering researchers a comprehensive resource for exploration. In order\nto delve deeper into addressing cross-covariate gait recognition, we propose\nparsing-based gait recognition (ParsingGait) by utilizing the newly proposed\nparsing data. We have conducted extensive experiments. Our main results show:\n1) Cross-covariate emerges as a pivotal challenge for practical applications of\ngait recognition. 2) ParsingGait demonstrates remarkable potential for further\nadvancement. 3) Alarmingly, existing SOTA methods achieve less than 43%\naccuracy on the CCGR, highlighting the urgency of exploring cross-covariate\ngait recognition. Link: https://github.com/ShinanZou/CCGR.\n","authors":["Shinan Zou","Chao Fan","Jianbo Xiong","Chuanfu Shen","Shiqi Yu","Jin Tang"],"pdf_url":"https://arxiv.org/pdf/2312.14404v1.pdf","comment":"This paper has been accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.14400v1","updated":"2023-12-22T03:01:41Z","published":"2023-12-22T03:01:41Z","title":"Unveiling Backbone Effects in CLIP: Exploring Representational Synergies\n and Variances","summary":" Contrastive Language-Image Pretraining (CLIP) stands out as a prominent\nmethod for image representation learning. Various neural architectures,\nspanning Transformer-based models like Vision Transformers (ViTs) to\nConvolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve\nas universal backbones across diverse vision tasks. Despite utilizing the same\ndata and training objectives, the effectiveness of representations learned by\nthese architectures raises a critical question. Our investigation explores the\ndifferences in CLIP performance among these backbone architectures, revealing\nsignificant disparities in their classifications. Notably, normalizing these\nrepresentations results in substantial performance variations. Our findings\nshowcase a remarkable possible synergy between backbone predictions that could\nreach an improvement of over 20% through informed selection of the appropriate\nbackbone. Moreover, we propose a simple, yet effective approach to combine\npredictions from multiple backbones, leading to a notable performance boost of\nup to 6.34\\%. We will release the code for reproducing the results.\n","authors":["Cristian Rodriguez-Opazo","Edison Marrese-Taylor","Ehsan Abbasnejad","Hamed Damirchi","Ignacio M. Jara","Felipe Bravo-Marquez","Anton van den Hengel"],"pdf_url":"https://arxiv.org/pdf/2312.14400v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14395v1","updated":"2023-12-22T02:52:54Z","published":"2023-12-22T02:52:54Z","title":"Unsupervised Deep Learning Image Verification Method","summary":" Although deep learning are commonly employed for image recognition, usually\nhuge amount of labeled training data is required, which may not always be\nreadily available. This leads to a noticeable performance disparity when\ncompared to state-of-the-art unsupervised face verification techniques. In this\nwork, we propose a method to narrow this gap by leveraging an autoencoder to\nconvert the face image vector into a novel representation. Notably, the\nautoencoder is trained to reconstruct neighboring face image vectors rather\nthan the original input image vectors. These neighbor face image vectors are\nchosen through an unsupervised process based on the highest cosine scores with\nthe training face image vectors. The proposed method achieves a relative\nimprovement of 56\\% in terms of EER over the baseline system on Labeled Faces\nin the Wild (LFW) dataset. This has successfully narrowed down the performance\ngap between cosine and PLDA scoring systems.\n","authors":["Enoch Solomon","Abraham Woubie","Eyael Solomon Emiru"],"pdf_url":"https://arxiv.org/pdf/2312.14395v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14389v1","updated":"2023-12-22T02:32:19Z","published":"2023-12-22T02:32:19Z","title":"StyleRetoucher: Generalized Portrait Image Retouching with GAN Priors","summary":" Creating fine-retouched portrait images is tedious and time-consuming even\nfor professional artists. There exist automatic retouching methods, but they\neither suffer from over-smoothing artifacts or lack generalization ability. To\naddress such issues, we present StyleRetoucher, a novel automatic portrait\nimage retouching framework, leveraging StyleGAN's generation and generalization\nability to improve an input portrait image's skin condition while preserving\nits facial details. Harnessing the priors of pretrained StyleGAN, our method\nshows superior robustness: a). performing stably with fewer training samples\nand b). generalizing well on the out-domain data. Moreover, by blending the\nspatial features of the input image and intermediate features of the StyleGAN\nlayers, our method preserves the input characteristics to the largest extent.\nWe further propose a novel blemish-aware feature selection mechanism to\neffectively identify and remove the skin blemishes, improving the image skin\ncondition. Qualitative and quantitative evaluations validate the great\ngeneralization capability of our method. Further experiments show\nStyleRetoucher's superior performance to the alternative solutions in the image\nretouching task. We also conduct a user perceptive study to confirm the\nsuperior retouching performance of our method over the existing\nstate-of-the-art alternatives.\n","authors":["Wanchao Su","Can Wang","Chen Liu","Hangzhou Han","Hongbo Fu","Jing Liao"],"pdf_url":"https://arxiv.org/pdf/2312.14389v1.pdf","comment":"13 pages, 15 figures"},{"id":"http://arxiv.org/abs/2312.14387v1","updated":"2023-12-22T02:31:31Z","published":"2023-12-22T02:31:31Z","title":"Variance-insensitive and Target-preserving Mask Refinement for\n Interactive Image Segmentation","summary":" Point-based interactive image segmentation can ease the burden of mask\nannotation in applications such as semantic segmentation and image editing.\nHowever, fully extracting the target mask with limited user inputs remains\nchallenging. We introduce a novel method, Variance-Insensitive and\nTarget-Preserving Mask Refinement to enhance segmentation quality with fewer\nuser inputs. Regarding the last segmentation result as the initial mask, an\niterative refinement process is commonly employed to continually enhance the\ninitial mask. Nevertheless, conventional techniques suffer from sensitivity to\nthe variance in the initial mask. To circumvent this problem, our proposed\nmethod incorporates a mask matching algorithm for ensuring consistent\ninferences from different types of initial masks. We also introduce a\ntarget-aware zooming algorithm to preserve object information during\ndownsampling, balancing efficiency and accuracy. Experiments on GrabCut,\nBerkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art\nperformance in interactive image segmentation.\n","authors":["Chaowei Fang","Ziyin Zhou","Junye Chen","Hanjing Su","Qingyao Wu","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2312.14387v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.13771v2","updated":"2023-12-22T02:29:17Z","published":"2023-12-21T11:52:45Z","title":"AppAgent: Multimodal Agents as Smartphone Users","summary":" Recent advancements in large language models (LLMs) have led to the creation\nof intelligent agents capable of performing complex tasks. This paper\nintroduces a novel LLM-based multimodal agent framework designed to operate\nsmartphone applications. Our framework enables the agent to operate smartphone\napplications through a simplified action space, mimicking human-like\ninteractions such as tapping and swiping. This novel approach bypasses the need\nfor system back-end access, thereby broadening its applicability across diverse\napps. Central to our agent's functionality is its innovative learning method.\nThe agent learns to navigate and use new apps either through autonomous\nexploration or by observing human demonstrations. This process generates a\nknowledge base that the agent refers to for executing complex tasks across\ndifferent applications. To demonstrate the practicality of our agent, we\nconducted extensive testing over 50 tasks in 10 different applications,\nincluding social media, email, maps, shopping, and sophisticated image editing\ntools. The results affirm our agent's proficiency in handling a diverse array\nof high-level tasks.\n","authors":["Chi Zhang","Zhao Yang","Jiaxuan Liu","Yucheng Han","Xin Chen","Zebiao Huang","Bin Fu","Gang Yu"],"pdf_url":"https://arxiv.org/pdf/2312.13771v2.pdf","comment":"Project Page is https://appagent-official.github.io/"},{"id":"http://arxiv.org/abs/2312.14383v1","updated":"2023-12-22T02:19:23Z","published":"2023-12-22T02:19:23Z","title":"Removing Interference and Recovering Content Imaginatively for Visible\n Watermark Removal","summary":" Visible watermarks, while instrumental in protecting image copyrights,\nfrequently distort the underlying content, complicating tasks like scene\ninterpretation and image editing. Visible watermark removal aims to eliminate\nthe interference of watermarks and restore the background content. However,\nexisting methods often implement watermark component removal and background\nrestoration tasks within a singular branch, leading to residual watermarks in\nthe predictions and ignoring cases where watermarks heavily obscure the\nbackground. To address these limitations, this study introduces the Removing\nInterference and Recovering Content Imaginatively (RIRCI) framework. RIRCI\nembodies a two-stage approach: the initial phase centers on discerning and\nsegregating the watermark component, while the subsequent phase focuses on\nbackground content restoration. To achieve meticulous background restoration,\nour proposed model employs a dual-path network capable of fully exploring the\nintrinsic background information beneath semi-transparent watermarks and\nperipheral contextual information from unaffected regions. Moreover, a Global\nand Local Context Interaction module is built upon multi-layer perceptrons and\nbidirectional feature transformation for comprehensive representation modeling\nin the background restoration phase. The efficacy of our approach is\nempirically validated across two large-scale datasets, and our findings reveal\na marked enhancement over existing watermark removal techniques.\n","authors":["Yicheng Leng","Chaowei Fang","Gen Li","Yixiang Fang","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2312.14383v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.13091v2","updated":"2023-12-22T02:06:32Z","published":"2023-12-20T15:12:53Z","title":"MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using\n Differentiable Shading","summary":" Reconstructing an avatar from a portrait image has many applications in\nmultimedia, but remains a challenging research problem. Extracting reflectance\nmaps and geometry from one image is ill-posed: recovering geometry is a\none-to-many mapping problem and reflectance and light are difficult to\ndisentangle. Accurate geometry and reflectance can be captured under the\ncontrolled conditions of a light stage, but it is costly to acquire large\ndatasets in this fashion. Moreover, training solely with this type of data\nleads to poor generalization with in-the-wild images. This motivates the\nintroduction of MoSAR, a method for 3D avatar generation from monocular images.\nWe propose a semi-supervised training scheme that improves generalization by\nlearning from both light stage and in-the-wild datasets. This is achieved using\na novel differentiable shading formulation. We show that our approach\neffectively disentangles the intrinsic face parameters, producing relightable\navatars. As a result, MoSAR estimates a richer set of skin reflectance maps,\nand generates more realistic avatars than existing state-of-the-art methods. We\nalso introduce a new dataset, named FFHQ-UV-Intrinsics, the first public\ndataset providing intrinsic face attributes at scale (diffuse, specular,\nambient occlusion and translucency maps) for a total of 10k subjects. The\nproject website and the dataset are available on the following link:\nhttps://ubisoft-laforge.github.io/character/mosar/\n","authors":["Abdallah Dib","Luiz Gustavo Hafemann","Emeline Got","Trevor Anderson","Amin Fadaeinejad","Rafael M. O. Cruz","Marc-Andre Carbonneau"],"pdf_url":"https://arxiv.org/pdf/2312.13091v2.pdf","comment":"https://ubisoft-laforge.github.io/character/mosar/"},{"id":"http://arxiv.org/abs/2312.14373v1","updated":"2023-12-22T01:48:09Z","published":"2023-12-22T01:48:09Z","title":"Learning Socio-Temporal Graphs for Multi-Agent Trajectory Prediction","summary":" In order to predict a pedestrian's trajectory in a crowd accurately, one has\nto take into account her/his underlying socio-temporal interactions with other\npedestrians consistently. Unlike existing work that represents the relevant\ninformation separately, partially, or implicitly, we propose a complete\nrepresentation for it to be fully and explicitly captured and analyzed. In\nparticular, we introduce a Directed Acyclic Graph-based structure, which we\nterm Socio-Temporal Graph (STG), to explicitly capture pair-wise socio-temporal\ninteractions among a group of people across both space and time. Our model is\nbuilt on a time-varying generative process, whose latent variables determine\nthe structure of the STGs. We design an attention-based model named STGformer\nthat affords an end-to-end pipeline to learn the structure of the STGs for\ntrajectory prediction. Our solution achieves overall state-of-the-art\nprediction accuracy in two large-scale benchmark datasets. Our analysis shows\nthat a person's past trajectory is critical for predicting another person's\nfuture path. Our model learns this relationship with a strong notion of\nsocio-temporal localities. Statistics show that utilizing this information\nexplicitly for prediction yields a noticeable performance gain with respect to\nthe trajectory-only approaches.\n","authors":["Yuke Li","Lixiong Chen","Guangyi Chen","Ching-Yao Chan","Kun Zhang","Stefano Anzellotti","Donglai Wei"],"pdf_url":"https://arxiv.org/pdf/2312.14373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07884v2","updated":"2023-12-22T01:33:20Z","published":"2023-12-13T04:06:18Z","title":"Mutual-Learning Knowledge Distillation for Nighttime UAV Tracking","summary":" Nighttime unmanned aerial vehicle (UAV) tracking has been facilitated with\nindispensable plug-and-play low-light enhancers. However, the introduction of\nlow-light enhancers increases the extra computational burden for the UAV,\nsignificantly hindering the development of real-time UAV applications.\nMeanwhile, these state-of-the-art (SOTA) enhancers lack tight coupling with the\nadvanced daytime UAV tracking approach. To solve the above issues, this work\nproposes a novel mutual-learning knowledge distillation framework for nighttime\nUAV tracking, i.e., MLKD. This framework is constructed to learn a compact and\nfast nighttime tracker via knowledge transferring from the teacher and\nknowledge sharing among various students. Specifically, an advanced teacher\nbased on a SOTA enhancer and a superior tracking backbone is adopted for\nguiding the student based only on the tight coupling-aware tracking backbone to\ndirectly extract nighttime object features. To address the biased learning of a\nsingle student, diverse lightweight students with different distillation\nmethods are constructed to focus on various aspects of the teacher's knowledge.\nMoreover, an innovative mutual-learning room is designed to elect the superior\nstudent candidate to assist the remaining students frame-by-frame in the\ntraining phase. Furthermore, the final best student, i.e., MLKD-Track, is\nselected through the testing dataset. Extensive experiments demonstrate the\neffectiveness and superiority of MLKD and MLKD-Track. The practicality of the\nMLKD-Track is verified in real-world tests with different challenging\nsituations. The code is available at https://github.com/lyfeng001/MLKD.\n","authors":["Yufeng Liu"],"pdf_url":"https://arxiv.org/pdf/2312.07884v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2312.13434v2","updated":"2023-12-22T14:43:46Z","published":"2023-12-20T21:20:23Z","title":"Zero-1-to-3: Domain-level Zero-shot Cognitive Diagnosis via One Batch of\n Early-bird Students towards Three Diagnostic Objectives","summary":" Cognitive diagnosis seeks to estimate the cognitive states of students by\nexploring their logged practice quiz data. It plays a pivotal role in\npersonalized learning guidance within intelligent education systems. In this\npaper, we focus on an important, practical, yet often underexplored task:\ndomain-level zero-shot cognitive diagnosis (DZCD), which arises due to the\nabsence of student practice logs in newly launched domains. Recent cross-domain\ndiagnostic models have been demonstrated to be a promising strategy for DZCD.\nThese methods primarily focus on how to transfer student states across domains.\nHowever, they might inadvertently incorporate non-transferable information into\nstudent representations, thereby limiting the efficacy of knowledge transfer.\nTo tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive\ndiagnosis framework via one batch of early-bird students towards three\ndiagnostic objectives. Our approach initiates with pre-training a diagnosis\nmodel with dual regularizers, which decouples student states into domain-shared\nand domain-specific parts. The shared cognitive signals can be transferred to\nthe target domain, enriching the cognitive priors for the new domain, which\nensures the cognitive state propagation objective. Subsequently, we devise a\nstrategy to generate simulated practice logs for cold-start students through\nanalyzing the behavioral patterns from early-bird students, fulfilling the\ndomain-adaption goal. Consequently, we refine the cognitive states of\ncold-start students as diagnostic outcomes via virtual data, aligning with the\ndiagnosis-oriented goal. Finally, extensive experiments on six real-world\ndatasets highlight the efficacy of our model for DZCD and its practical\napplication in question recommendation.\n","authors":["Weibo Gao","Qi Liu","Hao Wang","Linan Yue","Haoyang Bi","Yin Gu","Fangzhou Yao","Zheng Zhang","Xin Li","Yuanjing He"],"pdf_url":"https://arxiv.org/pdf/2312.13434v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.09901v2","updated":"2023-12-22T12:32:11Z","published":"2023-12-15T15:53:45Z","title":"Temporally and Distributionally Robust Optimization for Cold-start\n Recommendation","summary":" Collaborative Filtering (CF) recommender models highly depend on user-item\ninteractions to learn CF representations, thus falling short of recommending\ncold-start items. To address this issue, prior studies mainly introduce item\nfeatures (e.g., thumbnails) for cold-start item recommendation. They learn a\nfeature extractor on warm-start items to align feature representations with\ninteractions, and then leverage the feature extractor to extract the feature\nrepresentations of cold-start items for interaction prediction. Unfortunately,\nthe features of cold-start items, especially the popular ones, tend to diverge\nfrom those of warm-start ones due to temporal feature shifts, preventing the\nfeature extractor from accurately learning feature representations of\ncold-start items.\n To alleviate the impact of temporal feature shifts, we consider using\nDistributionally Robust Optimization (DRO) to enhance the generation ability of\nthe feature extractor. Nonetheless, existing DRO methods face an inconsistency\nissue: the worse-case warm-start items emphasized during DRO training might not\nalign well with the cold-start item distribution. To capture the temporal\nfeature shifts and combat this inconsistency issue, we propose a novel temporal\nDRO with new optimization objectives, namely, 1) to integrate a worst-case\nfactor to improve the worst-case performance, and 2) to devise a shifting\nfactor to capture the shifting trend of item features and enhance the\noptimization of the potentially popular groups in cold-start items. Substantial\nexperiments on three real-world datasets validate the superiority of our\ntemporal DRO in enhancing the generalization ability of cold-start recommender\nmodels. The code is available at https://github.com/Linxyhaha/TDRO/.\n","authors":["Xinyu Lin","Wenjie Wang","Jujia Zhao","Yongqi Li","Fuli Feng","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2312.09901v2.pdf","comment":"Accepted by AAAI'24"},{"id":"http://arxiv.org/abs/2312.13695v2","updated":"2023-12-22T11:47:54Z","published":"2023-12-21T09:45:43Z","title":"Unexplored Frontiers: A Review of Empirical Studies of Exploratory\n Search","summary":" This article reviews how empirical research of exploratory search is\nconducted. We investigated aspects of interdisciplinarity, study settings and\nevaluation methodologies from a systematically selected sample of 231\npublications from 2010-2021, including a total of 172 articles with empirical\nstudies. Our results show that exploratory search is highly interdisciplinary,\nwith the most frequently occurring publication venues including high impact\nvenues in information science, information systems and human-computer\ninteraction. However, taken in aggregate, the breadth of study settings\ninvestigated was limited. We found that a majority of studies (77%) focused on\nevaluating novel retrieval systems as opposed to investigating users' search\nprocesses. Furthermore, a disproportionate number of studies were based on\nscientific literature search (20.7%), a majority of which only considered\nsearching for Computer Science articles. Study participants were generally from\nconvenience samples, with 75% of studies composed exclusively of students and\nother academics. The methodologies used for evaluation were mostly\nquantitative, but lacked consistency between studies and validated\nquestionnaires were rarely used. In discussion, we offer a critical analysis of\nour findings and suggest potential improvements for future exploratory search\nstudies.\n","authors":["Alan Medlar","Denis Kotkov","Dorota Glowacka"],"pdf_url":"https://arxiv.org/pdf/2312.13695v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15464v6","updated":"2023-12-22T10:22:07Z","published":"2023-07-28T10:34:47Z","title":"Framework to Automatically Determine the Quality of Open Data Catalogs","summary":" Data catalogs play a crucial role in modern data-driven organizations by\nfacilitating the discovery, understanding, and utilization of diverse data\nassets. However, ensuring their quality and reliability is complex, especially\nin open and large-scale data environments. This paper proposes a framework to\nautomatically determine the quality of open data catalogs, addressing the need\nfor efficient and reliable quality assessment mechanisms. Our framework can\nanalyze various core quality dimensions, such as accuracy, completeness,\nconsistency, scalability, and timeliness, offer several alternatives for the\nassessment of compatibility and similarity across such catalogs as well as the\nimplementation of a set of non-core quality dimensions such as provenance,\nreadability, and licensing. The goal is to empower data-driven organizations to\nmake informed decisions based on trustworthy and well-curated data assets. The\nsource code that illustrates our approach can be downloaded from\nhttps://www.github.com/jorge-martinez-gil/dataq/.\n","authors":["Jorge Martinez-Gil"],"pdf_url":"https://arxiv.org/pdf/2307.15464v6.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2312.14533v1","updated":"2023-12-22T08:58:42Z","published":"2023-12-22T08:58:42Z","title":"Multi-view user representation learning for user matching without\n personal information","summary":" As the digitization of travel industry accelerates, analyzing and\nunderstanding travelers' behaviors becomes increasingly important. However,\ntraveler data frequently exhibit high data sparsity due to the relatively low\nfrequency of user interactions with travel providers. Compounding this effect\nthe multiplication of devices, accounts and platforms while browsing travel\nproducts online also leads to data dispersion. To deal with these challenges,\nprobabilistic traveler matching can be used. Most existing solutions for user\nmatching are not suitable for traveler matching as a traveler's browsing\nhistory is typically short and URLs in the travel industry are very\nheterogeneous with many tokens. To deal with these challenges, we propose the\nsimilarity based multi-view information fusion to learn a better user\nrepresentation from URLs by treating the URLs as multi-view data. The\nexperimental results show that the proposed multi-view user representation\nlearning can take advantage of the complementary information from different\nviews, highlight the key information in URLs and perform significantly better\nthan other representation learning solutions for the user matching task.\n","authors":["Hongliu Cao","Ilias El Baamrani","Eoin Thomas"],"pdf_url":"https://arxiv.org/pdf/2312.14533v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09049v3","updated":"2023-12-22T08:29:43Z","published":"2023-11-15T15:39:33Z","title":"Adapting Large Language Models by Integrating Collaborative Semantics\n for Recommendation","summary":" Recently, large language models (LLMs) have shown great potential in\nrecommender systems, either improving existing recommendation models or serving\nas the backbone. However, there exists a large semantic gap between LLMs and\nrecommender systems, since items to be recommended are often indexed by\ndiscrete identifiers (item ID) out of the LLM's vocabulary. In essence, LLMs\ncapture language semantics while recommender systems imply collaborative\nsemantics, making it difficult to sufficiently leverage the model capacity of\nLLMs for recommendation. To address this challenge, in this paper, we propose a\nnew LLM-based recommendation model called LC-Rec, which can better integrate\nlanguage and collaborative semantics for recommender systems. Our approach can\ndirectly generate items from the entire item set for recommendation, without\nrelying on candidate items. Specifically, we make two major contributions in\nour approach. For item indexing, we design a learning-based vector quantization\nmethod with uniform semantic mapping, which can assign meaningful and\nnon-conflicting IDs (called item indices) for items. For alignment tuning, we\npropose a series of specially designed tuning tasks to enhance the integration\nof collaborative semantics in LLMs. Our fine-tuning tasks enforce LLMs to\ndeeply integrate language and collaborative semantics (characterized by the\nlearned item indices), so as to achieve an effective adaptation to recommender\nsystems. Extensive experiments demonstrate the effectiveness of our method,\nshowing that our approach can outperform a number of competitive baselines\nincluding traditional recommenders and existing LLM-based recommenders. Our\ncode is available at https://github.com/RUCAIBox/LC-Rec/.\n","authors":["Bowen Zheng","Yupeng Hou","Hongyu Lu","Yu Chen","Wayne Xin Zhao","Ming Chen","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2311.09049v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14447v1","updated":"2023-12-22T05:23:56Z","published":"2023-12-22T05:23:56Z","title":"On the Effectiveness of Unlearning in Session-Based Recommendation","summary":" Session-based recommendation predicts users' future interests from previous\ninteractions in a session. Despite the memorizing of historical samples, the\nrequest of unlearning, i.e., to remove the effect of certain training samples,\nalso occurs for reasons such as user privacy or model fidelity. However,\nexisting studies on unlearning are not tailored for the session-based\nrecommendation. On the one hand, these approaches cannot achieve satisfying\nunlearning effects due to the collaborative correlations and sequential\nconnections between the unlearning item and the remaining items in the session.\nOn the other hand, seldom work has conducted the research to verify the\nunlearning effectiveness in the session-based recommendation scenario. In this\npaper, we propose SRU, a session-based recommendation unlearning framework,\nwhich enables high unlearning efficiency, accurate recommendation performance,\nand improved unlearning effectiveness in session-based recommendation.\nSpecifically, we first partition the training sessions into separate sub-models\naccording to the similarity across the sessions, then we utilize an\nattention-based aggregation layer to fuse the hidden states according to the\ncorrelations between the session and the centroid of the data in the sub-model.\nTo improve the unlearning effectiveness, we further propose three extra data\ndeletion strategies, including collaborative extra deletion (CED), neighbor\nextra deletion (NED), and random extra deletion (RED). Besides, we propose an\nevaluation metric that measures whether the unlearning sample can be inferred\nafter the data deletion to verify the unlearning effectiveness. We implement\nSRU with three representative session-based recommendation models and conduct\nexperiments on three benchmark datasets. Experimental results demonstrate the\neffectiveness of our methods.\n","authors":["Xin Xin","Liu Yang","Ziqi Zhao","Pengjie Ren","Zhumin Chen","Jun Ma","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2312.14447v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.14433v1","updated":"2023-12-22T04:46:21Z","published":"2023-12-22T04:46:21Z","title":"Attribute-driven Disentangled Representation Learning for Multimodal\n Recommendation","summary":" Recommendation algorithms forecast user preferences by correlating user and\nitem representations derived from historical interaction patterns. In pursuit\nof enhanced performance, many methods focus on learning robust and independent\nrepresentations by disentangling the intricate factors within interaction data\nacross various modalities in an unsupervised manner. However, such an approach\nobfuscates the discernment of how specific factors (e.g., category or brand)\ninfluence the outcomes, making it challenging to regulate their effects. In\nresponse to this challenge, we introduce a novel method called Attribute-Driven\nDisentangled Representation Learning (short for AD-DRL), which explicitly\nincorporates attributes from different modalities into the disentangled\nrepresentation learning process. By assigning a specific attribute to each\nfactor in multimodal features, AD-DRL can disentangle the factors at both\nattribute and attribute-value levels. To obtain robust and independent\nrepresentations for each factor associated with a specific attribute, we first\ndisentangle the representations of features both within and across different\nmodalities. Moreover, we further enhance the robustness of the representations\nby fusing the multimodal features of the same factor. Empirical evaluations\nconducted on three public real-world datasets substantiate the effectiveness of\nAD-DRL, as well as its interpretability and controllability.\n","authors":["Zhenyang Li","Fan Liu","Yinwei Wei","Zhiyong Cheng","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2312.14433v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2312.14925v1","updated":"2023-12-22T18:58:06Z","published":"2023-12-22T18:58:06Z","title":"A Survey of Reinforcement Learning from Human Feedback","summary":" Reinforcement learning from human feedback (RLHF) is a variant of\nreinforcement learning (RL) that learns from human feedback instead of relying\non an engineered reward function. Building on prior work on the related setting\nof preference-based reinforcement learning (PbRL), it stands at the\nintersection of artificial intelligence and human-computer interaction. This\npositioning offers a promising avenue to enhance the performance and\nadaptability of intelligent systems while also improving the alignment of their\nobjectives with human values. The training of Large Language Models (LLMs) has\nimpressively demonstrated this potential in recent years, where RLHF played a\ndecisive role in targeting the model's capabilities toward human objectives.\nThis article provides a comprehensive overview of the fundamentals of RLHF,\nexploring the intricate dynamics between machine agents and human input. While\nrecent focus has been on RLHF for LLMs, our survey adopts a broader\nperspective, examining the diverse applications and wide-ranging impact of the\ntechnique. We delve into the core principles that underpin RLHF, shedding light\non the symbiotic relationship between algorithms and human feedback, and\ndiscuss the main research trends in the field. By synthesizing the current\nlandscape of RLHF research, this article aims to provide researchers as well as\npractitioners with a comprehensive understanding of this rapidly growing field\nof research.\n","authors":["Timo Kaufmann","Paul Weng","Viktor Bengs","Eyke Hüllermeier"],"pdf_url":"https://arxiv.org/pdf/2312.14925v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14923v1","updated":"2023-12-22T18:55:45Z","published":"2023-12-22T18:55:45Z","title":"Fast-NTK: Parameter-Efficient Unlearning for Large-Scale Models","summary":" The rapid growth of machine learning has spurred legislative initiatives such\nas ``the Right to be Forgotten,'' allowing users to request data removal. In\nresponse, ``machine unlearning'' proposes the selective removal of unwanted\ndata without the need for retraining from scratch. While the\nNeural-Tangent-Kernel-based (NTK-based) unlearning method excels in\nperformance, it suffers from significant computational complexity, especially\nfor large-scale models and datasets. Our work introduces ``Fast-NTK,'' a novel\nNTK-based unlearning algorithm that significantly reduces the computational\ncomplexity by incorporating parameter-efficient fine-tuning methods, such as\nfine-tuning batch normalization layers in a CNN or visual prompts in a vision\ntransformer. Our experimental results demonstrate scalability to much larger\nneural networks and datasets (e.g., 88M parameters; 5k images), surpassing the\nlimitations of previous full-model NTK-based approaches designed for smaller\ncases (e.g., 8M parameters; 500 images). Notably, our approach maintains a\nperformance comparable to the traditional method of retraining on the retain\nset alone. Fast-NTK can thus enable for practical and scalable NTK-based\nunlearning in deep neural networks.\n","authors":["Guihong Li","Hsiang Hsu","Chun-Fu Chen","Radu Marculescu"],"pdf_url":"https://arxiv.org/pdf/2312.14923v1.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2312.14922v1","updated":"2023-12-22T18:55:25Z","published":"2023-12-22T18:55:25Z","title":"Learning from higher-order statistics, efficiently: hypothesis tests,\n random features, and neural networks","summary":" Neural networks excel at discovering statistical patterns in high-dimensional\ndata sets. In practice, higher-order cumulants, which quantify the non-Gaussian\ncorrelations between three or more variables, are particularly important for\nthe performance of neural networks. But how efficient are neural networks at\nextracting features from higher-order cumulants? We study this question in the\nspiked cumulant model, where the statistician needs to recover a privileged\ndirection or \"spike\" from the order-$p\\ge 4$ cumulants of~$d$-dimensional\ninputs. We first characterise the fundamental statistical and computational\nlimits of recovering the spike by analysing the number of samples~$n$ required\nto strongly distinguish between inputs from the spiked cumulant model and\nisotropic Gaussian inputs. We find that statistical distinguishability requires\n$n\\gtrsim d$ samples, while distinguishing the two distributions in polynomial\ntime requires $n \\gtrsim d^2$ samples for a wide class of algorithms, i.e.\nthose covered by the low-degree conjecture. These results suggest the existence\nof a wide statistical-to-computational gap in this problem. Numerical\nexperiments show that neural networks learn to distinguish the two\ndistributions with quadratic sample complexity, while \"lazy\" methods like\nrandom features are not better than random guessing in this regime. Our results\nshow that neural networks extract information from higher-order correlations in\nthe spiked cumulant model efficiently, and reveal a large gap in the amount of\ndata required by neural networks and random features to learn from higher-order\ncumulants.\n","authors":["Eszter Székely","Lorenzo Bardone","Federica Gerace","Sebastian Goldt"],"pdf_url":"https://arxiv.org/pdf/2312.14922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14920v1","updated":"2023-12-22T18:53:02Z","published":"2023-12-22T18:53:02Z","title":"A Novel Sampled Clustering Algorithm for Rice Phenotypic Data","summary":" Phenotypic (or Physical) characteristics of plant species are commonly used\nto perform clustering. In one of our recent works (Shastri et al. (2021)), we\nused a probabilistically sampled (using pivotal sampling) and spectrally\nclustered algorithm to group soybean species. These techniques were used to\nobtain highly accurate clusterings at a reduced cost. In this work, we extend\nthe earlier algorithm to cluster rice species. We improve the base algorithm in\nthree ways. First, we propose a new function to build the similarity matrix in\nSpectral Clustering. Commonly, a natural exponential function is used for this\npurpose. Based upon the spectral graph theory and the involved Cheeger's\ninequality, we propose the use a base \"a\" exponential function instead. This\ngives a similarity matrix spectrum favorable for clustering, which we support\nvia an eigenvalue analysis.\n Second, the function used to build the similarity matrix in Spectral\nClustering was earlier scaled with a fixed factor (called global scaling).\nBased upon the idea of Zelnik-Manor and Perona (2004), we now use a factor that\nvaries with matrix elements (called local scaling) and works better. Third, to\ncompute the inclusion probability of a specie in the pivotal sampling\nalgorithm, we had earlier used the notion of deviation that captured how far\nspecie's characteristic values were from their respective base values (computed\nover all species). A maximum function was used before to find the base values.\nWe now use a median function, which is more intuitive. We support this choice\nusing a statistical analysis. With experiments on 1865 rice species, we\ndemonstrate that in terms of silhouette values, our new Sampled Spectral\nClustering is 61% better than Hierarchical Clustering (currently prevalent).\nAlso, our new algorithm is significantly faster than Hierarchical Clustering\ndue to the involved sampling.\n","authors":["Mithun Singh","Kapil Ahuja","Milind B. Ratnaparkhe"],"pdf_url":"https://arxiv.org/pdf/2312.14920v1.pdf","comment":"20 Pages, 2 Figures, 6 Tables"},{"id":"http://arxiv.org/abs/2312.14919v1","updated":"2023-12-22T18:51:50Z","published":"2023-12-22T18:51:50Z","title":"Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using\n transformers","summary":" Combining complementary sensor modalities is crucial to providing robust\nperception for safety-critical robotics applications such as autonomous driving\n(AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on\nmonocular depth estimation which is a notoriously difficult task compared to\nusing depth information from the lidar directly. Here, we find that this\napproach does not leverage depth as expected and show that naively improving\ndepth estimation does not lead to improvements in object detection performance\nand that, strikingly, removing depth estimation altogether does not degrade\nobject detection performance. This suggests that relying on monocular depth\ncould be an unnecessary architectural bottleneck during camera-lidar fusion. In\nthis work, we introduce a novel fusion method that bypasses monocular depth\nestimation altogether and instead selects and fuses camera and lidar features\nin a bird's-eye-view grid using a simple attention mechanism. We show that our\nmodel can modulate its use of camera features based on the availability of\nlidar features and that it yields better 3D object detection on the nuScenes\ndataset than baselines relying on monocular depth estimation.\n","authors":["James Gunn","Zygmunt Lenyk","Anuj Sharma","Andrea Donati","Alexandru Buburuzan","John Redford","Romain Mueller"],"pdf_url":"https://arxiv.org/pdf/2312.14919v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06585v3","updated":"2023-12-22T18:33:50Z","published":"2023-12-11T18:17:43Z","title":"Beyond Human Data: Scaling Self-Training for Problem-Solving with\n Language Models","summary":" Fine-tuning language models~(LMs) on human-generated data remains a prevalent\npractice. However, the performance of such models is often limited by the\nquantity and diversity of high-quality human data. In this paper, we explore\nwhether we can go beyond human data on tasks where we have access to scalar\nfeedback, for example, on math problems where one can verify correctness. To do\nso, we investigate a simple self-training method based on\nexpectation-maximization, which we call ReST$^{EM}$, where we (1) generate\nsamples from the model and filter them using binary feedback, (2) fine-tune the\nmodel on these samples, and (3) repeat this process a few times. Testing on\nadvanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find\nthat ReST$^{EM}$ scales favorably with model size and significantly surpasses\nfine-tuning only on human data. Overall, our findings suggest self-training\nwith feedback can substantially reduce dependence on human-generated data.\n","authors":["Avi Singh","John D. Co-Reyes","Rishabh Agarwal","Ankesh Anand","Piyush Patil","Xavier Garcia","Peter J. Liu","James Harrison","Jaehoon Lee","Kelvin Xu","Aaron Parisi","Abhishek Kumar","Alex Alemi","Alex Rizkowsky","Azade Nova","Ben Adlam","Bernd Bohnet","Gamaleldin Elsayed","Hanie Sedghi","Igor Mordatch","Isabelle Simpson","Izzeddin Gur","Jasper Snoek","Jeffrey Pennington","Jiri Hron","Kathleen Kenealy","Kevin Swersky","Kshiteej Mahajan","Laura Culp","Lechao Xiao","Maxwell L. Bileschi","Noah Constant","Roman Novak","Rosanne Liu","Tris Warkentin","Yundi Qian","Yamini Bansal","Ethan Dyer","Behnam Neyshabur","Jascha Sohl-Dickstein","Noah Fiedel"],"pdf_url":"https://arxiv.org/pdf/2312.06585v3.pdf","comment":"First three authors contributed equally"},{"id":"http://arxiv.org/abs/2312.14895v1","updated":"2023-12-22T18:16:13Z","published":"2023-12-22T18:16:13Z","title":"FAST: Feature Aware Similarity Thresholding for Weak Unlearning in\n Black-Box Generative Models","summary":" The heightened emphasis on the regulation of deep generative models,\npropelled by escalating concerns pertaining to privacy and compliance with\nregulatory frameworks, underscores the imperative need for precise control\nmechanisms over these models. This urgency is particularly underscored by\ninstances in which generative models generate outputs that encompass\nobjectionable, offensive, or potentially injurious content. In response,\nmachine unlearning has emerged to selectively forget specific knowledge or\nremove the influence of undesirable data subsets from pre-trained models.\nHowever, modern machine unlearning approaches typically assume access to model\nparameters and architectural details during unlearning, which is not always\nfeasible. In multitude of downstream tasks, these models function as black-box\nsystems, with inaccessible pre-trained parameters, architectures, and training\ndata. In such scenarios, the possibility of filtering undesired outputs becomes\na practical alternative. The primary goal of this study is twofold: first, to\nelucidate the relationship between filtering and unlearning processes, and\nsecond, to formulate a methodology aimed at mitigating the display of\nundesirable outputs generated from models characterized as black-box systems.\nTheoretical analysis in this study demonstrates that, in the context of\nblack-box models, filtering can be seen as a form of weak unlearning. Our\nproposed \\textbf{\\textit{Feature Aware Similarity Thresholding(FAST)}} method\neffectively suppresses undesired outputs by systematically encoding the\nrepresentation of unwanted features in the latent space.\n","authors":["Subhodip Panda","Prathosh AP"],"pdf_url":"https://arxiv.org/pdf/2312.14895v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14891v1","updated":"2023-12-22T18:09:20Z","published":"2023-12-22T18:09:20Z","title":"DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus\n Images","summary":" Diabetic retinopathy (DR) is a prevalent complication of diabetes associated\nwith a significant risk of vision loss. Timely identification is critical to\ncurb vision impairment. Algorithms for DR staging from digital fundus images\n(DFIs) have been recently proposed. However, models often fail to generalize\ndue to distribution shifts between the source domain on which the model was\ntrained and the target domain where it is deployed. A common and particularly\nchallenging shift is often encountered when the source- and target-domain\nsupports do not fully overlap. In this research, we introduce DRStageNet, a\ndeep learning model designed to mitigate this challenge. We used seven publicly\navailable datasets, comprising a total of 93,534 DFIs that cover a variety of\npatient demographics, ethnicities, geographic origins and comorbidities. We\nfine-tune DINOv2, a pretrained model of self-supervised vision transformer, and\nimplement a multi-source domain fine-tuning strategy to enhance generalization\nperformance. We benchmark and demonstrate the superiority of our method to two\nstate-of-the-art benchmarks, including a recently published foundation model.\nWe adapted the grad-rollout method to our regression task in order to provide\nhigh-resolution explainability heatmaps. The error analysis showed that 59\\% of\nthe main errors had incorrect reference labels. DRStageNet is accessible at URL\n[upon acceptance of the manuscript].\n","authors":["Yevgeniy Men","Jonathan Fhima","Leo Anthony Celi","Lucas Zago Ribeiro","Luis Filipe Nakayama","Joachim A. Behar"],"pdf_url":"https://arxiv.org/pdf/2312.14891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14890v1","updated":"2023-12-22T18:07:44Z","published":"2023-12-22T18:07:44Z","title":"NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language\n Models via Complexity Classes","summary":" Complex reasoning ability is one of the most important features of current\nLLMs, which has also been leveraged to play an integral role in complex\ndecision-making tasks. Therefore, the investigation into the reasoning\ncapabilities of Large Language Models (LLMs) is critical: numerous benchmarks\nhave been established to assess the reasoning abilities of LLMs. However,\ncurrent benchmarks are inadequate in offering a rigorous evaluation of the full\nextent of reasoning abilities that LLMs are capable of achieving. They are also\nprone to the risk of overfitting, as these benchmarks, being publicly\naccessible and static, allow models to potentially tailor their responses to\nspecific benchmark metrics, thereby inflating their performance. Addressing\nthese limitations, our research introduces a new benchmark, named NPHardEval.\nThis benchmark is designed to evaluate the reasoning abilities of LLMs across a\nbroad spectrum of 900 algorithmic questions, extending up to the NP-Hard\ncomplexity class. These questions are meticulously chosen to represent a wide\nrange of complexity class below the NP-hard complexity class, offering a\nrigorous measure of the reasoning ability of LLMs. Through this study, we shed\nlight on the current state of reasoning in LLMs, providing an objective and\nrigorous perspective through the comparison of LLMs' performance across complex\nclasses. Moreover, this benchmark is designed with a dynamic update mechanism,\nwhere the datapoints are refreshed on a monthly basis. Such regular updates\nplay a crucial role in mitigating the risk of LLMs overfitting to the\nbenchmark, promoting a more accurate and reliable assessment of their reasoning\ncapabilities. The benchmark dataset and code of NPHardEval are available at\nhttps://github.com/casmlab/NPHardEval.\n","authors":["Lizhou Fan","Wenyue Hua","Lingyao Li","Haoyang Ling","Yongfeng Zhang","Libby Hemphill"],"pdf_url":"https://arxiv.org/pdf/2312.14890v1.pdf","comment":"22 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2307.16184v2","updated":"2023-12-22T18:07:41Z","published":"2023-07-30T09:48:36Z","title":"UnIVAL: Unified Model for Image, Video, Audio and Language Tasks","summary":" Large Language Models (LLMs) have made the ambitious quest for generalist\nagents significantly far from being a fantasy. A key hurdle for building such\ngeneral models is the diversity and heterogeneity of tasks and modalities. A\npromising solution is unification, allowing the support of a myriad of tasks\nand modalities within one unified framework. While few large models (e.g.,\nFlamingo (Alayrac et al., 2022), trained on massive datasets, can support more\nthan two modalities, current small to mid-scale unified models are still\nlimited to 2 modalities, usually image-text or video-text. The question that we\nask is: is it possible to build efficiently a unified model that can support\nall modalities? To answer this, we propose UnIVAL, a step further towards this\nambitious goal. Without relying on fancy datasets sizes or models with billions\nof parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities\nand unifies text, images, video, and audio into a single model. Our model is\nefficiently pretrained on many tasks, based on task balancing and multimodal\ncurriculum learning. UnIVAL shows competitive performance to existing\nstate-of-the-art approaches, across image and video-text tasks. The feature\nrepresentations learned from image and video-text modalities, allows the model\nto achieve competitive performance when finetuned on audio-text tasks, despite\nnot being pretrained on audio. Thanks to the unified model, we propose a novel\nstudy on multimodal model merging via weight interpolation of models trained on\ndifferent multimodal tasks, showing their benefits in particular for\nout-of-distribution generalization. Finally, we motivate unification by showing\nthe synergy between tasks. The model weights and code are released here:\nhttps://github.com/mshukor/UnIVAL.\n","authors":["Mustafa Shukor","Corentin Dancette","Alexandre Rame","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2307.16184v2.pdf","comment":"Accepted at TMLR 2023. 40 pages. Project page:\n https://unival-model.github.io/"},{"id":"http://arxiv.org/abs/2312.14889v1","updated":"2023-12-22T18:07:18Z","published":"2023-12-22T18:07:18Z","title":"On rate-optimal classification from non-private and from private data","summary":" In this paper we revisit the classical problem of classification, but impose\nprivacy constraints. Under such constraints, the raw data\n$(X_1,Y_1),\\ldots,(X_n,Y_n)$ cannot be directly observed, and all classifiers\nare functions of the randomised outcome of a suitable local differential\nprivacy mechanism. The statistician is free to choose the form of this privacy\nmechanism, and here we add Laplace distributed noise to a discretisation of the\nlocation of each feature vector $X_i$ and to its label $Y_i$. The\nclassification rule is the privatized version of the well-studied partitioning\nclassification rule. In addition to the standard Lipschitz and margin\nconditions, a novel characteristic is introduced, by which the exact rate of\nconvergence of the classification error probability is calculated, both for\nnon-private and private data.\n","authors":["Balázs Csanád Csáji","László Györfi","Ambrus Tamás"],"pdf_url":"https://arxiv.org/pdf/2312.14889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14886v1","updated":"2023-12-22T18:05:18Z","published":"2023-12-22T18:05:18Z","title":"Sample Path Regularity of Gaussian Processes from the Covariance Kernel","summary":" Gaussian processes (GPs) are the most common formalism for defining\nprobability distributions over spaces of functions. While applications of GPs\nare myriad, a comprehensive understanding of GP sample paths, i.e. the function\nspaces over which they define a probability measure on, is lacking. In\npractice, GPs are not constructed through a probability measure, but instead\nthrough a mean function and a covariance kernel. In this paper we provide\nnecessary and sufficient conditions on the covariance kernel for the sample\npaths of the corresponding GP to attain a given regularity. We use the\nframework of H\\\"older regularity as it grants us particularly straightforward\nconditions, which simplify further in the cases of stationary and isotropic\nGPs. We then demonstrate that our results allow for novel and unusually tight\ncharacterisations of the sample path regularities of the GPs commonly used in\nmachine learning applications, such as the Mat\\'ern GPs.\n","authors":["Nathaël Da Costa","Marvin Pförtner","Lancelot Da Costa","Philipp Hennig"],"pdf_url":"https://arxiv.org/pdf/2312.14886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14880v1","updated":"2023-12-22T18:00:17Z","published":"2023-12-22T18:00:17Z","title":"SutraNets: Sub-series Autoregressive Networks for Long-Sequence,\n Probabilistic Forecasting","summary":" We propose SutraNets, a novel method for neural probabilistic forecasting of\nlong-sequence time series. SutraNets use an autoregressive generative model to\nfactorize the likelihood of long sequences into products of conditional\nprobabilities. When generating long sequences, most autoregressive approaches\nsuffer from harmful error accumulation, as well as challenges in modeling\nlong-distance dependencies. SutraNets treat long, univariate prediction as\nmultivariate prediction over lower-frequency sub-series. Autoregression\nproceeds across time and across sub-series in order to ensure coherent\nmultivariate (and, hence, high-frequency univariate) outputs. Since sub-series\ncan be generated using fewer steps, SutraNets effectively reduce error\naccumulation and signal path distances. We find SutraNets to significantly\nimprove forecasting accuracy over competitive alternatives on six real-world\ndatasets, including when we vary the number of sub-series and scale up the\ndepth and width of the underlying sequence models.\n","authors":["Shane Bergsma","Timothy Zeyl","Lei Guo"],"pdf_url":"https://arxiv.org/pdf/2312.14880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14878v1","updated":"2023-12-22T17:57:57Z","published":"2023-12-22T17:57:57Z","title":"Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning","summary":" A key method for creating Artificial Intelligence (AI) agents is\nReinforcement Learning (RL). However, constructing a standalone RL policy that\nmaps perception to action directly encounters severe problems, chief among them\nbeing its lack of generality across multiple tasks and the need for a large\namount of training data. The leading cause is that it cannot effectively\nintegrate prior information into the perception-action cycle when devising the\npolicy. Large language models (LLMs) emerged as a fundamental way to\nincorporate cross-domain knowledge into AI agents but lack crucial learning and\nadaptation toward specific decision problems. This paper presents a general\nframework model for integrating and learning structured reasoning into AI\nagents' policies. Our methodology is motivated by the modularity found in the\nhuman brain. The framework utilises the construction of intrinsic and extrinsic\nfunctions to add previous understandings of reasoning structures. It also\nprovides the adaptive ability to learn models inside every module or function,\nconsistent with the modular structure of cognitive processes. We describe the\nframework in-depth and compare it with other AI pipelines and existing\nframeworks. The paper explores practical applications, covering experiments\nthat show the effectiveness of our method. Our results indicate that AI agents\nperform and adapt far better when organised reasoning and prior knowledge are\nembedded. This opens the door to more resilient and general AI agent systems.\n","authors":["Filippos Christianos","Georgios Papoudakis","Matthieu Zimmer","Thomas Coste","Zhihao Wu","Jingxuan Chen","Khyati Khandelwal","James Doran","Xidong Feng","Jiacheng Liu","Zheng Xiong","Yicheng Luo","Jianye Hao","Kun Shao","Haitham Bou-Ammar","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14878v1.pdf","comment":"paper and appendix, 27 pages"},{"id":"http://arxiv.org/abs/2302.06117v2","updated":"2023-12-22T17:54:55Z","published":"2023-02-13T05:52:03Z","title":"The Framework Tax: Disparities Between Inference Efficiency in NLP\n Research and Deployment","summary":" Increased focus on the computational efficiency of NLP systems has motivated\nthe design of efficient model architectures and improvements to underlying\nhardware accelerators. However, the resulting increases in computational\nthroughput and reductions in floating point operations have not directly\ntranslated to improvements in wall-clock inference latency. We demonstrate that\nthese discrepancies can be largely attributed to bottlenecks introduced by deep\nlearning frameworks. We denote this phenomenon as the \\textit{framework tax},\nand observe that the disparity is growing as hardware speed increases over\ntime. In this work, we examine this phenomenon through a series of case studies\nanalyzing the effects of model design decisions, framework paradigms, and\nhardware platforms on total model latency. Code is available at\nhttps://github.com/JaredFern/Framework-Tax.\n","authors":["Jared Fernandez","Jacob Kahn","Clara Na","Yonatan Bisk","Emma Strubell"],"pdf_url":"https://arxiv.org/pdf/2302.06117v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2306.15774v2","updated":"2023-12-22T17:53:02Z","published":"2023-06-27T19:54:30Z","title":"Next Steps for Human-Centered Generative AI: A Technical Perspective","summary":" Through iterative, cross-disciplinary discussions, we define and propose\nnext-steps for Human-centered Generative AI (HGAI). We contribute a\ncomprehensive research agenda that lays out future directions of Generative AI\nspanning three levels: aligning with human values; assimilating human intents;\nand augmenting human abilities. By identifying these next-steps, we intend to\ndraw interdisciplinary research teams to pursue a coherent set of emergent\nideas in HGAI, focusing on their interested topics while maintaining a coherent\nbig picture of the future work landscape.\n","authors":["Xiang 'Anthony' Chen","Jeff Burke","Ruofei Du","Matthew K. Hong","Jennifer Jacobs","Philippe Laban","Dingzeyu Li","Nanyun Peng","Karl D. D. Willis","Chien-Sheng Wu","Bolei Zhou"],"pdf_url":"https://arxiv.org/pdf/2306.15774v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14869v1","updated":"2023-12-22T17:46:34Z","published":"2023-12-22T17:46:34Z","title":"Spatiotemporal-Linear: Towards Universal Multivariate Time Series\n Forecasting","summary":" Within the field of complicated multivariate time series forecasting (TSF),\npopular techniques frequently rely on intricate deep learning architectures,\nranging from transformer-based designs to recurrent neural networks. However,\nrecent findings suggest that simple Linear models can surpass sophisticated\nconstructs on diverse datasets. These models directly map observation to\nmultiple future time steps, thereby minimizing error accumulation in iterative\nmulti-step prediction. Yet, these models fail to incorporate spatial and\ntemporal information within the data, which is critical for capturing patterns\nand dependencies that drive insightful predictions. This oversight often leads\nto performance bottlenecks, especially under specific sequence lengths and\ndataset conditions, preventing their universal application. In response, we\nintroduce the SpatioTemporal-Linear (STL) framework. STL seamlessly integrates\ntime-embedded and spatially-informed bypasses to augment the Linear-based\narchitecture. These extra routes offer a more robust and refined regression to\nthe data, particularly when the amount of observation is limited and the\ncapacity of simple linear layers to capture dependencies declines. Empirical\nevidence highlights STL's prowess, outpacing both Linear and Transformer\nbenchmarks across varied observation and prediction durations and datasets.\nSuch robustness accentuates its suitability across a spectrum of applications,\nincluding but not limited to, traffic trajectory and rare disease progression\nforecasting. Through this discourse, we not only validate the STL's distinctive\ncapacities to become a more general paradigm in multivariate time-series\nprediction using deep-learning techniques but also stress the need to tackle\ndata-scarce prediction scenarios for universal application. Code will be made\navailable.\n","authors":["Aiyinsi Zuo","Haixi Zhang","Zirui Li","Ce Zheng"],"pdf_url":"https://arxiv.org/pdf/2312.14869v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.09552v2","updated":"2023-12-22T17:25:44Z","published":"2023-08-18T13:33:02Z","title":"Attesting Distributional Properties of Training Data for Machine\n Learning","summary":" The success of machine learning (ML) has been accompanied by increased\nconcerns about its trustworthiness. Several jurisdictions are preparing ML\nregulatory frameworks. One such concern is ensuring that model training data\nhas desirable distributional properties for certain sensitive attributes. For\nexample, draft regulations indicate that model trainers are required to show\nthat training datasets have specific distributional properties, such as\nreflecting diversity of the population.\n We propose the notion of property attestation allowing a prover (e.g., model\ntrainer) to demonstrate relevant distributional properties of training data to\na verifier (e.g., a customer) without revealing the data. We present an\neffective hybrid property attestation combining property inference with\ncryptographic mechanisms.\n","authors":["Vasisht Duddu","Anudeep Das","Nora Khayata","Hossein Yalame","Thomas Schneider","N. Asokan"],"pdf_url":"https://arxiv.org/pdf/2308.09552v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14847v1","updated":"2023-12-22T17:19:50Z","published":"2023-12-22T17:19:50Z","title":"Large Scale Traning of Graph Neural Networks for Optimal Markov-Chain\n Partitioning Using the Kemeny Constant","summary":" Traditional clustering algorithms often struggle to capture the complex\nrelationships within graphs and generalise to arbitrary clustering criteria.\nThe emergence of graph neural networks (GNNs) as a powerful framework for\nlearning representations of graph data provides new approaches to solving the\nproblem. Previous work has shown GNNs to be capable of proposing partitionings\nusing a variety of criteria, however, these approaches have not yet been\nextended to work on Markov chains or kinetic networks. These arise frequently\nin the study of molecular systems and are of particular interest to the\nbiochemical modelling community. In this work, we propose several GNN-based\narchitectures to tackle the graph partitioning problem for Markov Chains\ndescribed as kinetic networks. This approach aims to minimize how much a\nproposed partitioning changes the Kemeny constant. We propose using an\nencoder-decoder architecture and show how simple GraphSAGE-based GNNs with\nlinear layers can outperform much larger and more expressive attention-based\nmodels in this context. As a proof of concept, we first demonstrate the\nmethod's ability to cluster randomly connected graphs. We also use a linear\nchain architecture corresponding to a 1D free energy profile as our kinetic\nnetwork. Subsequently, we demonstrate the effectiveness of our method through\nexperiments on a data set derived from molecular dynamics. We compare the\nperformance of our method to other partitioning techniques such as PCCA+. We\nexplore the importance of feature and hyperparameter selection and propose a\ngeneral strategy for large-scale parallel training of GNNs for discovering\noptimal graph partitionings.\n","authors":["Sam Alexander Martino","João Morado","Chenghao Li","Zhenghao Lu","Edina Rosta"],"pdf_url":"https://arxiv.org/pdf/2312.14847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11197v3","updated":"2023-12-22T17:14:38Z","published":"2023-08-22T05:14:42Z","title":"Toward Generalizable Machine Learning Models in Speech, Language, and\n Hearing Sciences: Estimating Sample Size and Reducing Overfitting","summary":" This study's first purpose is to provide quantitative evidence that would\nincentivize researchers to instead use the more robust method of nested\ncross-validation. The second purpose is to present methods and MATLAB codes for\ndoing power analysis for ML-based analysis during the design of a study. Monte\nCarlo simulations were used to quantify the interactions between the employed\ncross-validation method, the discriminative power of features, the\ndimensionality of the feature space, and the dimensionality of the model. Four\ndifferent cross-validations (single holdout, 10-fold, train-validation-test,\nand nested 10-fold) were compared based on the statistical power and\nstatistical confidence of the ML models. Distributions of the null and\nalternative hypotheses were used to determine the minimum required sample size\nfor obtaining a statistically significant outcome ({\\alpha}=0.05,\n1-\\b{eta}=0.8). Statistical confidence of the model was defined as the\nprobability of correct features being selected and hence being included in the\nfinal model. Our analysis showed that the model generated based on the single\nholdout method had very low statistical power and statistical confidence and\nthat it significantly overestimated the accuracy. Conversely, the nested\n10-fold cross-validation resulted in the highest statistical confidence and the\nhighest statistical power, while providing an unbiased estimate of the\naccuracy. The required sample size with a single holdout could be 50% higher\nthan what would be needed if nested cross-validation were used. Confidence in\nthe model based on nested cross-validation was as much as four times higher\nthan the confidence in the single holdout-based model. A computational model,\nMATLAB codes, and lookup tables are provided to assist researchers with\nestimating the sample size during the design of their future studies.\n","authors":["Hamzeh Ghasemzadeh","Robert E. Hillman","Daryush D. Mehta"],"pdf_url":"https://arxiv.org/pdf/2308.11197v3.pdf","comment":"Accepted at JSLHR"},{"id":"http://arxiv.org/abs/2312.14836v1","updated":"2023-12-22T17:09:34Z","published":"2023-12-22T17:09:34Z","title":"Learning Lagrangian Multipliers for the Travelling Salesman Problem","summary":" Lagrangian relaxation is a versatile mathematical technique employed to relax\nconstraints in an optimization problem, enabling the generation of dual bounds\nto prove the optimality of feasible solutions and the design of efficient\npropagators in constraint programming (such as the weighted circuit\nconstraint). However, the conventional process of deriving Lagrangian\nmultipliers (e.g., using subgradient methods) is often computationally\nintensive, limiting its practicality for large-scale or time-sensitive\nproblems. To address this challenge, we propose an innovative unsupervised\nlearning approach that harnesses the capabilities of graph neural networks to\nexploit the problem structure, aiming to generate accurate Lagrangian\nmultipliers efficiently. We apply this technique to the well-known Held-Karp\nLagrangian relaxation for the travelling salesman problem. The core idea is to\npredict accurate Lagrangian multipliers and to employ them as a warm start for\ngenerating Held-Karp relaxation bounds. These bounds are subsequently utilized\nto enhance the filtering process carried out by branch-and-bound algorithms. In\ncontrast to much of the existing literature, which primarily focuses on finding\nfeasible solutions, our approach operates on the dual side, demonstrating that\nlearning can also accelerate the proof of optimality. We conduct experiments\nacross various distributions of the metric travelling salesman problem,\nconsidering instances with up to 200 cities. The results illustrate that our\napproach can improve the filtering level of the weighted circuit global\nconstraint, reduce the optimality gap by a factor two for unsolved instances up\nto a timeout, and reduce the execution time for solved instances by 10%.\n","authors":["Augustin Parjadis","Quentin Cappart","Bistra Dilkina","Aaron Ferber","Louis-Martin Rousseau"],"pdf_url":"https://arxiv.org/pdf/2312.14836v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14820v1","updated":"2023-12-22T16:47:10Z","published":"2023-12-22T16:47:10Z","title":"Understanding the Regularity of Self-Attention with Optimal Transport","summary":" Transformers and their multi-head attention mechanism have completely changed\nthe machine learning landscape in just a few years, by outperforming\nstate-of-art models in a wide range of domains. Still, little is known about\ntheir robustness from a theoretical perspective. We tackle this problem by\nstudying the local Lipschitz constant of self-attention, that provides an\nattack-agnostic way of measuring the robustness of a neural network. We adopt a\nmeasure-theoretic framework, by viewing inputs as probability measures equipped\nwith the Wasserstein distance. This allows us to generalize attention to inputs\nof infinite length, and to derive an upper bound and a lower bound on the\nLipschitz constant of self-attention on compact sets. The lower bound\nsignificantly improves prior results, and grows more than exponentially with\nthe radius of the compact set, which rules out the possibility of obtaining\nrobustness guarantees without any additional constraint on the input space. Our\nresults also point out that measures with a high local Lipschitz constant are\ntypically made of a few diracs, with a very unbalanced distribution of mass.\nFinally, we analyze the stability of self-attention under perturbations that\nchange the number of tokens, which appears to be a natural question in the\nmeasure-theoretic framework. In particular, we show that for some inputs,\nattacks that duplicate tokens before perturbing them are more efficient than\nattacks that simply move tokens. We call this phenomenon mass splitting.\n","authors":["Valérie Castin","Pierre Ablin","Gabriel Peyré"],"pdf_url":"https://arxiv.org/pdf/2312.14820v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14812v1","updated":"2023-12-22T16:33:45Z","published":"2023-12-22T16:33:45Z","title":"PARDINUS: Weakly supervised discarding of photo-trapping empty images\n based on autoencoders","summary":" Photo-trapping cameras are widely employed for wildlife monitoring. Those\ncameras take photographs when motion is detected to capture images where\nanimals appear. A significant portion of these images are empty - no wildlife\nappears in the image. Filtering out those images is not a trivial task since it\nrequires hours of manual work from biologists. Therefore, there is a notable\ninterest in automating this task. Automatic discarding of empty photo-trapping\nimages is still an open field in the area of Machine Learning. Existing\nsolutions often rely on state-of-the-art supervised convolutional neural\nnetworks that require the annotation of the images in the training phase.\nPARDINUS (Weakly suPervised discARDINg of photo-trapping empty images based on\naUtoencoderS) is constructed on the foundation of weakly supervised learning\nand proves that this approach equals or even surpasses other fully supervised\nmethods that require further labeling work.\n","authors":["David de la Rosa","Antonio J Rivera","María J del Jesus","Francisco Charte"],"pdf_url":"https://arxiv.org/pdf/2312.14812v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14806v1","updated":"2023-12-22T16:27:12Z","published":"2023-12-22T16:27:12Z","title":"The Effects of Signal-to-Noise Ratio on Generative Adversarial Networks\n Applied to Marine Bioacoustic Data","summary":" In recent years generative adversarial networks (GANs) have been used to\nsupplement datasets within the field of marine bioacoustics. This is driven by\nfactors such as the cost to collect data, data sparsity and aid preprocessing.\nOne notable challenge with marine bioacoustic data is the low signal-to-noise\nratio (SNR) posing difficulty when applying deep learning techniques such as\nGANs. This work investigates the effect SNR has on the audio-based GAN\nperformance and examines three different evaluation methodologies for GAN\nperformance, yielding interesting results on the effects of SNR on GANs,\nspecifically WaveGAN.\n","authors":["Georgia Atkinson","Nick Wright","A. Stephen McGough","Per Berggren"],"pdf_url":"https://arxiv.org/pdf/2312.14806v1.pdf","comment":"6 pages, 6 figures"},{"id":"http://arxiv.org/abs/2312.14795v1","updated":"2023-12-22T16:12:25Z","published":"2023-12-22T16:12:25Z","title":"On support vector machines under a multiple-cost scenario","summary":" Support Vector Machine (SVM) is a powerful tool in binary classification,\nknown to attain excellent misclassification rates. On the other hand, many\nrealworld classification problems, such as those found in medical diagnosis,\nchurn or fraud prediction, involve misclassification costs which may be\ndifferent in the different classes. However, it may be hard for the user to\nprovide precise values for such misclassification costs, whereas it may be much\neasier to identify acceptable misclassification rates values. In this paper we\npropose a novel SVM model in which misclassification costs are considered by\nincorporating performance constraints in the problem formulation. Specifically,\nour aim is to seek the hyperplane with maximal margin yielding\nmisclassification rates below given threshold values. Such maximal margin\nhyperplane is obtained by solving a quadratic convex problem with linear\nconstraints and integer variables. The reported numerical experience shows that\nour model gives the user control on the misclassification rates in one class\n(possibly at the expense of an increase in misclassification rates for the\nother class) and is feasible in terms of running times.\n","authors":["Sandra Benítez-Peña","Rafael Blanquero","Emilio Carrizosa","Pepa Ramírez-Cobo"],"pdf_url":"https://arxiv.org/pdf/2312.14795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14792v1","updated":"2023-12-22T16:06:43Z","published":"2023-12-22T16:06:43Z","title":"The Rate-Distortion-Perception-Classification Tradeoff: Joint Source\n Coding and Modulation via Inverse-Domain GANs","summary":" The joint source coding and modulation (JSCM) framework was enabled by recent\ndevelopments in deep learning, which allows to automatically learn from data,\nand in an end-to-end fashion, the best compression codes and modulation\nschemes. In this paper, we show the existence of a strict tradeoff between\nchannel rate, distortion, perception, and classification accuracy in a JSCM\nscenario. We then propose two image compression methods to navigate that\ntradeoff: an inverse-domain generative adversarial network (ID-GAN), which\nachieves extreme compression, and a simpler, heuristic method that reveals\ninsights about the performance of ID-GAN. Experiment results not only\ncorroborate the theoretical findings, but also demonstrate that the proposed\nID-GAN algorithm significantly improves system performance compared to\ntraditional separation-based methods and recent deep JSCM architectures.\n","authors":["Junli Fang","João F. C. Mota","Baoshan Lu","Weicheng Zhang","Xuemin Hong"],"pdf_url":"https://arxiv.org/pdf/2312.14792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01438v2","updated":"2023-12-22T15:59:38Z","published":"2023-09-30T15:44:39Z","title":"Building Flexible, Scalable, and Machine Learning-ready Multimodal\n Oncology Datasets","summary":" The advancements in data acquisition, storage, and processing techniques have\nresulted in the rapid growth of heterogeneous medical data. Integrating\nradiological scans, histopathology images, and molecular information with\nclinical data is essential for developing a holistic understanding of the\ndisease and optimizing treatment. The need for integrating data from multiple\nsources is further pronounced in complex diseases such as cancer for enabling\nprecision medicine and personalized treatments. This work proposes Multimodal\nIntegration of Oncology Data System (MINDS) - a flexible, scalable, and\ncost-effective metadata framework for efficiently fusing disparate data from\npublic sources such as the Cancer Research Data Commons (CRDC) into an\ninterconnected, patient-centric framework. MINDS offers an interface for\nexploring relationships across data types and building cohorts for developing\nlarge-scale multimodal machine learning models. By harmonizing multimodal data,\nMINDS aims to potentially empower researchers with greater analytical ability\nto uncover diagnostic and prognostic insights and enable evidence-based\npersonalized care. MINDS tracks granular end-to-end data provenance, ensuring\nreproducibility and transparency. The cloud-native architecture of MINDS can\nhandle exponential data growth in a secure, cost-optimized manner while\nensuring substantial storage optimization, replication avoidance, and dynamic\naccess capabilities. Auto-scaling, access controls, and other mechanisms\nguarantee pipelines' scalability and security. MINDS overcomes the limitations\nof existing biomedical data silos via an interoperable metadata-driven approach\nthat represents a pivotal step toward the future of oncology data integration.\n","authors":["Aakash Tripathi","Asim Waqas","Kavya Venkatesan","Yasin Yilmaz","Ghulam Rasool"],"pdf_url":"https://arxiv.org/pdf/2310.01438v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14770v1","updated":"2023-12-22T15:39:03Z","published":"2023-12-22T15:39:03Z","title":"Integration Of Evolutionary Automated Machine Learning With Structural\n Sensitivity Analysis For Composite Pipelines","summary":" Automated machine learning (AutoML) systems propose an end-to-end solution to\na given machine learning problem, creating either fixed or flexible pipelines.\nFixed pipelines are task independent constructs: their general composition\nremains the same, regardless of the data. In contrast, the structure of\nflexible pipelines varies depending on the input, making them finely tailored\nto individual tasks. However, flexible pipelines can be structurally\novercomplicated and have poor explainability. We propose the EVOSA approach\nthat compensates for the negative points of flexible pipelines by incorporating\na sensitivity analysis which increases the robustness and interpretability of\nthe flexible solutions. EVOSA quantitatively estimates positive and negative\nimpact of an edge or a node on a pipeline graph, and feeds this information to\nthe evolutionary AutoML optimizer. The correctness and efficiency of EVOSA was\nvalidated in tabular, multimodal and computer vision tasks, suggesting\ngeneralizability of the proposed approach across domains.\n","authors":["Nikolay O. Nikitin","Maiia Pinchuk","Valerii Pokrovskii","Peter Shevchenko","Andrey Getmanov","Yaroslav Aksenkin","Ilia Revin","Andrey Stebenkov","Ekaterina Poslavskaya","Anna V. Kalyuzhnaya"],"pdf_url":"https://arxiv.org/pdf/2312.14770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14769v1","updated":"2023-12-22T15:38:13Z","published":"2023-12-22T15:38:13Z","title":"Large Language Model (LLM) Bias Index -- LLMBI","summary":" The Large Language Model Bias Index (LLMBI) is a pioneering approach designed\nto quantify and address biases inherent in large language models (LLMs), such\nas GPT-4. We recognise the increasing prevalence and impact of LLMs across\ndiverse sectors. This research introduces a novel metric, LLMBI, to\nsystematically measure and mitigate biases potentially skewing model responses.\nWe formulated LLMBI using a composite scoring system incorporating multiple\ndimensions of bias, including but not limited to age, gender, and racial\nbiases.\n To operationalise this metric, we engaged in a multi-step process involving\ncollecting and annotating LLM responses, applying sophisticated Natural\nLanguage Processing (NLP) techniques for bias detection, and computing the\nLLMBI score through a specially crafted mathematical formula. The formula\nintegrates weighted averages of various bias dimensions, a penalty for dataset\ndiversity deficiencies, and a correction for sentiment biases. Our empirical\nanalysis, conducted using responses from OpenAI's API, employs advanced\nsentiment analysis as a representative method for bias detection.\n The research reveals LLMs, whilst demonstrating impressive capabilities in\ntext generation, exhibit varying degrees of bias across different dimensions.\nLLMBI provides a quantifiable measure to compare biases across models and over\ntime, offering a vital tool for systems engineers, researchers and regulators\nin enhancing the fairness and reliability of LLMs. It highlights the potential\nof LLMs in mimicking unbiased human-like responses. Additionally, it\nunderscores the necessity of continuously monitoring and recalibrating such\nmodels to align with evolving societal norms and ethical standards.\n","authors":["Abiodun Finbarrs Oketunji","Muhammad Anas","Deepthi Saina"],"pdf_url":"https://arxiv.org/pdf/2312.14769v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14763v1","updated":"2023-12-22T15:28:55Z","published":"2023-12-22T15:28:55Z","title":"Enhanced Latent Multi-view Subspace Clustering","summary":" Latent multi-view subspace clustering has been demonstrated to have desirable\nclustering performance. However, the original latent representation method\nvertically concatenates the data matrices from multiple views into a single\nmatrix along the direction of dimensionality to recover the latent\nrepresentation matrix, which may result in an incomplete information recovery.\nTo fully recover the latent space representation, we in this paper propose an\nEnhanced Latent Multi-view Subspace Clustering (ELMSC) method. The ELMSC method\ninvolves constructing an augmented data matrix that enhances the representation\nof multi-view data. Specifically, we stack the data matrices from various views\ninto the block-diagonal locations of the augmented matrix to exploit the\ncomplementary information. Meanwhile, the non-block-diagonal entries are\ncomposed based on the similarity between different views to capture the\nconsistent information. In addition, we enforce a sparse regularization for the\nnon-diagonal blocks of the augmented self-representation matrix to avoid\nredundant calculations of consistency information. Finally, a novel iterative\nalgorithm based on the framework of Alternating Direction Method of Multipliers\n(ADMM) is developed to solve the optimization problem for ELMSC. Extensive\nexperiments on real-world datasets demonstrate that our proposed ELMSC is able\nto achieve higher clustering performance than some state-of-art multi-view\nclustering methods.\n","authors":["Long Shi","Lei Cao","Jun Wang","Badong Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13970v2","updated":"2023-12-22T15:28:23Z","published":"2023-12-21T15:56:09Z","title":"On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and\n Efficient Gradient Methods","summary":" This paper studies the Partial Optimal Transport (POT) problem between two\nunbalanced measures with at most $n$ supports and its applications in various\nAI tasks such as color transfer or domain adaptation. There is hence the need\nfor fast approximations of POT with increasingly large problem sizes in arising\napplications. We first theoretically and experimentally investigate the\ninfeasibility of the state-of-the-art Sinkhorn algorithm for POT due to its\nincompatible rounding procedure, which consequently degrades its qualitative\nperformance in real world applications like point-cloud registration. To this\nend, we propose a novel rounding algorithm for POT, and then provide a feasible\nSinkhorn procedure with a revised computation complexity of\n$\\mathcal{\\widetilde O}(n^2/\\varepsilon^4)$. Our rounding algorithm also\npermits the development of two first-order methods to approximate the POT\nproblem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent\n(APDAGD), finds an $\\varepsilon$-approximate solution to the POT problem in\n$\\mathcal{\\widetilde O}(n^{2.5}/\\varepsilon)$, which is better in $\\varepsilon$\nthan revised Sinkhorn. The second method, Dual Extrapolation, achieves the\ncomputation complexity of $\\mathcal{\\widetilde O}(n^2/\\varepsilon)$, thereby\nbeing the best in the literature. We further demonstrate the flexibility of POT\ncompared to standard OT as well as the practicality of our algorithms on real\napplications where two marginal distributions are unbalanced.\n","authors":["Anh Duc Nguyen","Tuan Dung Nguyen","Quang Minh Nguyen","Hoang H. Nguyen","Lam M. Nguyen","Kim-Chuan Toh"],"pdf_url":"https://arxiv.org/pdf/2312.13970v2.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14758v1","updated":"2023-12-22T15:17:44Z","published":"2023-12-22T15:17:44Z","title":"Diffusion Maps for Signal Filtering in Graph Learning","summary":" This paper explores the application diffusion maps as graph shift operators\nin understanding the underlying geometry of graph signals. The study evaluates\nthe improvements in graph learning when using diffusion map generated filters\nto the Markov Variation minimization problem. The paper showcases the\neffectiveness of this approach through examples involving synthetically\ngenerated and real-world temperature sensor data. These examples also compare\nthe diffusion map graph signal model with other commonly used graph signal\noperators. The results provide new approaches for the analysis and\nunderstanding of complex, non-Euclidean data structures.\n","authors":["Todd Hildebrant"],"pdf_url":"https://arxiv.org/pdf/2312.14758v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14751v1","updated":"2023-12-22T15:05:56Z","published":"2023-12-22T15:05:56Z","title":"Hazards from Increasingly Accessible Fine-Tuning of Downloadable\n Foundation Models","summary":" Public release of the weights of pretrained foundation models, otherwise\nknown as downloadable access \\citep{solaiman_gradient_2023}, enables\nfine-tuning without the prohibitive expense of pretraining. Our work argues\nthat increasingly accessible fine-tuning of downloadable models may increase\nhazards. First, we highlight research to improve the accessibility of\nfine-tuning. We split our discussion into research that A) reduces the\ncomputational cost of fine-tuning and B) improves the ability to share that\ncost across more actors. Second, we argue that increasingly accessible\nfine-tuning methods may increase hazard through facilitating malicious use and\nmaking oversight of models with potentially dangerous capabilities more\ndifficult. Third, we discuss potential mitigatory measures, as well as benefits\nof more accessible fine-tuning. Given substantial remaining uncertainty about\nhazards, we conclude by emphasizing the urgent need for the development of\nmitigations.\n","authors":["Alan Chan","Ben Bucknall","Herbie Bradley","David Krueger"],"pdf_url":"https://arxiv.org/pdf/2312.14751v1.pdf","comment":"Accepted as a spotlight workshop paper at the Socially Responsible\n Language Modelling Research (SoLaR) workshop, held at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14748v1","updated":"2023-12-22T15:04:20Z","published":"2023-12-22T15:04:20Z","title":"Progressing from Anomaly Detection to Automated Log Labeling and\n Pioneering Root Cause Analysis","summary":" The realm of AIOps is transforming IT landscapes with the power of AI and ML.\nDespite the challenge of limited labeled data, supervised models show promise,\nemphasizing the importance of leveraging labels for training, especially in\ndeep learning contexts. This study enhances the field by introducing a taxonomy\nfor log anomalies and exploring automated data labeling to mitigate labeling\nchallenges. It goes further by investigating the potential of diverse anomaly\ndetection techniques and their alignment with specific anomaly types. However,\nthe exploration doesn't stop at anomaly detection. The study envisions a future\nwhere root cause analysis follows anomaly detection, unraveling the underlying\ntriggers of anomalies. This uncharted territory holds immense potential for\nrevolutionizing IT systems management. In essence, this paper enriches our\nunderstanding of anomaly detection, and automated labeling, and sets the stage\nfor transformative root cause analysis. Together, these advances promise more\nresilient IT systems, elevating operational efficiency and user satisfaction in\nan ever-evolving technological landscape.\n","authors":["Thorsten Wittkopp","Alexander Acker","Odej Kao"],"pdf_url":"https://arxiv.org/pdf/2312.14748v1.pdf","comment":"accepted at AIOPS workshop @ICDM 2023"},{"id":"http://arxiv.org/abs/2310.09433v2","updated":"2023-12-22T14:45:45Z","published":"2023-10-13T22:48:50Z","title":"Effects of cavity nonlinearities and linear losses on silicon\n microring-based reservoir computing","summary":" Microring resonators (MRRs) are promising devices for time-delay photonic\nreservoir computing, but the impact of the different physical effects taking\nplace in the MRRs on the reservoir computing performance is yet to be fully\nunderstood. We numerically analyze the impact of linear losses as well as\nthermo-optic and free-carrier effects relaxation times on the prediction error\nof the time-series task NARMA-10. We demonstrate the existence of three\nregions, defined by the input power and the frequency detuning between the\noptical source and the microring resonance, that reveal the cavity transition\nfrom linear to nonlinear regimes. One of these regions offers very low error in\ntime-series prediction under relatively low input power and number of nodes\nwhile the other regions either lack nonlinearity or become unstable. This study\nprovides insight into the design of the MRR and the optimization of its\nphysical properties for improving the prediction performance of time-delay\nreservoir computing.\n","authors":["Bernard J. Giron Castro","Christophe Peucheret","Darko Zibar","Francesco Da Ros"],"pdf_url":"https://arxiv.org/pdf/2310.09433v2.pdf","comment":"20 pages, 11 figures, submitted to Optics Express (reviewed version)"},{"id":"http://arxiv.org/abs/2312.14712v1","updated":"2023-12-22T14:10:07Z","published":"2023-12-22T14:10:07Z","title":"Can Machines Learn Robustly, Privately, and Efficiently?","summary":" The success of machine learning (ML) applications relies on vast datasets and\ndistributed architectures, which, as they grow, present challenges for ML. In\nreal-world scenarios, where data often contains sensitive information, issues\nlike data poisoning and hardware failures are common. Ensuring privacy and\nrobustness is vital for the broad adoption of ML in public life. This paper\nexamines the costs associated with achieving these objectives in distributed\narchitectures. We overview the meanings of privacy and robustness in\ndistributed ML, and clarify how they can be achieved efficiently in isolation.\nHowever, we contend that the integration of these objectives entails a notable\ncompromise in computational efficiency. We delve into this intricate balance,\nexploring the challenges and solutions for privacy, robustness, and\ncomputational efficiency in ML applications.\n","authors":["Youssef Allouah","Rachid Guerraoui","John Stephan"],"pdf_url":"https://arxiv.org/pdf/2312.14712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14705v1","updated":"2023-12-22T14:06:03Z","published":"2023-12-22T14:06:03Z","title":"SCUNet++: Assessment of Pulmonary Embolism CT Image Segmentation\n Leveraging Swin-UNet and CNN Bottleneck Hybrid Architecture with Multi-Fusion\n Dense Skip Connection","summary":" Pulmonary embolism (PE) is a prevalent lung disease that can lead to right\nventricular hypertrophy and failure in severe cases, ranking second in severity\nonly to myocardial infarction and sudden death. Pulmonary artery CT angiography\n(CTPA) is a widely used diagnostic method for PE. However, PE detection\npresents challenges in clinical practice due to limitations in imaging\ntechnology. CTPA can produce noises similar to PE, making confirmation of its\npresence time-consuming and prone to overdiagnosis. Nevertheless, the\ntraditional segmentation method of PE can not fully consider the hierarchical\nstructure of features, local and global spatial features of PE CT images. In\nthis paper, we propose an automatic PE segmentation method called SCUNet++\n(Swin Conv UNet++). This method incorporates multiple fusion dense skip\nconnections between the encoder and decoder, utilizing the Swin Transformer as\nthe encoder. And fuses features of different scales in the decoder subnetwork\nto compensate for spatial information loss caused by the inevitable\ndownsampling in Swin-UNet or other state-of-the-art methods, effectively\nsolving the above problem. We provide a theoretical analysis of this method in\ndetail and validate it on publicly available PE CT image datasets FUMPE and\nCAD-PE. The experimental results indicate that our proposed method achieved a\nDice similarity coefficient (DSC) of 83.47% and a Hausdorff distance 95th\npercentile (HD95) of 3.83 on the FUMPE dataset, as well as a DSC of 83.42% and\nan HD95 of 5.10 on the CAD-PE dataset. These findings demonstrate that our\nmethod exhibits strong performance in PE segmentation tasks, potentially\nenhancing the accuracy of automatic segmentation of PE and providing a powerful\ndiagnostic tool for clinical physicians. Our source code and new FUMPE dataset\nare available at https://github.com/JustlfC03/SCUNet-plusplus.\n","authors":["Yifei Chen","Binfeng Zou","Zhaoxin Guo","Yiyu Huang","Yifan Huang","Feiwei Qin","Qinhai Li","Changmiao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14705v1.pdf","comment":"10 pages, 7 figures, accept wacv2024"},{"id":"http://arxiv.org/abs/2312.14698v1","updated":"2023-12-22T13:57:29Z","published":"2023-12-22T13:57:29Z","title":"Time-changed normalizing flows for accurate SDE modeling","summary":" The generative paradigm has become increasingly important in machine learning\nand deep learning models. Among popular generative models are normalizing\nflows, which enable exact likelihood estimation by transforming a base\ndistribution through diffeomorphic transformations. Extending the normalizing\nflow framework to handle time-indexed flows gave dynamic normalizing flows, a\npowerful tool to model time series, stochastic processes, and neural stochastic\ndifferential equations (SDEs). In this work, we propose a novel variant of\ndynamic normalizing flows, a Time Changed Normalizing Flow (TCNF), based on\ntime deformation of a Brownian motion which constitutes a versatile and\nextensive family of Gaussian processes. This approach enables us to effectively\nmodel some SDEs, that cannot be modeled otherwise, including standard ones such\nas the well-known Ornstein-Uhlenbeck process, and generalizes prior\nmethodologies, leading to improved results and better inference and prediction\ncapability.\n","authors":["Naoufal El Bekri","Lucas Drumetz","Franck Vermet"],"pdf_url":"https://arxiv.org/pdf/2312.14698v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11241v2","updated":"2023-12-22T13:55:53Z","published":"2023-04-21T20:22:17Z","title":"AutoNeRF: Training Implicit Scene Representations with Autonomous Agents","summary":" Implicit representations such as Neural Radiance Fields (NeRF) have been\nshown to be very effective at novel view synthesis. However, these models\ntypically require manual and careful human data collection for training. In\nthis paper, we present AutoNeRF, a method to collect data required to train\nNeRFs using autonomous embodied agents. Our method allows an agent to explore\nan unseen environment efficiently and use the experience to build an implicit\nmap representation autonomously. We compare the impact of different exploration\nstrategies including handcrafted frontier-based exploration, end-to-end and\nmodular approaches composed of trained high-level planners and classical\nlow-level path followers. We train these models with different reward functions\ntailored to this problem and evaluate the quality of the learned\nrepresentations on four different downstream tasks: classical viewpoint\nrendering, map reconstruction, planning, and pose refinement. Empirical results\nshow that NeRFs can be trained on actively collected data using just a single\nepisode of experience in an unseen environment, and can be used for several\ndownstream robotic tasks, and that modular trained exploration models\noutperform other classical and end-to-end baselines. Finally, we show that\nAutoNeRF can reconstruct large-scale scenes, and is thus a useful tool to\nperform scene-specific adaptation as the produced 3D environment models can be\nloaded into a simulator to fine-tune a policy of interest.\n","authors":["Pierre Marza","Laetitia Matignon","Olivier Simonin","Dhruv Batra","Christian Wolf","Devendra Singh Chaplot"],"pdf_url":"https://arxiv.org/pdf/2304.11241v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11706v2","updated":"2023-12-22T13:55:42Z","published":"2023-06-20T17:35:20Z","title":"RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation","summary":" The ability to leverage heterogeneous robotic experience from different\nrobots and tasks to quickly master novel skills and embodiments has the\npotential to transform robot learning. Inspired by recent advances in\nfoundation models for vision and language, we propose a multi-embodiment,\nmulti-task generalist agent for robotic manipulation. This agent, named\nRoboCat, is a visual goal-conditioned decision transformer capable of consuming\naction-labelled visual experience. This data spans a large repertoire of motor\ncontrol skills from simulated and real robotic arms with varying sets of\nobservations and actions. With RoboCat, we demonstrate the ability to\ngeneralise to new tasks and robots, both zero-shot as well as through\nadaptation using only 100-1000 examples for the target task. We also show how a\ntrained model itself can be used to generate data for subsequent training\niterations, thus providing a basic building block for an autonomous improvement\nloop. We investigate the agent's capabilities, with large-scale evaluations\nboth in simulation and on three different real robot embodiments. We find that\nas we grow and diversify its training data, RoboCat not only shows signs of\ncross-task transfer, but also becomes more efficient at adapting to new tasks.\n","authors":["Konstantinos Bousmalis","Giulia Vezzani","Dushyant Rao","Coline Devin","Alex X. Lee","Maria Bauza","Todor Davchev","Yuxiang Zhou","Agrim Gupta","Akhil Raju","Antoine Laurens","Claudio Fantacci","Valentin Dalibard","Martina Zambelli","Murilo Martins","Rugile Pevceviciute","Michiel Blokzijl","Misha Denil","Nathan Batchelor","Thomas Lampe","Emilio Parisotto","Konrad Żołna","Scott Reed","Sergio Gómez Colmenarejo","Jon Scholz","Abbas Abdolmaleki","Oliver Groth","Jean-Baptiste Regli","Oleg Sushkov","Tom Rothörl","José Enrique Chen","Yusuf Aytar","Dave Barker","Joy Ortiz","Martin Riedmiller","Jost Tobias Springenberg","Raia Hadsell","Francesco Nori","Nicolas Heess"],"pdf_url":"https://arxiv.org/pdf/2306.11706v2.pdf","comment":"Transactions on Machine Learning Research (12/2023)"},{"id":"http://arxiv.org/abs/2312.14688v1","updated":"2023-12-22T13:43:57Z","published":"2023-12-22T13:43:57Z","title":"A Mathematical Guide to Operator Learning","summary":" Operator learning aims to discover properties of an underlying dynamical\nsystem or partial differential equation (PDE) from data. Here, we present a\nstep-by-step guide to operator learning. We explain the types of problems and\nPDEs amenable to operator learning, discuss various neural network\narchitectures, and explain how to employ numerical PDE solvers effectively. We\nalso give advice on how to create and manage training data and conduct\noptimization. We offer intuition behind the various neural network\narchitectures employed in operator learning by motivating them from the\npoint-of-view of numerical linear algebra.\n","authors":["Nicolas Boullé","Alex Townsend"],"pdf_url":"https://arxiv.org/pdf/2312.14688v1.pdf","comment":"45 pages, 11 figures"},{"id":"http://arxiv.org/abs/2312.14681v1","updated":"2023-12-22T13:34:18Z","published":"2023-12-22T13:34:18Z","title":"Engineered Ordinary Differential Equations as Classification Algorithm\n (EODECA): thorough characterization and testing","summary":" EODECA (Engineered Ordinary Differential Equations as Classification\nAlgorithm) is a novel approach at the intersection of machine learning and\ndynamical systems theory, presenting a unique framework for classification\ntasks [1]. This method stands out with its dynamical system structure,\nutilizing ordinary differential equations (ODEs) to efficiently handle complex\nclassification challenges. The paper delves into EODECA's dynamical properties,\nemphasizing its resilience against random perturbations and robust performance\nacross various classification scenarios. Notably, EODECA's design incorporates\nthe ability to embed stable attractors in the phase space, enhancing\nreliability and allowing for reversible dynamics. In this paper, we carry out a\ncomprehensive analysis by expanding on the work [1], and employing a Euler\ndiscretization scheme. In particular, we evaluate EODECA's performance across\nfive distinct classification problems, examining its adaptability and\nefficiency. Significantly, we demonstrate EODECA's effectiveness on the MNIST\nand Fashion MNIST datasets, achieving impressive accuracies of $98.06\\%$ and\n$88.21\\%$, respectively. These results are comparable to those of a multi-layer\nperceptron (MLP), underscoring EODECA's potential in complex data processing\ntasks. We further explore the model's learning journey, assessing its evolution\nin both pre and post training environments and highlighting its ability to\nnavigate towards stable attractors. The study also investigates the\ninvertibility of EODECA, shedding light on its decision-making processes and\ninternal workings. This paper presents a significant step towards a more\ntransparent and robust machine learning paradigm, bridging the gap between\nmachine learning algorithms and dynamical systems methodologies.\n","authors":["Raffaele Marino","Lorenzo Buffoni","Lorenzo Chicchi","Lorenzo Giambagli","Duccio Fanelli"],"pdf_url":"https://arxiv.org/pdf/2312.14681v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.03131v2","updated":"2023-12-22T13:23:51Z","published":"2022-12-06T16:55:10Z","title":"Explainability as statistical inference","summary":" A wide variety of model explanation approaches have been proposed in recent\nyears, all guided by very different rationales and heuristics. In this paper,\nwe take a new route and cast interpretability as a statistical inference\nproblem. We propose a general deep probabilistic model designed to produce\ninterpretable predictions. The model parameters can be learned via maximum\nlikelihood, and the method can be adapted to any predictor network architecture\nand any type of prediction problem. Our method is a case of amortized\ninterpretability models, where a neural network is used as a selector to allow\nfor fast interpretation at inference time. Several popular interpretability\nmethods are shown to be particular cases of regularised maximum likelihood for\nour general model. We propose new datasets with ground truth selection which\nallow for the evaluation of the features importance map. Using these datasets,\nwe show experimentally that using multiple imputation provides more reasonable\ninterpretations.\n","authors":["Hugo Henri Joseph Senetaire","Damien Garreau","Jes Frellsen","Pierre-Alexandre Mattei"],"pdf_url":"https://arxiv.org/pdf/2212.03131v2.pdf","comment":"10 pages, 22 figures, submitted at ICLR 2023"},{"id":"http://arxiv.org/abs/2306.05059v2","updated":"2023-12-22T13:22:17Z","published":"2023-06-08T09:23:22Z","title":"Reconciling Predictive and Statistical Parity: A Causal Approach","summary":" Since the rise of fair machine learning as a critical field of inquiry, many\ndifferent notions on how to quantify and measure discrimination have been\nproposed in the literature. Some of these notions, however, were shown to be\nmutually incompatible. Such findings make it appear that numerous different\nkinds of fairness exist, thereby making a consensus on the appropriate measure\nof fairness harder to reach, hindering the applications of these tools in\npractice. In this paper, we investigate one of these key impossibility results\nthat relates the notions of statistical and predictive parity. Specifically, we\nderive a new causal decomposition formula for the fairness measures associated\nwith predictive parity, and obtain a novel insight into how this criterion is\nrelated to statistical parity through the legal doctrines of disparate\ntreatment, disparate impact, and the notion of business necessity. Our results\nshow that through a more careful causal analysis, the notions of statistical\nand predictive parity are not really mutually exclusive, but complementary and\nspanning a spectrum of fairness notions through the concept of business\nnecessity. Finally, we demonstrate the importance of our findings on a\nreal-world example.\n","authors":["Drago Plecko","Elias Bareinboim"],"pdf_url":"https://arxiv.org/pdf/2306.05059v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14667v1","updated":"2023-12-22T13:03:23Z","published":"2023-12-22T13:03:23Z","title":"Token-Level Contrastive Learning with Modality-Aware Prompting for\n Multimodal Intent Recognition","summary":" Multimodal intent recognition aims to leverage diverse modalities such as\nexpressions, body movements and tone of speech to comprehend user's intent,\nconstituting a critical task for understanding human language and behavior in\nreal-world multimodal scenarios. Nevertheless, the majority of existing methods\nignore potential correlations among different modalities and own limitations in\neffectively learning semantic features from nonverbal modalities. In this\npaper, we introduce a token-level contrastive learning method with\nmodality-aware prompting (TCL-MAP) to address the above challenges. To\nestablish an optimal multimodal semantic environment for text modality, we\ndevelop a modality-aware prompting module (MAP), which effectively aligns and\nfuses features from text, video and audio modalities with similarity-based\nmodality alignment and cross-modality attention mechanism. Based on the\nmodality-aware prompt and ground truth labels, the proposed token-level\ncontrastive learning framework (TCL) constructs augmented samples and employs\nNT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal\ntextual semantic insights derived from intent labels to guide the learning\nprocesses of other modalities in return. Extensive experiments show that our\nmethod achieves remarkable improvements compared to state-of-the-art methods.\nAdditionally, ablation analyses demonstrate the superiority of the\nmodality-aware prompt over the handcrafted prompt, which holds substantial\nsignificance for multimodal prompt learning. The codes are released at\nhttps://github.com/thuiar/TCL-MAP.\n","authors":["Qianrui Zhou","Hua Xu","Hao Li","Hanlei Zhang","Xiaohan Zhang","Yifan Wang","Kai Gao"],"pdf_url":"https://arxiv.org/pdf/2312.14667v1.pdf","comment":"Accepted by AAAI 2024 (Main Track, Long Paper)"},{"id":"http://arxiv.org/abs/2312.06275v2","updated":"2023-12-22T13:01:13Z","published":"2023-12-11T10:26:21Z","title":"DG-TTA: Out-of-domain medical image segmentation through Domain\n Generalization and Test-Time Adaptation","summary":" Applying pre-trained medical segmentation models on out-of-domain images\noften yields predictions of insufficient quality. Several strategies have been\nproposed to maintain model performance, such as finetuning or unsupervised- and\nsource-free domain adaptation. These strategies set restrictive requirements\nfor data availability. In this study, we propose to combine domain\ngeneralization and test-time adaptation to create a highly effective approach\nfor reusing pre-trained models in unseen target domains. Domain-generalized\npre-training on source data is used to obtain the best initial performance in\nthe target domain. We introduce the MIND descriptor previously used in image\nregistration tasks as a further technique to achieve generalization and present\nsuperior performance for small-scale datasets compared to existing approaches.\nAt test-time, high-quality segmentation for every single unseen scan is ensured\nby optimizing the model weights for consistency given different image\naugmentations. That way, our method enables separate use of source and target\ndata and thus removes current data availability barriers. Moreover, the\npresented method is highly modular as it does not require specific model\narchitectures or prior knowledge of involved domains and labels. We demonstrate\nthis by integrating it into the nnUNet, which is currently the most popular and\naccurate framework for medical image segmentation. We employ multiple datasets\ncovering abdominal, cardiac, and lumbar spine scans and compose several\nout-of-domain scenarios in this study. We demonstrate that our method, combined\nwith pre-trained whole-body CT models, can effectively segment MR images with\nhigh accuracy in all of the aforementioned scenarios. Open-source code can be\nfound here: https://github.com/multimodallearning/DG-TTA\n","authors":["Christian Weihsbach","Christian N. Kruse","Alexander Bigalke","Mattias P. Heinrich"],"pdf_url":"https://arxiv.org/pdf/2312.06275v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2312.10794v2","updated":"2023-12-22T12:47:06Z","published":"2023-12-17T19:06:29Z","title":"A mathematical perspective on Transformers","summary":" Transformers play a central role in the inner workings of large language\nmodels. We develop a mathematical framework for analyzing Transformers based on\ntheir interpretation as interacting particle systems, which reveals that\nclusters emerge in long time. Our study explores the underlying theory and\noffers new perspectives for mathematicians as well as computer scientists.\n","authors":["Borjan Geshkovski","Cyril Letrouit","Yury Polyanskiy","Philippe Rigollet"],"pdf_url":"https://arxiv.org/pdf/2312.10794v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14657v1","updated":"2023-12-22T12:46:30Z","published":"2023-12-22T12:46:30Z","title":"Deep Non-Parametric Time Series Forecaster","summary":" This paper presents non-parametric baseline models for time series\nforecasting. Unlike classical forecasting models, the proposed approach does\nnot assume any parametric form for the predictive distribution and instead\ngenerates predictions by sampling from the empirical distribution according to\na tunable strategy. By virtue of this, the model is always able to produce\nreasonable forecasts (i.e., predictions within the observed data range) without\nfail unlike classical models that suffer from numerical stability on some data\ndistributions. Moreover, we develop a global version of the proposed method\nthat automatically learns the sampling strategy by exploiting the information\nacross multiple related time series. The empirical evaluation shows that the\nproposed methods have reasonable and consistent performance across all\ndatasets, proving them to be strong baselines to be considered in one's\nforecasting toolbox.\n","authors":["Syama Sundar Rangapuram","Jan Gasthaus","Lorenzo Stella","Valentin Flunkert","David Salinas","Yuyang Wang","Tim Januschowski"],"pdf_url":"https://arxiv.org/pdf/2312.14657v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14651v1","updated":"2023-12-22T12:36:50Z","published":"2023-12-22T12:36:50Z","title":"SAVAE: Leveraging the variational Bayes autoencoder for survival\n analysis","summary":" As in many fields of medical research, survival analysis has witnessed a\ngrowing interest in the application of deep learning techniques to model\ncomplex, high-dimensional, heterogeneous, incomplete, and censored medical\ndata. Current methods often make assumptions about the relations between data\nthat may not be valid in practice. In response, we introduce SAVAE (Survival\nAnalysis Variational Autoencoder), a novel approach based on Variational\nAutoencoders. SAVAE contributes significantly to the field by introducing a\ntailored ELBO formulation for survival analysis, supporting various parametric\ndistributions for covariates and survival time (as long as the log-likelihood\nis differentiable). It offers a general method that consistently performs well\non various metrics, demonstrating robustness and stability through different\nexperiments. Our proposal effectively estimates time-to-event, accounting for\ncensoring, covariate interactions, and time-varying risk associations. We\nvalidate our model in diverse datasets, including genomic, clinical, and\ndemographic data, with varying levels of censoring. This approach demonstrates\ncompetitive performance compared to state-of-the-art techniques, as assessed by\nthe Concordance Index and the Integrated Brier Score. SAVAE also offers an\ninterpretable model that parametrically models covariates and time. Moreover,\nits generative architecture facilitates further applications such as\nclustering, data imputation, and the generation of synthetic patient data\nthrough latent space inference from survival data.\n","authors":["Patricia A. Apellániz","Juan Parras","Santiago Zazo"],"pdf_url":"https://arxiv.org/pdf/2312.14651v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.14647v1","updated":"2023-12-22T12:30:18Z","published":"2023-12-22T12:30:18Z","title":"Pub/Sub Message Brokers for GenAI","summary":" In today's digital world, Generative Artificial Intelligence (GenAI) such as\nLarge Language Models (LLMs) is becoming increasingly prevalent, extending its\nreach across diverse applications. This surge in adoption has sparked a\nsignificant increase in demand for data-centric GenAI models, highlighting the\nnecessity for robust data communication infrastructures. Central to this need\nare message brokers, which serve as essential channels for data transfer within\nvarious system components. This survey aims to delve into a comprehensive\nanalysis of traditional and modern message brokers, offering a comparative\nstudy of prevalent platforms. Our study considers numerous criteria including,\nbut not limited to, open-source availability, integrated monitoring tools,\nmessage prioritization mechanisms, capabilities for parallel processing,\nreliability, distribution and clustering functionalities, authentication\nprocesses, data persistence strategies, fault tolerance, and scalability.\nFurthermore, we explore the intrinsic constraints that the design and operation\nof each message broker might impose, recognizing that these limitations are\ncrucial in understanding their real-world applicability. We then leverage these\ninsights to propose a sophisticated message broker framework -- one designed\nwith the adaptability and robustness necessary to meet the evolving requisites\nof GenAI applications. Finally, this study examines the enhancement of message\nbroker mechanisms specifically for GenAI contexts, emphasizing the criticality\nof developing a versatile message broker framework. Such a framework would be\npoised for quick adaptation, catering to the dynamic and growing demands of\nGenAI in the foreseeable future. Through this dual-pronged approach, we intend\nto contribute a foundational compendium that can guide future innovations and\ninfrastructural advancements in the realm of GenAI data communication.\n","authors":["Alaa Saleh","Susanna Pirttikangas","Lauri Lovén"],"pdf_url":"https://arxiv.org/pdf/2312.14647v1.pdf","comment":"24 pages, 282 references, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2312.14646v1","updated":"2023-12-22T12:28:29Z","published":"2023-12-22T12:28:29Z","title":"Collaborative Synthesis of Patient Records through Multi-Visit Health\n State Inference","summary":" Electronic health records (EHRs) have become the foundation of machine\nlearning applications in healthcare, while the utility of real patient records\nis often limited by privacy and security concerns. Synthetic EHR generation\nprovides an additional perspective to compensate for this limitation. Most\nexisting methods synthesize new records based on real EHR data, without\nconsideration of different types of events in EHR data, which cannot control\nthe event combinations in line with medical common sense. In this paper, we\npropose MSIC, a Multi-visit health Status Inference model for Collaborative EHR\nsynthesis to address these limitations. First, we formulate the synthetic EHR\ngeneration process as a probabilistic graphical model and tightly connect\ndifferent types of events by modeling the latent health states. Then, we derive\na health state inference method tailored for the multi-visit scenario to\neffectively utilize previous records to synthesize current and future records.\nFurthermore, we propose to generate medical reports to add textual descriptions\nfor each medical event, providing broader applications for synthesized EHR\ndata. For generating different paragraphs in each visit, we incorporate a\nmulti-generator deliberation framework to collaborate the message passing of\nmultiple generators and employ a two-phase decoding strategy to generate\nhigh-quality reports. Our extensive experiments on the widely used benchmarks,\nMIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results\non the quality of synthetic data while maintaining low privacy risks.\n","authors":["Hongda Sun","Hongzhan Lin","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2312.14646v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14638v1","updated":"2023-12-22T12:15:52Z","published":"2023-12-22T12:15:52Z","title":"Balancing Energy Efficiency and Distributional Robustness in\n Over-the-Air Federated Learning","summary":" The growing number of wireless edge devices has magnified challenges\nconcerning energy, bandwidth, latency, and data heterogeneity. These challenges\nhave become bottlenecks for distributed learning. To address these issues, this\npaper presents a novel approach that ensures energy efficiency for\ndistributionally robust federated learning (FL) with over air computation\n(AirComp). In this context, to effectively balance robustness with energy\nefficiency, we introduce a novel client selection method that integrates two\ncomplementary insights: a deterministic one that is designed for energy\nefficiency, and a probabilistic one designed for distributional robustness.\nSimulation results underscore the efficacy of the proposed algorithm, revealing\nits superior performance compared to baselines from both robustness and energy\nefficiency perspectives, achieving more than 3-fold energy savings compared to\nthe considered baselines.\n","authors":["Mohamed Badi","Chaouki Ben Issaid","Anis Elgabli","Mehdi Bennis"],"pdf_url":"https://arxiv.org/pdf/2312.14638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14635v1","updated":"2023-12-22T12:13:19Z","published":"2023-12-22T12:13:19Z","title":"Fluid Simulation on Neural Flow Maps","summary":" We introduce Neural Flow Maps, a novel simulation method bridging the\nemerging paradigm of implicit neural representations with fluid simulation\nbased on the theory of flow maps, to achieve state-of-the-art simulation of\ninviscid fluid phenomena. We devise a novel hybrid neural field representation,\nSpatially Sparse Neural Fields (SSNF), which fuses small neural networks with a\npyramid of overlapping, multi-resolution, and spatially sparse grids, to\ncompactly represent long-term spatiotemporal velocity fields at high accuracy.\nWith this neural velocity buffer in hand, we compute long-term, bidirectional\nflow maps and their Jacobians in a mechanistically symmetric manner, to\nfacilitate drastic accuracy improvement over existing solutions. These\nlong-range, bidirectional flow maps enable high advection accuracy with low\ndissipation, which in turn facilitates high-fidelity incompressible flow\nsimulations that manifest intricate vortical structures. We demonstrate the\nefficacy of our neural fluid simulation in a variety of challenging simulation\nscenarios, including leapfrogging vortices, colliding vortices, vortex\nreconnections, as well as vortex generation from moving obstacles and density\ndifferences. Our examples show increased performance over existing methods in\nterms of energy conservation, visual complexity, adherence to experimental\nobservations, and preservation of detailed vortical structures.\n","authors":["Yitong Deng","Hong-Xing Yu","Diyang Zhang","Jiajun Wu","Bo Zhu"],"pdf_url":"https://arxiv.org/pdf/2312.14635v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14628v1","updated":"2023-12-22T11:58:53Z","published":"2023-12-22T11:58:53Z","title":"Towards more sustainable enterprise data and application management with\n cross silo Federated Learning and Analytics","summary":" To comply with new legal requirements and policies committed to privacy\nprotection, more and more companies start to deploy cross-silo Federated\nLearning at global scale, where several clients/silos collaboratively train a\nglobal model under the coordination of a central server. Instead of data\nsharing and transmission, clients train models using their private local data\nand exchange model updates. However, there is little understanding of the\ncarbon emission impact of cross silo Federated Learning due to the lack of\nrelated works. In this study, we first analyze the sustainability aspect of\ncross-silo Federated Learning, across the AI product life cycle instead of\nfocusing only on the model training, with the comparison to the centralized\nmethod. A more holistic quantitative cost and CO2 emission estimation method\nfor real world cross-silo Federated Learning setting is proposed. Secondly, we\npropose a novel data and application management system using cross silo\nFederated Learning and analytics to make IT companies more sustainable and cost\neffective.\n","authors":["Hongliu Cao"],"pdf_url":"https://arxiv.org/pdf/2312.14628v1.pdf","comment":"Presented in Sophia Summit 2023"},{"id":"http://arxiv.org/abs/2312.14625v1","updated":"2023-12-22T11:48:13Z","published":"2023-12-22T11:48:13Z","title":"Hierarchical Multi-Agent Reinforcement Learning for Assessing False-Data\n Injection Attacks on Transportation Networks","summary":" The increasing reliance of drivers on navigation applications has made\ntransportation networks more susceptible to data-manipulation attacks by\nmalicious actors. Adversaries may exploit vulnerabilities in the data\ncollection or processing of navigation services to inject false information,\nand to thus interfere with the drivers' route selection. Such attacks can\nsignificantly increase traffic congestions, resulting in substantial waste of\ntime and resources, and may even disrupt essential services that rely on road\nnetworks. To assess the threat posed by such attacks, we introduce a\ncomputational framework to find worst-case data-injection attacks against\ntransportation networks. First, we devise an adversarial model with a threat\nactor who can manipulate drivers by increasing the travel times that they\nperceive on certain roads. Then, we employ hierarchical multi-agent\nreinforcement learning to find an approximate optimal adversarial strategy for\ndata manipulation. We demonstrate the applicability of our approach through\nsimulating attacks on the Sioux Falls, ND network topology.\n","authors":["Taha Eghtesad","Sirui Li","Yevgeniy Vorobeychik","Aron Laszka"],"pdf_url":"https://arxiv.org/pdf/2312.14625v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.05400v3","updated":"2023-12-22T11:30:28Z","published":"2023-05-09T12:45:43Z","title":"Investigating the Corruption Robustness of Image Classifiers with Random\n Lp-norm Corruptions","summary":" Robustness is a fundamental property of machine learning classifiers required\nto achieve safety and reliability. In the field of adversarial robustness of\nimage classifiers, robustness is commonly defined as the stability of a model\nto all input changes within a p-norm distance. However, in the field of random\ncorruption robustness, variations observed in the real world are used, while\np-norm corruptions are rarely considered. This study investigates the use of\nrandom p-norm corruptions to augment the training and test data of image\nclassifiers. We evaluate the model robustness against imperceptible random\np-norm corruptions and propose a novel robustness metric. We empirically\ninvestigate whether robustness transfers across different p-norms and derive\nconclusions on which p-norm corruptions a model should be trained and\nevaluated. We find that training data augmentation with a combination of p-norm\ncorruptions significantly improves corruption robustness, even on top of\nstate-of-the-art data augmentation schemes.\n","authors":["Georg Siedel","Weijia Shao","Silvia Vock","Andrey Morozov"],"pdf_url":"https://arxiv.org/pdf/2305.05400v3.pdf","comment":"Camera-ready version submitted to VISAPP 2024"},{"id":"http://arxiv.org/abs/2310.19958v2","updated":"2023-12-22T11:29:00Z","published":"2023-10-30T19:18:09Z","title":"PriPrune: Quantifying and Preserving Privacy in Pruned Federated\n Learning","summary":" Federated learning (FL) is a paradigm that allows several client devices and\na server to collaboratively train a global model, by exchanging only model\nupdates, without the devices sharing their local training data. These devices\nare often constrained in terms of communication and computation resources, and\ncan further benefit from model pruning -- a paradigm that is widely used to\nreduce the size and complexity of models. Intuitively, by making local models\ncoarser, pruning is expected to also provide some protection against privacy\nattacks in the context of FL. However this protection has not been previously\ncharacterized, formally or experimentally, and it is unclear if it is\nsufficient against state-of-the-art attacks.\n In this paper, we perform the first investigation of privacy guarantees for\nmodel pruning in FL. We derive information-theoretic upper bounds on the amount\nof information leaked by pruned FL models. We complement and validate these\ntheoretical findings, with comprehensive experiments that involve\nstate-of-the-art privacy attacks, on several state-of-the-art FL pruning\nschemes, using benchmark datasets. This evaluation provides valuable insights\ninto the choices and parameters that can affect the privacy protection provided\nby pruning. Based on these insights, we introduce PriPrune -- a privacy-aware\nalgorithm for local model pruning, which uses a personalized per-client defense\nmask and adapts the defense pruning rate so as to jointly optimize privacy and\nmodel performance. PriPrune is universal in that can be applied after any\npruned FL scheme on the client, without modification, and protects against any\ninversion attack by the server. Our empirical evaluation demonstrates that\nPriPrune significantly improves the privacy-accuracy tradeoff compared to\nstate-of-the-art pruned FL schemes that do not take privacy into account.\n","authors":["Tianyue Chu","Mengwei Yang","Nikolaos Laoutaris","Athina Markopoulou"],"pdf_url":"https://arxiv.org/pdf/2310.19958v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15639v3","updated":"2023-12-22T11:15:44Z","published":"2023-09-27T13:18:23Z","title":"Enhancing Sharpness-Aware Optimization Through Variance Suppression","summary":" Sharpness-aware minimization (SAM) has well documented merits in enhancing\ngeneralization of deep neural networks, even without sizable data augmentation.\nEmbracing the geometry of the loss function, where neighborhoods of 'flat\nminima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing\nthe maximum loss caused by an adversary perturbing parameters within the\nneighborhood. Although critical to account for sharpness of the loss function,\nsuch an 'over-friendly adversary' can curtail the outmost level of\ngeneralization. The novel approach of this contribution fosters stabilization\nof adversaries through variance suppression (VaSSO) to avoid such friendliness.\nVaSSO's provable stability safeguards its numerical improvement over SAM in\nmodel-agnostic tasks, including image classification and machine translation.\nIn addition, experiments confirm that VaSSO endows SAM with robustness against\nhigh levels of label noise.\n","authors":["Bingcong Li","Georgios B. Giannakis"],"pdf_url":"https://arxiv.org/pdf/2309.15639v3.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2312.14606v1","updated":"2023-12-22T11:03:12Z","published":"2023-12-22T11:03:12Z","title":"Explainable Multi-Camera 3D Object Detection with Transformer-Based\n Saliency Maps","summary":" Vision Transformers (ViTs) have achieved state-of-the-art results on various\ncomputer vision tasks, including 3D object detection. However, their end-to-end\nimplementation also makes ViTs less explainable, which can be a challenge for\ndeploying them in safety-critical applications, such as autonomous driving,\nwhere it is important for authorities, developers, and users to understand the\nmodel's reasoning behind its predictions. In this paper, we propose a novel\nmethod for generating saliency maps for a DetR-like ViT with multiple camera\ninputs used for 3D object detection. Our method is based on the raw attention\nand is more efficient than gradient-based methods. We evaluate the proposed\nmethod on the nuScenes dataset using extensive perturbation tests and show that\nit outperforms other explainability methods in terms of visual quality and\nquantitative metrics. We also demonstrate the importance of aggregating\nattention across different layers of the transformer. Our work contributes to\nthe development of explainable AI for ViTs, which can help increase trust in AI\napplications by establishing more transparency regarding the inner workings of\nAI models.\n","authors":["Till Beemelmanns","Wassim Zahr","Lutz Eckstein"],"pdf_url":"https://arxiv.org/pdf/2312.14606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14590v1","updated":"2023-12-22T10:29:18Z","published":"2023-12-22T10:29:18Z","title":"SIG: Speaker Identification in Literature via Prompt-Based Generation","summary":" Identifying speakers of quotations in narratives is an important task in\nliterary analysis, with challenging scenarios including the out-of-domain\ninference for unseen speakers, and non-explicit cases where there are no\nspeaker mentions in surrounding context. In this work, we propose a simple and\neffective approach SIG, a generation-based method that verbalizes the task and\nquotation input based on designed prompt templates, which also enables easy\nintegration of other auxiliary tasks that further bolster the speaker\nidentification performance. The prediction can either come from direct\ngeneration by the model, or be determined by the highest generation probability\nof each speaker candidate. Based on our approach design, SIG supports\nout-of-domain evaluation, and achieves open-world classification paradigm that\nis able to accept any forms of candidate input. We perform both cross-domain\nevaluation and in-domain evaluation on PDNC, the largest dataset of this task,\nwhere empirical results suggest that SIG outperforms previous baselines of\ncomplicated designs, as well as the zero-shot ChatGPT, especially excelling at\nthose hard non-explicit scenarios by up to 17% improvement. Additional\nexperiments on another dataset WP further corroborate the efficacy of SIG.\n","authors":["Zhenlin Su","Liyan Xu","Jin Xu","Jiangnan Li","Mingdu Huangfu"],"pdf_url":"https://arxiv.org/pdf/2312.14590v1.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2312.14589v1","updated":"2023-12-22T10:26:31Z","published":"2023-12-22T10:26:31Z","title":"Non-Denoising Forward-Time Diffusions","summary":" The scope of this paper is generative modeling through diffusion processes.\nAn approach falling within this paradigm is the work of Song et al. (2021),\nwhich relies on a time-reversal argument to construct a diffusion process\ntargeting the desired data distribution. We show that the time-reversal\nargument, common to all denoising diffusion probabilistic modeling proposals,\nis not necessary. We obtain diffusion processes targeting the desired data\ndistribution by taking appropriate mixtures of diffusion bridges. The resulting\ntransport is exact by construction, allows for greater flexibility in choosing\nthe dynamics of the underlying diffusion, and can be approximated by means of a\nneural network via novel training objectives. We develop a unifying view of the\ndrift adjustments corresponding to our and to time-reversal approaches and make\nuse of this representation to inspect the inner workings of diffusion-based\ngenerative models. Finally, we leverage on scalable simulation and inference\ntechniques common in spatial statistics to move beyond fully factorial\ndistributions in the underlying diffusion dynamics. The methodological advances\ncontained in this work contribute toward establishing a general framework for\ngenerative modeling based on diffusion processes.\n","authors":["Stefano Peluchetti"],"pdf_url":"https://arxiv.org/pdf/2312.14589v1.pdf","comment":"original date: 18 Nov 2021; archival of ICLR submission\n (https://openreview.net/forum?id=oVfIKuhqfC); no differences"},{"id":"http://arxiv.org/abs/2304.00917v2","updated":"2023-12-22T10:25:03Z","published":"2023-04-03T12:13:42Z","title":"Diffusion Bridge Mixture Transports, Schrödinger Bridge Problems and\n Generative Modeling","summary":" The dynamic Schr\\\"odinger bridge problem seeks a stochastic process that\ndefines a transport between two target probability measures, while optimally\nsatisfying the criteria of being closest, in terms of Kullback-Leibler\ndivergence, to a reference process. We propose a novel sampling-based iterative\nalgorithm, the iterated diffusion bridge mixture (IDBM) procedure, aimed at\nsolving the dynamic Schr\\\"odinger bridge problem. The IDBM procedure exhibits\nthe attractive property of realizing a valid transport between the target\nprobability measures at each iteration. We perform an initial theoretical\ninvestigation of the IDBM procedure, establishing its convergence properties.\nThe theoretical findings are complemented by numerical experiments illustrating\nthe competitive performance of the IDBM procedure. Recent advancements in\ngenerative modeling employ the time-reversal of a diffusion process to define a\ngenerative process that approximately transports a simple distribution to the\ndata distribution. As an alternative, we propose utilizing the first iteration\nof the IDBM procedure as an approximation-free method for realizing this\ntransport. This approach offers greater flexibility in selecting the generative\nprocess dynamics and exhibits accelerated training and superior sample quality\nover larger discretization intervals. In terms of implementation, the necessary\nmodifications are minimally intrusive, being limited to the training loss\ndefinition.\n","authors":["Stefano Peluchetti"],"pdf_url":"https://arxiv.org/pdf/2304.00917v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14574v1","updated":"2023-12-22T10:10:50Z","published":"2023-12-22T10:10:50Z","title":"MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning","summary":" Prompt learning has demonstrated impressive efficacy in the fine-tuning of\nmultimodal large models to a wide range of downstream tasks. Nonetheless,\napplying existing prompt learning methods for the diagnosis of neurological\ndisorder still suffers from two issues: (i) existing methods typically treat\nall patches equally, despite the fact that only a small number of patches in\nneuroimaging are relevant to the disease, and (ii) they ignore the structural\ninformation inherent in the brain connection network which is crucial for\nunderstanding and diagnosing neurological disorders. To tackle these issues, we\nintroduce a novel prompt learning model by learning graph prompts during the\nfine-tuning process of multimodal large models for diagnosing neurological\ndisorders. Specifically, we first leverage GPT-4 to obtain relevant disease\nconcepts and compute semantic similarity between these concepts and all\npatches. Secondly, we reduce the weight of irrelevant patches according to the\nsemantic similarity between each patch and disease-related concepts. Moreover,\nwe construct a graph among tokens based on these concepts and employ a graph\nconvolutional network layer to extract the structural information of the graph,\nwhich is used to prompt the pre-trained multimodal large models for diagnosing\nneurological disorders. Extensive experiments demonstrate that our method\nachieves superior performance for neurological disorder diagnosis compared with\nstate-of-the-art methods and validated by clinicians.\n","authors":["Liang Peng","Songyue Cai","Zongqian Wu","Huifang Shang","Xiaofeng Zhu","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2312.14574v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14571v1","updated":"2023-12-22T10:00:50Z","published":"2023-12-22T10:00:50Z","title":"Data is Moody: Discovering Data Modification Rules from Process Event\n Logs","summary":" Although event logs are a powerful source to gain insight about the behavior\nof the underlying business process, existing work primarily focuses on finding\npatterns in the activity sequences of an event log, while ignoring event\nattribute data. Event attribute data has mostly been used to predict event\noccurrences and process outcome, but the state of the art neglects to mine\nsuccinct and interpretable rules how event attribute data changes during\nprocess execution. Subgroup discovery and rule-based classification approaches\nlack the ability to capture the sequential dependencies present in event logs,\nand thus lead to unsatisfactory results with limited insight into the process\nbehavior.\n Given an event log, we are interested in finding accurate yet succinct and\ninterpretable if-then rules how the process modifies data. We formalize the\nproblem in terms of the Minimum Description Length (MDL) principle, by which we\nchoose the model with the best lossless description of the data. Additionally,\nwe propose the greedy Moody algorithm to efficiently search for rules. By\nextensive experiments on both synthetic and real-world data, we show Moody\nindeed finds compact and interpretable rules, needs little data for accurate\ndiscovery, and is robust to noise.\n","authors":["Marco Bjarne Schuster","Boris Wiegand","Jilles Vreeken"],"pdf_url":"https://arxiv.org/pdf/2312.14571v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14567v1","updated":"2023-12-22T09:58:39Z","published":"2023-12-22T09:58:39Z","title":"Accelerated Convergence of Stochastic Heavy Ball Method under\n Anisotropic Gradient Noise","summary":" Heavy-ball momentum with decaying learning rates is widely used with SGD for\noptimizing deep learning models. In contrast to its empirical popularity, the\nunderstanding of its theoretical property is still quite limited, especially\nunder the standard anisotropic gradient noise condition for quadratic\nregression problems. Although it is widely conjectured that heavy-ball momentum\nmethod can provide accelerated convergence and should work well in large batch\nsettings, there is no rigorous theoretical analysis. In this paper, we fill\nthis theoretical gap by establishing a non-asymptotic convergence bound for\nstochastic heavy-ball methods with step decay scheduler on quadratic\nobjectives, under the anisotropic gradient noise condition. As a direct\nimplication, we show that heavy-ball momentum can provide\n$\\tilde{\\mathcal{O}}(\\sqrt{\\kappa})$ accelerated convergence of the bias term\nof SGD while still achieving near-optimal convergence rate with respect to the\nstochastic variance term. The combined effect implies an overall convergence\nrate within log factors from the statistical minimax rate. This means SGD with\nheavy-ball momentum is useful in the large-batch settings such as distributed\nmachine learning or federated learning, where a smaller number of iterations\ncan significantly reduce the number of communication rounds, leading to\nacceleration in practice.\n","authors":["Rui Pan","Yuxing Liu","Xiaoyu Wang","Tong Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.14567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14564v1","updated":"2023-12-22T09:48:45Z","published":"2023-12-22T09:48:45Z","title":"Online Covering with Multiple Experts","summary":" Designing online algorithms with machine learning predictions is a recent\ntechnique beyond the worst-case paradigm for various practically relevant\nonline problems (scheduling, caching, clustering, ski rental, etc.). While most\nprevious learning-augmented algorithm approaches focus on integrating the\npredictions of a single oracle, we study the design of online algorithms with\n\\emph{multiple} experts. To go beyond the popular benchmark of a static best\nexpert in hindsight, we propose a new \\emph{dynamic} benchmark (linear\ncombinations of predictions that change over time). We present a competitive\nalgorithm in the new dynamic benchmark with a performance guarantee of $O(\\log\nK)$, where $K$ is the number of experts, for $0-1$ online optimization\nproblems. Furthermore, our multiple-expert approach provides a new perspective\non how to combine in an online manner several online algorithms - a\nlong-standing central subject in the online algorithm research community.\n","authors":["Enikő Kevi","Kim-Thang Nguyen"],"pdf_url":"https://arxiv.org/pdf/2312.14564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15930v4","updated":"2023-12-22T09:38:26Z","published":"2023-05-25T10:58:46Z","title":"End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes","summary":" Meta-Bayesian optimisation (meta-BO) aims to improve the sample efficiency of\nBayesian optimisation by leveraging data from related tasks. While previous\nmethods successfully meta-learn either a surrogate model or an acquisition\nfunction independently, joint training of both components remains an open\nchallenge. This paper proposes the first end-to-end differentiable meta-BO\nframework that generalises neural processes to learn acquisition functions via\ntransformer architectures. We enable this end-to-end framework with\nreinforcement learning (RL) to tackle the lack of labelled acquisition data.\nEarly on, we notice that training transformer-based neural processes from\nscratch with RL is challenging due to insufficient supervision, especially when\nrewards are sparse. We formalise this claim with a combinatorial analysis\nshowing that the widely used notion of regret as a reward signal exhibits a\nlogarithmic sparsity pattern in trajectory lengths. To tackle this problem, we\naugment the RL objective with an auxiliary task that guides part of the\narchitecture to learn a valid probabilistic model as an inductive bias. We\ndemonstrate that our method achieves state-of-the-art regret results against\nvarious baselines in experiments on standard hyperparameter optimisation tasks\nand also outperforms others in the real-world problems of mixed-integer\nprogramming tuning, antibody design, and logic synthesis for electronic design\nautomation.\n","authors":["Alexandre Maraval","Matthieu Zimmer","Antoine Grosnit","Haitham Bou Ammar"],"pdf_url":"https://arxiv.org/pdf/2305.15930v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14552v1","updated":"2023-12-22T09:28:30Z","published":"2023-12-22T09:28:30Z","title":"Machine learning for structure-guided materials and process design","summary":" In recent years, there has been a growing interest in accelerated materials\ninnovation in both, research and industry. However, to truly add value to the\ndevelopment of new advanced materials, it is inevitable to take into account\nmanufacturing processes and thereby tailor materials design approaches to\nsupport downstream process design approaches. As a major step into this\ndirection, we present a holistic optimization approach that covers the entire\nmaterials process-structure-property chain. Our approach specifically employs\nmachine learning techniques to address two critical identification problems.\nThe first is to solve a materials design problem, which involves identifying\nnear-optimal material structures that exhibit desired macroscopic properties.\nThe second is to solve a process design problem that is to find an optimal\nprocessing path to manufacture these material structures. Both identification\nproblems are typically ill-posed, which presents a significant challenge for\nsolution approaches. However, the non-unique nature of these problems also\noffers an important advantage for processing: By having several target\nstructures that perform similarly well, the corresponding processes can be\nefficiently guided towards manufacturing the best reachable structure. In\nparticular, we apply deep reinforcement learning for process design in\ncombination with a multi-task learning-based optimization approach for\nmaterials design. The functionality of the approach will be demonstrated by\nusing it to manufacture crystallographic textures with desired properties in a\nmetal forming process.\n","authors":["Lukas Morand","Tarek Iraki","Johannes Dornheim","Stefan Sandfeld","Norbert Link","Dirk Helm"],"pdf_url":"https://arxiv.org/pdf/2312.14552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14535v1","updated":"2023-12-22T09:02:01Z","published":"2023-12-22T09:02:01Z","title":"ADA-GAD: Anomaly-Denoised Autoencoders for Graph Anomaly Detection","summary":" Graph anomaly detection is crucial for identifying nodes that deviate from\nregular behavior within graphs, benefiting various domains such as fraud\ndetection and social network. Although existing reconstruction-based methods\nhave achieved considerable success, they may face the \\textit{Anomaly\nOverfitting} and \\textit{Homophily Trap} problems caused by the abnormal\npatterns in the graph, breaking the assumption that normal nodes are often\nbetter reconstructed than abnormal ones. Our observations indicate that models\ntrained on graphs with fewer anomalies exhibit higher detection performance.\nBased on this insight, we introduce a novel two-stage framework called\nAnomaly-Denoised Autoencoders for Graph Anomaly Detection (ADA-GAD). In the\nfirst stage, we design a learning-free anomaly-denoised augmentation method to\ngenerate graphs with reduced anomaly levels. We pretrain graph autoencoders on\nthese augmented graphs at multiple levels, which enables the graph autoencoders\nto capture normal patterns. In the next stage, the decoders are retrained for\ndetection on the original graph, benefiting from the multi-level\nrepresentations learned in the previous stage. Meanwhile, we propose the node\nanomaly distribution regularization to further alleviate \\textit{Anomaly\nOverfitting}. We validate the effectiveness of our approach through extensive\nexperiments on both synthetic and real-world datasets.\n","authors":["Junwei He","Qianqian Xu","Yangbangyan Jiang","Zitai Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2312.14535v1.pdf","comment":"Accepted to AAAI-2024"},{"id":"http://arxiv.org/abs/2312.14533v1","updated":"2023-12-22T08:58:42Z","published":"2023-12-22T08:58:42Z","title":"Multi-view user representation learning for user matching without\n personal information","summary":" As the digitization of travel industry accelerates, analyzing and\nunderstanding travelers' behaviors becomes increasingly important. However,\ntraveler data frequently exhibit high data sparsity due to the relatively low\nfrequency of user interactions with travel providers. Compounding this effect\nthe multiplication of devices, accounts and platforms while browsing travel\nproducts online also leads to data dispersion. To deal with these challenges,\nprobabilistic traveler matching can be used. Most existing solutions for user\nmatching are not suitable for traveler matching as a traveler's browsing\nhistory is typically short and URLs in the travel industry are very\nheterogeneous with many tokens. To deal with these challenges, we propose the\nsimilarity based multi-view information fusion to learn a better user\nrepresentation from URLs by treating the URLs as multi-view data. The\nexperimental results show that the proposed multi-view user representation\nlearning can take advantage of the complementary information from different\nviews, highlight the key information in URLs and perform significantly better\nthan other representation learning solutions for the user matching task.\n","authors":["Hongliu Cao","Ilias El Baamrani","Eoin Thomas"],"pdf_url":"https://arxiv.org/pdf/2312.14533v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14532v1","updated":"2023-12-22T08:57:43Z","published":"2023-12-22T08:57:43Z","title":"DuaLight: Enhancing Traffic Signal Control by Leveraging\n Scenario-Specific and Scenario-Shared Knowledge","summary":" Reinforcement learning has been revolutionizing the traditional traffic\nsignal control task, showing promising power to relieve congestion and improve\nefficiency. However, the existing methods lack effective learning mechanisms\ncapable of absorbing dynamic information inherent to a specific scenario and\nuniversally applicable dynamic information across various scenarios. Moreover,\nwithin each specific scenario, they fail to fully capture the essential\nempirical experiences about how to coordinate between neighboring and target\nintersections, leading to sub-optimal system-wide outcomes.\n Viewing these issues, we propose DuaLight, which aims to leverage both the\nexperiential information within a single scenario and the generalizable\ninformation across various scenarios for enhanced decision-making.\nSpecifically, DuaLight introduces a scenario-specific experiential weight\nmodule with two learnable parts: Intersection-wise and Feature-wise, guiding\nhow to adaptively utilize neighbors and input features for each scenario, thus\nproviding a more fine-grained understanding of different intersections.\nFurthermore, we implement a scenario-shared Co-Train module to facilitate the\nlearning of generalizable dynamics information across different scenarios.\nEmpirical results on both real-world and synthetic scenarios show DuaLight\nachieves competitive performance across various metrics, offering a promising\nsolution to alleviate traffic congestion, with 3-7\\% improvements. The code is\navailable under: https://github.com/lujiaming-12138/DuaLight.\n","authors":["Jiaming Lu","Jingqing Ruan","Haoyuan Jiang","Ziyue Li","Hangyu Mao","Rui Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.14532v1.pdf","comment":"Accepted by AAMAS2024"},{"id":"http://arxiv.org/abs/2312.14528v1","updated":"2023-12-22T08:52:08Z","published":"2023-12-22T08:52:08Z","title":"An effective and efficient green federated learning method for one-layer\n neural networks","summary":" Nowadays, machine learning algorithms continue to grow in complexity and\nrequire a substantial amount of computational resources and energy. For these\nreasons, there is a growing awareness of the development of new green\nalgorithms and distributed AI can contribute to this. Federated learning (FL)\nis one of the most active research lines in machine learning, as it allows the\ntraining of collaborative models in a distributed way, an interesting option in\nmany real-world environments, such as the Internet of Things, allowing the use\nof these models in edge computing devices. In this work, we present a FL\nmethod, based on a neural network without hidden layers, capable of generating\na global collaborative model in a single training round, unlike traditional FL\nmethods that require multiple rounds for convergence. This allows obtaining an\neffective and efficient model that simplifies the management of the training\nprocess. Moreover, this method preserve data privacy by design, a crucial\naspect in current data protection regulations. We conducted experiments with\nlarge datasets and a large number of federated clients. Despite being based on\na network model without hidden layers, it maintains in all cases competitive\naccuracy results compared to more complex state-of-the-art machine learning\nmodels. Furthermore, we show that the method performs equally well in both\nidentically and non-identically distributed scenarios. Finally, it is an\nenvironmentally friendly algorithm as it allows significant energy savings\nduring the training process compared to its centralized counterpart.\n","authors":["Oscar Fontenla-Romero","Bertha Guijarro-Berdiñas","Elena Hernández-Pereira","Beatriz Pérez-Sánchez"],"pdf_url":"https://arxiv.org/pdf/2312.14528v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.10425v4","updated":"2023-12-22T08:45:34Z","published":"2021-12-20T09:52:12Z","title":"Model-based Clustering with Missing Not At Random Data","summary":" Model-based unsupervised learning, as any learning task, stalls as soon as\nmissing data occurs. This is even more true when the missing data are\ninformative, or said missing not at random (MNAR). In this paper, we propose\nmodel-based clustering algorithms designed to handle very general types of\nmissing data, including MNAR data. To do so, we introduce a mixture model for\ndifferent types of data (continuous, count, categorical and mixed) to jointly\nmodel the data distribution and the MNAR mechanism, remaining vigilant to the\nrelative degrees of freedom of each. Several MNAR models are discussed, for\nwhich the cause of the missingness can depend on both the values of the missing\nvariable themselves and on the class membership. However, we focus on a\nspecific MNAR model, called MNARz, for which the missingness only depends on\nthe class membership. We first underline its ease of estimation, by showing\nthat the statistical inference can be carried out on the data matrix\nconcatenated with the missing mask considering finally a standard MAR\nmechanism. Consequently, we propose to perform clustering using the Expectation\nMaximization algorithm, specially developed for this simplified\nreinterpretation. Finally, we assess the numerical performances of the proposed\nmethods on synthetic data and on the real medical registry TraumaBase as well.\n","authors":["Aude Sportisse","Matthieu Marbac","Fabien Laporte","Gilles Celeux","Claire Boyer","Julie Josse","Christophe Biernacki"],"pdf_url":"https://arxiv.org/pdf/2112.10425v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14507v1","updated":"2023-12-22T08:10:30Z","published":"2023-12-22T08:10:30Z","title":"Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and\n Spectral Optimal Transport","summary":" In neural audio signal processing, pitch conditioning has been used to\nenhance the performance of synthesizers. However, jointly training pitch\nestimators and synthesizers is a challenge when using standard audio-to-audio\nreconstruction loss, leading to reliance on external pitch trackers. To address\nthis issue, we propose using a spectral loss function inspired by optimal\ntransportation theory that minimizes the displacement of spectral energy. We\nvalidate this approach through an unsupervised autoencoding task that fits a\nharmonic template to harmonic signals. We jointly estimate the fundamental\nfrequency and amplitudes of harmonics using a lightweight encoder and\nreconstruct the signals using a differentiable harmonic synthesizer. The\nproposed approach offers a promising direction for improving unsupervised\nparameter estimation in neural audio applications.\n","authors":["Bernardo Torres","Geoffroy Peeters","Gaël Richard"],"pdf_url":"https://arxiv.org/pdf/2312.14507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14504v1","updated":"2023-12-22T08:08:45Z","published":"2023-12-22T08:08:45Z","title":"Theory of Hallucinations based on Equivariance","summary":" Equivariance is an important feature in machine learning, including language\nmodels. It ensures that any sequences of phrases with the same meanings are\ninterpreted consistently. For example, the sentence 'There is a cat on the\ntable' should be interpreted by language models as it is, regardless of\nvariations in its token-level expression. Building on this insight, I propose a\nnew theory suggesting that insufficient equivariance in language models can\nlead to hallucinations. According to this theory, which is both intuitive and\nnovel, language models trained on relatively small datasets tend to\nmisinterpret input texts and/or generate incorrect texts (i.e.,\nhallucinations). To test this theory, I developed a toy model known as 'dancing\nmen', which is a character-level substitution cipher. Additionally, I propose a\nnovel technique based on the T5 (Text To Text Transfer Transformer) model to\nefficiently decipher these codes without relying on frequency analysis. I have\nfound that this T5 model can almost completely solve the cipher, demonstrating\nits ability to acquire equivariance in this frame. This method could be scaled\nup to word-level and sentence-level substitution ciphers, analogous to large\nlanguage models without tokenizers or dictionaries. This scalability makes it\nsuitable for investigating the proposed link between inadequate equivariance\nacquisition and the emergence of hallucinations.\n","authors":["Hisaichi Shibata"],"pdf_url":"https://arxiv.org/pdf/2312.14504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14499v1","updated":"2023-12-22T07:56:30Z","published":"2023-12-22T07:56:30Z","title":"Hutchinson Trace Estimation for High-Dimensional and High-Order\n Physics-Informed Neural Networks","summary":" Physics-Informed Neural Networks (PINNs) have proven effective in solving\npartial differential equations (PDEs), especially when some data are available\nby blending seamlessly data and physics. However, extending PINNs to\nhigh-dimensional and even high-order PDEs encounters significant challenges due\nto the computational cost associated with automatic differentiation in the\nresidual loss. Herein, we address the limitations of PINNs in handling\nhigh-dimensional and high-order PDEs by introducing Hutchinson Trace Estimation\n(HTE). Starting with the second-order high-dimensional PDEs ubiquitous in\nscientific computing, HTE transforms the calculation of the entire Hessian\nmatrix into a Hessian vector product (HVP). This approach alleviates the\ncomputational bottleneck via Taylor-mode automatic differentiation and\nsignificantly reduces memory consumption from the Hessian matrix to HVP. We\nfurther showcase HTE's convergence to the original PINN loss and its unbiased\nbehavior under specific conditions. Comparisons with Stochastic Dimension\nGradient Descent (SDGD) highlight the distinct advantages of HTE, particularly\nin scenarios with significant variance among dimensions. We further extend HTE\nto higher-order and higher-dimensional PDEs, specifically addressing the\nbiharmonic equation. By employing tensor-vector products (TVP), HTE efficiently\ncomputes the colossal tensor associated with the fourth-order high-dimensional\nbiharmonic equation, saving memory and enabling rapid computation. The\neffectiveness of HTE is illustrated through experimental setups, demonstrating\ncomparable convergence rates with SDGD under memory and speed constraints.\nAdditionally, HTE proves valuable in accelerating the Gradient-Enhanced PINN\n(gPINN) version as well as the Biharmonic equation. Overall, HTE opens up a new\ncapability in scientific machine learning for tackling high-order and\nhigh-dimensional PDEs.\n","authors":["Zheyuan Hu","Zekun Shi","George Em Karniadakis","Kenji Kawaguchi"],"pdf_url":"https://arxiv.org/pdf/2312.14499v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2206.11004v4","updated":"2023-12-22T07:50:59Z","published":"2022-06-22T12:07:50Z","title":"Auto-Encoding Adversarial Imitation Learning","summary":" Reinforcement learning (RL) provides a powerful framework for\ndecision-making, but its application in practice often requires a carefully\ndesigned reward function. Adversarial Imitation Learning (AIL) sheds light on\nautomatic policy acquisition without access to the reward signal from the\nenvironment. In this work, we propose Auto-Encoding Adversarial Imitation\nLearning (AEAIL), a robust and scalable AIL framework. To induce expert\npolicies from demonstrations, AEAIL utilizes the reconstruction error of an\nauto-encoder as a reward signal, which provides more information for optimizing\npolicies than the prior discriminator-based ones. Subsequently, we use the\nderived objective functions to train the auto-encoder and the agent policy.\nExperiments show that our AEAIL performs superior compared to state-of-the-art\nmethods on both state and image based environments. More importantly, AEAIL\nshows much better robustness when the expert demonstrations are noisy.\n","authors":["Kaifeng Zhang","Rui Zhao","Ziming Zhang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2206.11004v4.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2309.12204v2","updated":"2023-12-22T07:49:25Z","published":"2023-09-16T10:43:59Z","title":"PrNet: A Neural Network for Correcting Pseudoranges to Improve\n Positioning with Android Raw GNSS Measurements","summary":" We present a neural network for mitigating biased errors in pseudoranges to\nimprove localization performance with data collected from mobile phones. A\nsatellite-wise Multilayer Perceptron (MLP) is designed to regress the\npseudorange bias correction from six satellite, receiver, context-related\nfeatures derived from Android raw Global Navigation Satellite System (GNSS)\nmeasurements. To train the MLP, we carefully calculate the target values of\npseudorange bias using location ground truth and smoothing techniques and\noptimize a loss function involving the estimation residuals of smartphone clock\nbias. The corrected pseudoranges are then used by a model-based localization\nengine to compute locations. The Google Smartphone Decimeter Challenge (GSDC)\ndataset, which contains Android smartphone data collected from both rural and\nurban areas, is utilized for evaluation. Both fingerprinting and cross-trace\nlocalization results demonstrate that our proposed method outperforms\nmodel-based and state-of-the-art data-driven approaches.\n","authors":["Xu Weng","Keck Voon Ling","Haochen Liu"],"pdf_url":"https://arxiv.org/pdf/2309.12204v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.08655v2","updated":"2023-12-22T07:46:49Z","published":"2023-11-15T02:28:52Z","title":"Review of AlexNet for Medical Image Classification","summary":" In recent years, the rapid development of deep learning has led to a wide\nrange of applications in the field of medical image classification. The\nvariants of neural network models with ever-increasing performance share some\ncommonalities: to try to mitigate overfitting, improve generalization, avoid\ngradient vanishing and exploding, etc. AlexNet first utilizes the dropout\ntechnique to mitigate overfitting and the ReLU activation function to avoid\ngradient vanishing. Therefore, we focus our discussion on AlexNet, which has\ncontributed greatly to the development of CNNs in 2012. After reviewing over 40\npapers, including journal papers and conference papers, we give a narrative on\nthe technical details, advantages, and application areas of AlexNet.\n","authors":["Wenhao Tang","Junding Sun","Shuihua Wang","Yudong Zhang"],"pdf_url":"https://arxiv.org/pdf/2311.08655v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.12459v2","updated":"2023-12-22T07:42:59Z","published":"2022-08-26T06:48:01Z","title":"Meta Objective Guided Disambiguation for Partial Label Learning","summary":" Partial label learning (PLL) is a typical weakly supervised learning\nframework, where each training instance is associated with a candidate label\nset, among which only one label is valid. To solve PLL problems, typically\nmethods try to perform disambiguation for candidate sets by either using prior\nknowledge, such as structure information of training data, or refining model\noutputs in a self-training manner. Unfortunately, these methods often fail to\nobtain a favorable performance due to the lack of prior information or\nunreliable predictions in the early stage of model training. In this paper, we\npropose a novel framework for partial label learning with meta objective guided\ndisambiguation (MoGD), which aims to recover the ground-truth label from\ncandidate labels set by solving a meta objective on a small validation set.\nSpecifically, to alleviate the negative impact of false positive labels, we\nre-weight each candidate label based on the meta loss on the validation set.\nThen, the classifier is trained by minimizing the weighted cross entropy loss.\nThe proposed method can be easily implemented by using various deep networks\nwith the ordinary SGD optimizer. Theoretically, we prove the convergence\nproperty of meta objective and derive the estimation error bounds of the\nproposed method. Extensive experiments on various benchmark datasets and\nreal-world PLL datasets demonstrate that the proposed method can achieve\ncompetent performance when compared with the state-of-the-art methods.\n","authors":["Bo-Shi Zou","Ming-Kun Xie","Sheng-Jun Huang"],"pdf_url":"https://arxiv.org/pdf/2208.12459v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2312.14478v1","updated":"2023-12-22T07:05:13Z","published":"2023-12-22T07:05:13Z","title":"Federated Learning via Input-Output Collaborative Distillation","summary":" Federated learning (FL) is a machine learning paradigm in which distributed\nlocal nodes collaboratively train a central model without sharing individually\nheld private data. Existing FL methods either iteratively share local model\nparameters or deploy co-distillation. However, the former is highly susceptible\nto private data leakage, and the latter design relies on the prerequisites of\ntask-relevant real data. Instead, we propose a data-free FL framework based on\nlocal-to-central collaborative distillation with direct input and output space\nexploitation. Our design eliminates any requirement of recursive local\nparameter exchange or auxiliary task-relevant data to transfer knowledge,\nthereby giving direct privacy control to local users. In particular, to cope\nwith the inherent data heterogeneity across locals, our technique learns to\ndistill input on which each local model produces consensual yet unique results\nto represent each expertise. Our proposed FL framework achieves notable\nprivacy-utility trade-offs with extensive experiments on image classification\nand segmentation tasks under various real-world heterogeneous federated\nlearning settings on both natural and medical images.\n","authors":["Xuan Gong","Shanglin Li","Yuxiang Bao","Barry Yao","Yawen Huang","Ziyan Wu","Baochang Zhang","Yefeng Zheng","David Doermann"],"pdf_url":"https://arxiv.org/pdf/2312.14478v1.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2305.01658v2","updated":"2023-12-22T06:54:27Z","published":"2023-05-02T04:11:23Z","title":"FlightBERT++: A Non-autoregressive Multi-Horizon Flight Trajectory\n Prediction Framework","summary":" Flight Trajectory Prediction (FTP) is an essential task in Air Traffic\nControl (ATC), which can assist air traffic controllers in managing airspace\nmore safely and efficiently. Existing approaches generally perform\nmulti-horizon FTP tasks in an autoregressive manner, thereby suffering from\nerror accumulation and low-efficiency problems. In this paper, a novel\nframework, called FlightBERT++, is proposed to i) forecast multi-horizon flight\ntrajectories directly in a non-autoregressive way, and ii) improve the\nlimitation of the binary encoding (BE) representation in the FlightBERT.\nSpecifically, the FlightBERT++ is implemented by a generalized encoder-decoder\narchitecture, in which the encoder learns the temporal-spatial patterns from\nhistorical observations and the decoder predicts the flight status for the\nfuture horizons. Compared with conventional architecture, an innovative\nhorizon-aware contexts generator is dedicatedly designed to consider the prior\nhorizon information, which further enables non-autoregressive multi-horizon\nprediction. Moreover, a differential prompted decoder is proposed to enhance\nthe capability of the differential predictions by leveraging the stationarity\nof the differential sequence. The experimental results on a real-world dataset\ndemonstrated that the FlightBERT++ outperformed the competitive baselines in\nboth FTP performance and computational efficiency.\n","authors":["Dongyue Guo","Zheng Zhang","Zhen Yan","Jianwei Zhang","Yi Lin"],"pdf_url":"https://arxiv.org/pdf/2305.01658v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2312.14470v1","updated":"2023-12-22T06:45:45Z","published":"2023-12-22T06:45:45Z","title":"Safe Reinforcement Learning with Instantaneous Constraints: The Role of\n Aggressive Exploration","summary":" This paper studies safe Reinforcement Learning (safe RL) with linear function\napproximation and under hard instantaneous constraints where unsafe actions\nmust be avoided at each step. Existing studies have considered safe RL with\nhard instantaneous constraints, but their approaches rely on several key\nassumptions: $(i)$ the RL agent knows a safe action set for {\\it every} state\nor knows a {\\it safe graph} in which all the state-action-state triples are\nsafe, and $(ii)$ the constraint/cost functions are {\\it linear}. In this paper,\nwe consider safe RL with instantaneous hard constraints without assumption\n$(i)$ and generalize $(ii)$ to Reproducing Kernel Hilbert Space (RKHS). Our\nproposed algorithm, LSVI-AE, achieves $\\tilde{\\cO}(\\sqrt{d^3H^4K})$ regret and\n$\\tilde{\\cO}(H \\sqrt{dK})$ hard constraint violation when the cost function is\nlinear and $\\cO(H\\gamma_K \\sqrt{K})$ hard constraint violation when the cost\nfunction belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length\nof each episode, and $\\gamma_K$ is the information gain w.r.t the kernel used\nto approximate cost functions. Our results achieve the optimal dependency on\nthe learning horizon $K$, matching the lower bound we provide in this paper and\ndemonstrating the efficiency of LSVI-AE. Notably, the design of our approach\nencourages aggressive policy exploration, providing a unique perspective on\nsafe RL with general cost functions and no prior knowledge of safe actions,\nwhich may be of independent interest.\n","authors":["Honghao Wei","Xin Liu","Lei Ying"],"pdf_url":"https://arxiv.org/pdf/2312.14470v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.16940v4","updated":"2023-12-22T06:43:24Z","published":"2022-10-30T20:30:19Z","title":"FI-ODE: Certifiably Robust Forward Invariance in Neural ODEs","summary":" Forward invariance is a long-studied property in control theory that is used\nto certify that a dynamical system stays within some pre-specified set of\nstates for all time, and also admits robustness guarantees (e.g., the\ncertificate holds under perturbations). We propose a general framework for\ntraining and provably certifying robust forward invariance in Neural ODEs. We\napply this framework to provide certified safety in robust continuous control.\nTo our knowledge, this is the first instance of training Neural ODE policies\nwith such non-vacuous certified guarantees. In addition, we explore the\ngenerality of our framework by using it to certify adversarial robustness for\nimage classification.\n","authors":["Yujia Huang","Ivan Dario Jimenez Rodriguez","Huan Zhang","Yuanyuan Shi","Yisong Yue"],"pdf_url":"https://arxiv.org/pdf/2210.16940v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06209v2","updated":"2023-12-22T06:43:18Z","published":"2023-05-11T10:05:57Z","title":"Backdoor Attack with Sparse and Invisible Trigger","summary":" Deep neural networks (DNNs) are vulnerable to backdoor attacks, where the\nadversary manipulates a small portion of training data such that the victim\nmodel predicts normally on the benign samples but classifies the triggered\nsamples as the target class. The backdoor attack is an emerging yet threatening\ntraining-phase threat, leading to serious risks in DNN-based applications. In\nthis paper, we revisit the trigger patterns of existing backdoor attacks. We\nreveal that they are either visible or not sparse and therefore are not\nstealthy enough. More importantly, it is not feasible to simply combine\nexisting methods to design an effective sparse and invisible backdoor attack.\nTo address this problem, we formulate the trigger generation as a bi-level\noptimization problem with sparsity and invisibility constraints and propose an\neffective method to solve it. The proposed method is dubbed sparse and\ninvisible backdoor attack (SIBA). We conduct extensive experiments on benchmark\ndatasets under different settings, which verify the effectiveness of our attack\nand its resistance to existing backdoor defenses. The codes for reproducing\nmain experiments are available at \\url{https://github.com/YinghuaGao/SIBA}.\n","authors":["Yinghua Gao","Yiming Li","Xueluan Gong","Zhifeng Li","Shu-Tao Xia","Qian Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06209v2.pdf","comment":"The first two authors contributed equally to this work. 13 pages"},{"id":"http://arxiv.org/abs/2310.13230v3","updated":"2023-12-22T06:40:48Z","published":"2023-10-20T02:40:05Z","title":"Absolute Policy Optimization","summary":" In recent years, trust region on-policy reinforcement learning has achieved\nimpressive results in addressing complex control tasks and gaming scenarios.\nHowever, contemporary state-of-the-art algorithms within this category\nprimarily emphasize improvement in expected performance, lacking the ability to\ncontrol over the worst-case performance outcomes. To address this limitation,\nwe introduce a novel objective function; by optimizing which, it will lead to\nguaranteed monotonic improvement in the lower bound of near-total performance\nsamples (absolute performance). Considering this groundbreaking theoretical\nadvancement, we then refine this theoretically grounded algorithm through a\nseries of approximations, resulting in a practical solution called Absolute\nPolicy Optimization (APO). Our experiments demonstrate the effectiveness of our\napproach across challenging continuous control benchmark tasks and extend its\napplicability to mastering Atari games. Our findings reveal that APO\nsignificantly outperforms state-of-the-art policy gradient algorithms,\nresulting in substantial improvements in both expected performance and\nworst-case performance.\n","authors":["Weiye Zhao","Feihan Li","Yifan Sun","Rui Chen","Tianhao Wei","Changliu Liu"],"pdf_url":"https://arxiv.org/pdf/2310.13230v3.pdf","comment":"submission to Journal of Machine Learning Research"},{"id":"http://arxiv.org/abs/2312.14461v1","updated":"2023-12-22T06:25:46Z","published":"2023-12-22T06:25:46Z","title":"Attacking Byzantine Robust Aggregation in High Dimensions","summary":" Training modern neural networks or models typically requires averaging over a\nsample of high-dimensional vectors. Poisoning attacks can skew or bias the\naverage vectors used to train the model, forcing the model to learn specific\npatterns or avoid learning anything useful. Byzantine robust aggregation is a\nprincipled algorithmic defense against such biasing. Robust aggregators can\nbound the maximum bias in computing centrality statistics, such as mean, even\nwhen some fraction of inputs are arbitrarily corrupted. Designing such\naggregators is challenging when dealing with high dimensions. However, the\nfirst polynomial-time algorithms with strong theoretical bounds on the bias\nhave recently been proposed. Their bounds are independent of the number of\ndimensions, promising a conceptual limit on the power of poisoning attacks in\ntheir ongoing arms race against defenses.\n In this paper, we show a new attack called HIDRA on practical realization of\nstrong defenses which subverts their claim of dimension-independent bias. HIDRA\nhighlights a novel computational bottleneck that has not been a concern of\nprior information-theoretic analysis. Our experimental evaluation shows that\nour attacks almost completely destroy the model performance, whereas existing\nattacks with the same goal fail to have much effect. Our findings leave the\narms race between poisoning attacks and provable defenses wide open.\n","authors":["Sarthak Choudhary","Aashish Kolluri","Prateek Saxena"],"pdf_url":"https://arxiv.org/pdf/2312.14461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14458v1","updated":"2023-12-22T06:15:50Z","published":"2023-12-22T06:15:50Z","title":"Multiagent Copilot Approach for Shared Autonomy between Human EEG and\n TD3 Deep Reinforcement Learning","summary":" Deep reinforcement learning (RL) algorithms enable the development of fully\nautonomous agents that can interact with the environment. Brain-computer\ninterface (BCI) systems decipher human implicit brain signals regardless of the\nexplicit environment. In this study, we integrated deep RL and BCI to improve\nbeneficial human interventions in autonomous systems and the performance in\ndecoding brain activities by considering environmental factors. Shared autonomy\nwas allowed between the action command decoded from the electroencephalography\n(EEG) of the human agent and the action generated from the twin delayed DDPG\n(TD3) agent for a given environment. Our proposed copilot control scheme with a\nfull blocker (Co-FB) significantly outperformed the individual EEG (EEG-NB) or\nTD3 control. The Co-FB model achieved a higher target approaching score, lower\nfailure rate, and lower human workload than the EEG-NB model. The Co-FB control\nscheme had a higher invisible target score and level of allowed human\nintervention than the TD3 model. We also proposed a disparity d-index to\nevaluate the effect of contradicting agent decisions on the control accuracy\nand authority of the copilot model. We found a significant correlation between\nthe control authority of the TD3 agent and the performance improvement of human\nEEG classification with respect to the d-index. We also observed that shifting\ncontrol authority to the TD3 agent improved performance when BCI decoding was\nnot optimal. These findings indicate that the copilot system can effectively\nhandle complex environments and that BCI performance can be improved by\nconsidering environmental factors. Future work should employ continuous action\nspace and different multi-agent approaches to evaluate copilot performance.\n","authors":["Chun-Ren Phang","Akimasa Hirata"],"pdf_url":"https://arxiv.org/pdf/2312.14458v1.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2308.04119v3","updated":"2023-12-22T06:12:20Z","published":"2023-08-08T08:19:43Z","title":"Constructing Custom Thermodynamics Using Deep Learning","summary":" One of the most exciting applications of artificial intelligence (AI) is\nautomated scientific discovery based on previously amassed data, coupled with\nrestrictions provided by known physical principles, including symmetries and\nconservation laws. Such automated hypothesis creation and verification can\nassist scientists in studying complex phenomena, where traditional physical\nintuition may fail. Here we develop a platform based on a generalized Onsager\nprinciple to learn macroscopic dynamical descriptions of arbitrary stochastic\ndissipative systems directly from observations of their microscopic\ntrajectories. Our method simultaneously constructs reduced thermodynamic\ncoordinates and interprets the dynamics on these coordinates. We demonstrate\nits effectiveness by studying theoretically and validating experimentally the\nstretching of long polymer chains in an externally applied field. Specifically,\nwe learn three interpretable thermodynamic coordinates and build a dynamical\nlandscape of polymer stretching, including the identification of stable and\ntransition states and the control of the stretching rate. Our general\nmethodology can be used to address a wide range of scientific and technological\napplications.\n","authors":["Xiaoli Chen","Beatrice W. Soh","Zi-En Ooi","Eleonore Vissol-Gaudin","Haijun Yu","Kostya S. Novoselov","Kedar Hippalgaonkar","Qianxiao Li"],"pdf_url":"https://arxiv.org/pdf/2308.04119v3.pdf","comment":"Fix figure visibility issue"},{"id":"http://arxiv.org/abs/2312.14452v1","updated":"2023-12-22T06:04:09Z","published":"2023-12-22T06:04:09Z","title":"How to Overcome Curse-of-Dimensionality for Out-of-Distribution\n Detection?","summary":" Machine learning models deployed in the wild can be challenged by\nout-of-distribution (OOD) data from unknown classes. Recent advances in OOD\ndetection rely on distance measures to distinguish samples that are relatively\nfar away from the in-distribution (ID) data. Despite the promise,\ndistance-based methods can suffer from the curse-of-dimensionality problem,\nwhich limits the efficacy in high-dimensional feature space. To combat this\nproblem, we propose a novel framework, Subspace Nearest Neighbor (SNN), for OOD\ndetection. In training, our method regularizes the model and its feature\nrepresentation by leveraging the most relevant subset of dimensions (i.e.\nsubspace). Subspace learning yields highly distinguishable distance measures\nbetween ID and OOD data. We provide comprehensive experiments and ablations to\nvalidate the efficacy of SNN. Compared to the current best distance-based\nmethod, SNN reduces the average FPR95 by 15.96% on the CIFAR-100 benchmark.\n","authors":["Soumya Suvra Ghosal","Yiyou Sun","Yixuan Li"],"pdf_url":"https://arxiv.org/pdf/2312.14452v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2301.11997v2","updated":"2023-12-22T05:49:33Z","published":"2023-01-27T21:31:14Z","title":"Prompt-Based Editing for Text Style Transfer","summary":" Prompting approaches have been recently explored in text style transfer,\nwhere a textual prompt is used to query a pretrained language model to generate\nstyle-transferred texts word by word in an autoregressive manner. However, such\na generation process is less controllable and early prediction errors may\naffect future word predictions. In this paper, we present a prompt-based\nediting approach for text style transfer. Specifically, we prompt a pretrained\nlanguage model for style classification and use the classification probability\nto compute a style score. Then, we perform discrete search with word-level\nediting to maximize a comprehensive scoring function for the style-transfer\ntask. In this way, we transform a prompt-based generation problem into a\nclassification one, which is a training-free process and more controllable than\nthe autoregressive generation of sentences. In our experiments, we performed\nboth automatic and human evaluation on three style-transfer benchmark datasets,\nand show that our approach largely outperforms the state-of-the-art systems\nthat have 20 times more parameters. Additional empirical analyses further\ndemonstrate the effectiveness of our approach.\n","authors":["Guoqing Luo","Yu Tong Han","Lili Mou","Mauajama Firdaus"],"pdf_url":"https://arxiv.org/pdf/2301.11997v2.pdf","comment":"Accepted by EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2312.14441v1","updated":"2023-12-22T05:12:30Z","published":"2023-12-22T05:12:30Z","title":"DMC4ML: Data Movement Complexity for Machine Learning","summary":" The greatest demand for today's computing is machine learning. This paper\nanalyzes three machine learning algorithms: transformers, spatial convolution,\nand FFT. The analysis is novel in three aspects. First, it measures the cost of\nmemory access on an abstract memory hierarchy, instead of traditional time or\nspace complexity. Second, the analysis is asymptotic and identifies the primary\nsources of the memory cost. Finally, the result is symbolic, which can be used\nto select algorithmic parameters such as the group size in grouped query\nattention for any dimension size and number of heads and the batch size for\nbatched convolution for any image size and kernel size.\n","authors":["Chen Ding","Christopher Kanan","Dylan McKellips","Toranosuke Ozawa","Arian Shahmirza","Wesley Smith"],"pdf_url":"https://arxiv.org/pdf/2312.14441v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14440v1","updated":"2023-12-22T05:10:32Z","published":"2023-12-22T05:10:32Z","title":"Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks","summary":" The widespread use of Text-to-Image (T2I) models in content generation\nrequires careful examination of their safety, including their robustness to\nadversarial attacks. Despite extensive research into this, the reasons for\ntheir effectiveness are underexplored. This paper presents an empirical study\non adversarial attacks against T2I models, focusing on analyzing factors\nassociated with attack success rates (ASRs). We introduce a new attack\nobjective - entity swapping using adversarial suffixes and two gradient-based\nattack algorithms. Human and automatic evaluations reveal the asymmetric nature\nof ASRs on entity swap: for example, it is easier to replace \"human\" with\n\"robot\" in the prompt \"a human dancing in the rain.\" with an adversarial suffix\nbut is significantly harder in reverse. We further propose probing metrics to\nestablish indicative signals from the model's beliefs to the adversarial ASR.\nWe identify conditions resulting in a 60% success probability for adversarial\nattacks and others where this likelihood drops below 5%.\n","authors":["Haz Sameen Shahgir","Xianghao Kong","Greg Ver Steeg","Yue Dong"],"pdf_url":"https://arxiv.org/pdf/2312.14440v1.pdf","comment":"preprint version"},{"id":"http://arxiv.org/abs/2312.14439v1","updated":"2023-12-22T05:09:58Z","published":"2023-12-22T05:09:58Z","title":"PUMA: Efficient Continual Graph Learning with Graph Condensation","summary":" When handling streaming graphs, existing graph representation learning models\nencounter a catastrophic forgetting problem, where previously learned knowledge\nof these models is easily overwritten when learning with newly incoming graphs.\nIn response, Continual Graph Learning emerges as a novel paradigm enabling\ngraph representation learning from static to streaming graphs. Our prior work,\nCaT is a replay-based framework with a balanced continual learning procedure,\nwhich designs a small yet effective memory bank for replaying data by\ncondensing incoming graphs. Although the CaT alleviates the catastrophic\nforgetting problem, there exist three issues: (1) The graph condensation\nalgorithm derived in CaT only focuses on labelled nodes while neglecting\nabundant information carried by unlabelled nodes; (2) The continual training\nscheme of the CaT overemphasises on the previously learned knowledge, limiting\nthe model capacity to learn from newly added memories; (3) Both the\ncondensation process and replaying process of the CaT are time-consuming. In\nthis paper, we propose a psudo-label guided memory bank (PUMA) CGL framework,\nextending from the CaT to enhance its efficiency and effectiveness by\novercoming the above-mentioned weaknesses and limits. To fully exploit the\ninformation in a graph, PUMA expands the coverage of nodes during graph\ncondensation with both labelled and unlabelled nodes. Furthermore, a\ntraining-from-scratch strategy is proposed to upgrade the previous continual\nlearning scheme for a balanced training between the historical and the new\ngraphs. Besides, PUMA uses a one-time prorogation and wide graph encoders to\naccelerate the graph condensation and the graph encoding process in the\ntraining stage to improve the efficiency of the whole framework. Extensive\nexperiments on four datasets demonstrate the state-of-the-art performance and\nefficiency over existing methods.\n","authors":["Yilun Liu","Ruihong Qiu","Yanran Tang","Hongzhi Yin","Zi Huang"],"pdf_url":"https://arxiv.org/pdf/2312.14439v1.pdf","comment":"The code has been released in https://github.com/superallen13/PUMA.\n arXiv admin note: substantial text overlap with arXiv:2309.09455"},{"id":"http://arxiv.org/abs/2312.14438v1","updated":"2023-12-22T05:04:28Z","published":"2023-12-22T05:04:28Z","title":"PC-Conv: Unifying Homophily and Heterophily with Two-fold Filtering","summary":" Recently, many carefully crafted graph representation learning methods have\nachieved impressive performance on either strong heterophilic or homophilic\ngraphs, but not both. Therefore, they are incapable of generalizing well across\nreal-world graphs with different levels of homophily. This is attributed to\ntheir neglect of homophily in heterophilic graphs, and vice versa. In this\npaper, we propose a two-fold filtering mechanism to extract homophily in\nheterophilic graphs and vice versa. In particular, we extend the graph heat\nequation to perform heterophilic aggregation of global information from a long\ndistance. The resultant filter can be exactly approximated by the\nPossion-Charlier (PC) polynomials. To further exploit information at multiple\norders, we introduce a powerful graph convolution PC-Conv and its instantiation\nPCNet for the node classification task. Compared with state-of-the-art GNNs,\nPCNet shows competitive performance on well-known homophilic and heterophilic\ngraphs. Our implementation is available at https://github.com/uestclbh/PC-Conv.\n","authors":["Bingheng Li","Erlin Pan","Zhao Kang"],"pdf_url":"https://arxiv.org/pdf/2312.14438v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2303.11959v2","updated":"2023-12-22T04:59:00Z","published":"2023-03-15T11:47:57Z","title":"Optimizing Trading Strategies in Quantitative Markets using Multi-Agent\n Reinforcement Learning","summary":" Quantitative markets are characterized by swift dynamics and abundant\nuncertainties, making the pursuit of profit-driven stock trading actions\ninherently challenging. Within this context, reinforcement learning (RL), which\noperates on a reward-centric mechanism for optimal control, has surfaced as a\npotentially effective solution to the intricate financial decision-making\nconundrums presented. This paper delves into the fusion of two established\nfinancial trading strategies, namely the constant proportion portfolio\ninsurance (CPPI) and the time-invariant portfolio protection (TIPP), with the\nmulti-agent deep deterministic policy gradient (MADDPG) framework. As a result,\nwe introduce two novel multi-agent RL (MARL) methods, CPPI-MADDPG and\nTIPP-MADDPG, tailored for probing strategic trading within quantitative\nmarkets. To validate these innovations, we implemented them on a diverse\nselection of 100 real-market shares. Our empirical findings reveal that the\nCPPI-MADDPG and TIPP-MADDPG strategies consistently outpace their traditional\ncounterparts, affirming their efficacy in the realm of quantitative trading.\n","authors":["Hengxi Zhang","Zhendong Shi","Yuanquan Hu","Wenbo Ding","Ercan E. Kuruoglu","Xiao-Ping Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.11959v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14436v1","updated":"2023-12-22T04:56:37Z","published":"2023-12-22T04:56:37Z","title":"REBEL: A Regularization-Based Solution for Reward Overoptimization in\n Reinforcement Learning from Human Feedback","summary":" In this work, we propose REBEL, an algorithm for sample efficient reward\nregularization based robotic reinforcement learning from human feedback\n(RRLHF). Reinforcement learning (RL) performance for continuous control\nrobotics tasks is sensitive to the underlying reward function. In practice, the\nreward function often ends up misaligned with human intent, values, social\nnorms, etc., leading to catastrophic failures in the real world. We leverage\nhuman preferences to learn regularized reward functions and eventually align\nthe agents with the true intended behavior. We introduce a novel notion of\nreward regularization to the existing RRLHF framework, which is termed as agent\npreferences. So, we not only consider human feedback in terms of preferences,\nwe also propose to take into account the preference of the underlying RL agent\nwhile learning the reward function. We show that this helps to improve the\nover-optimization associated with the design of reward functions in RL. We\nexperimentally show that REBEL exhibits up to 70% improvement in sample\nefficiency to achieve a similar level of episodic reward returns as compared to\nthe state-of-the-art methods such as PEBBLE and PEBBLE+SURF.\n","authors":["Souradip Chakraborty","Amisha Bhaskar","Anukriti Singh","Pratap Tokekar","Dinesh Manocha","Amrit Singh Bedi"],"pdf_url":"https://arxiv.org/pdf/2312.14436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14432v1","updated":"2023-12-22T04:41:31Z","published":"2023-12-22T04:41:31Z","title":"Scalable 3D Reconstruction From Single Particle X-Ray Diffraction Images\n Based on Online Machine Learning","summary":" X-ray free-electron lasers (XFELs) offer unique capabilities for measuring\nthe structure and dynamics of biomolecules, helping us understand the basic\nbuilding blocks of life. Notably, high-repetition-rate XFELs enable single\nparticle imaging (X-ray SPI) where individual, weakly scattering biomolecules\nare imaged under near-physiological conditions with the opportunity to access\nfleeting states that cannot be captured in cryogenic or crystallized\nconditions. Existing X-ray SPI reconstruction algorithms, which estimate the\nunknown orientation of a particle in each captured image as well as its shared\n3D structure, are inadequate in handling the massive datasets generated by\nthese emerging XFELs. Here, we introduce X-RAI, an online reconstruction\nframework that estimates the structure of a 3D macromolecule from large X-ray\nSPI datasets. X-RAI consists of a convolutional encoder, which amortizes pose\nestimation over large datasets, as well as a physics-based decoder, which\nemploys an implicit neural representation to enable high-quality 3D\nreconstruction in an end-to-end, self-supervised manner. We demonstrate that\nX-RAI achieves state-of-the-art performance for small-scale datasets in\nsimulation and challenging experimental settings and demonstrate its\nunprecedented ability to process large datasets containing millions of\ndiffraction images in an online fashion. These abilities signify a paradigm\nshift in X-ray SPI towards real-time capture and reconstruction.\n","authors":["Jay Shenoy","Axel Levy","Frédéric Poitevin","Gordon Wetzstein"],"pdf_url":"https://arxiv.org/pdf/2312.14432v1.pdf","comment":"Project page: http://jayshenoy.com/xrai"},{"id":"http://arxiv.org/abs/2310.05707v2","updated":"2023-12-22T04:31:49Z","published":"2023-10-09T13:29:37Z","title":"Guiding Language Model Reasoning with Planning Tokens","summary":" Large language models (LLMs) have recently attracted considerable interest\nfor their ability to perform complex reasoning tasks, such as chain-of-thought\nreasoning. However, most of the existing approaches to enhance this ability\nrely heavily on data-driven methods, while neglecting the structural aspects of\nthe model's reasoning capacity. We find that while LLMs can manage individual\nreasoning steps well, they struggle with maintaining consistency across an\nentire reasoning chain. To solve this, we introduce 'planning tokens' at the\nstart of each reasoning step, serving as a guide for the model. These token\nembeddings are then fine-tuned along with the rest of the model parameters. Our\napproach requires a negligible increase in trainable parameters (just 0.001%)\nand can be applied through either full fine-tuning or a more\nparameter-efficient scheme. We demonstrate our method's effectiveness by\napplying it to three different LLMs, showing notable accuracy improvements\nacross three math word problem datasets w.r.t. plain chain-of-thought\nfine-tuning baselines.\n","authors":["Xinyi Wang","Lucas Caccia","Oleksiy Ostapenko","Xingdi Yuan","Alessandro Sordoni"],"pdf_url":"https://arxiv.org/pdf/2310.05707v2.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.14428v1","updated":"2023-12-22T04:30:27Z","published":"2023-12-22T04:30:27Z","title":"A Unified Industrial Large Knowledge Model Framework in Smart\n Manufacturing","summary":" The recent emergence of large language models (LLMs) shows the potential for\nartificial general intelligence, revealing new opportunities in industry 4.0\nand smart manufacturing. However, a notable gap exists in applying these LLMs\nin industry, primarily due to their training on general knowledge rather than\ndomain-specific knowledge. Such specialized domain knowledge is vital for\neffectively addressing the complex needs of industrial applications. To bridge\nthis gap, this paper proposes an Industrial Large Knowledge Model (ILKM)\nframework emphasizing their potential to revolutionize the industry in smart\nmanufacturing. In addition, ILKMs and LLMs are compared from eight\nperspectives. Finally, \"6S Principle\" is proposed as the guideline for the\ndevelopment of ILKMs in smart manufacturing.\n","authors":["Jay Lee","Hanqi Su"],"pdf_url":"https://arxiv.org/pdf/2312.14428v1.pdf","comment":"The paper has been submitted to Manufacturing Letters (Under Review)"},{"id":"http://arxiv.org/abs/2312.14426v1","updated":"2023-12-22T04:16:34Z","published":"2023-12-22T04:16:34Z","title":"Room Occupancy Prediction: Exploring the Power of Machine Learning and\n Temporal Insights","summary":" Energy conservation in buildings is a paramount concern to combat greenhouse\ngas emissions and combat climate change. The efficient management of room\noccupancy, involving actions like lighting control and climate adjustment, is a\npivotal strategy to curtail energy consumption. In contexts where surveillance\ntechnology isn't viable, non-intrusive sensors are employed to estimate room\noccupancy. In this study, we present a predictive framework for room occupancy\nthat leverages a diverse set of machine learning models, with Random Forest\nconsistently achieving the highest predictive accuracy. Notably, this dataset\nencompasses both temporal and spatial dimensions, revealing a wealth of\ninformation. Intriguingly, our framework demonstrates robust performance even\nin the absence of explicit temporal modeling. These findings underscore the\nremarkable predictive power of traditional machine learning models. The success\ncan be attributed to the presence of feature redundancy, the simplicity of\nlinear spatial and temporal patterns, and the advantages of high-frequency data\nsampling. While these results are compelling, it's essential to remain open to\nthe possibility that explicitly modeling the temporal dimension could unlock\ndeeper insights or further enhance predictive capabilities in specific\nscenarios. In summary, our research not only validates the effectiveness of our\nprediction framework for continuous and classification tasks but also\nunderscores the potential for improvements through the inclusion of temporal\naspects. The study highlights the promise of machine learning in shaping\nenergy-efficient practices and room occupancy management.\n","authors":["Siqi Mao","Yaping Yuan","Yinpu Li","Ziren Wang","Yuanxin Yao","Yixin Kang"],"pdf_url":"https://arxiv.org/pdf/2312.14426v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14418v1","updated":"2023-12-22T03:52:17Z","published":"2023-12-22T03:52:17Z","title":"Sharp error estimates for target measure diffusion maps with\n applications to the committor problem","summary":" We obtain asymptotically sharp error estimates for the consistency error of\nthe Target Measure Diffusion map (TMDmap) (Banisch et al. 2020), a variant of\ndiffusion maps featuring importance sampling and hence allowing input data\ndrawn from an arbitrary density. The derived error estimates include the bias\nerror and the variance error. The resulting convergence rates are consistent\nwith the approximation theory of graph Laplacians. The key novelty of our\nresults lies in the explicit quantification of all the prefactors on\nleading-order terms. We also prove an error estimate for solutions of Dirichlet\nBVPs obtained using TMDmap, showing that the solution error is controlled by\nconsistency error. We use these results to study an important application of\nTMDmap in the analysis of rare events in systems governed by overdamped\nLangevin dynamics using the framework of transition path theory (TPT). The\ncornerstone ingredient of TPT is the solution of the committor problem, a\nboundary value problem for the backward Kolmogorov PDE. Remarkably, we find\nthat the TMDmap algorithm is particularly suited as a meshless solver to the\ncommittor problem due to the cancellation of several error terms in the\nprefactor formula. Furthermore, significant improvements in bias and variance\nerrors occur when using a quasi-uniform sampling density. Our numerical\nexperiments show that these improvements in accuracy are realizable in practice\nwhen using $\\delta$-nets as spatially uniform inputs to the TMDmap algorithm.\n","authors":["Shashank Sule","Luke Evans","Maria Cameron"],"pdf_url":"https://arxiv.org/pdf/2312.14418v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14406v1","updated":"2023-12-22T03:15:17Z","published":"2023-12-22T03:15:17Z","title":"Generative Pretraining at Scale: Transformer-Based Encoding of\n Transactional Behavior for Fraud Detection","summary":" In this work, we introduce an innovative autoregressive model leveraging\nGenerative Pretrained Transformer (GPT) architectures, tailored for fraud\ndetection in payment systems. Our approach innovatively confronts token\nexplosion and reconstructs behavioral sequences, providing a nuanced\nunderstanding of transactional behavior through temporal and contextual\nanalysis. Utilizing unsupervised pretraining, our model excels in feature\nrepresentation without the need for labeled data. Additionally, we integrate a\ndifferential convolutional approach to enhance anomaly detection, bolstering\nthe security and efficacy of one of the largest online payment merchants in\nChina. The scalability and adaptability of our model promise broad\napplicability in various transactional contexts.\n","authors":["Ze Yu Zhao","Zheng Zhu","Guilin Li","Wenhan Wang","Bo Wang"],"pdf_url":"https://arxiv.org/pdf/2312.14406v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14405v1","updated":"2023-12-22T03:10:59Z","published":"2023-12-22T03:10:59Z","title":"Graph Attention-Based Symmetry Constraint Extraction for Analog Circuits","summary":" In recent years, analog circuits have received extensive attention and are\nwidely used in many emerging applications. The high demand for analog circuits\nnecessitates shorter circuit design cycles. To achieve the desired performance\nand specifications, various geometrical symmetry constraints must be carefully\nconsidered during the analog layout process. However, the manual labeling of\nthese constraints by experienced analog engineers is a laborious and\ntime-consuming process. To handle the costly runtime issue, we propose a\ngraph-based learning framework to automatically extract symmetric constraints\nin analog circuit layout. The proposed framework leverages the connection\ncharacteristics of circuits and the devices'information to learn the general\nrules of symmetric constraints, which effectively facilitates the extraction of\ndevice-level constraints on circuit netlists. The experimental results\ndemonstrate that compared to state-of-the-art symmetric constraint detection\napproaches, our framework achieves higher accuracy and lower false positive\nrate.\n","authors":["Qi Xu","Lijie Wang","Jing Wang","Song Chen","Lin Cheng","Yi Kang"],"pdf_url":"https://arxiv.org/pdf/2312.14405v1.pdf","comment":"9 pages,9 figures,3 tables, 1 algorithm"},{"id":"http://arxiv.org/abs/2312.14385v1","updated":"2023-12-22T02:21:26Z","published":"2023-12-22T02:21:26Z","title":"Generative AI Beyond LLMs: System Implications of Multi-Modal Generation","summary":" As the development of large-scale Generative AI models evolve beyond text\n(1D) generation to include image (2D) and video (3D) generation, processing\nspatial and temporal information presents unique challenges to quality,\nperformance, and efficiency. We present the first work towards understanding\nthis new system design space for multi-modal text-to-image (TTI) and\ntext-to-video (TTV) generation models. Current model architecture designs are\nbifurcated into 2 categories: Diffusion- and Transformer-based models. Our\nsystematic performance characterization on a suite of eight representative\nTTI/TTV models shows that after state-of-the-art optimization techniques such\nas Flash Attention are applied, Convolution accounts for up to 44% of execution\ntime for Diffusion-based TTI models, while Linear layers consume up to 49% of\nexecution time for Transformer-based models. We additionally observe that\nDiffusion-based TTI models resemble the Prefill stage of LLM inference, and\nbenefit from 1.1-2.5x greater speedup from Flash Attention than\nTransformer-based TTI models that resemble the Decode phase. Since\noptimizations designed for LLMs do not map directly onto TTI/TTV models, we\nmust conduct a thorough characterization of these workloads to gain insights\nfor new optimization opportunities. In doing so, we define sequence length in\nthe context of TTI/TTV models and observe sequence length can vary up to 4x in\nDiffusion model inference. We additionally observe temporal aspects of TTV\nworkloads pose unique system bottlenecks, with Temporal Attention accounting\nfor over 60% of total Attention time. Overall, our in-depth system performance\ncharacterization is a critical first step towards designing efficient and\ndeployable systems for emerging TTI/TTV workloads.\n","authors":["Alicia Golden","Samuel Hsia","Fei Sun","Bilge Acun","Basil Hosmer","Yejin Lee","Zachary DeVito","Jeff Johnson","Gu-Yeon Wei","David Brooks","Carole-Jean Wu"],"pdf_url":"https://arxiv.org/pdf/2312.14385v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.09619v2","updated":"2023-12-22T02:14:19Z","published":"2023-07-18T20:27:45Z","title":"Towards Federated Foundation Models: Scalable Dataset Pipelines for\n Group-Structured Learning","summary":" We introduce Dataset Grouper, a library to create large-scale\ngroup-structured (e.g., federated) datasets, enabling federated learning\nsimulation at the scale of foundation models. This library facilitates the\ncreation of group-structured versions of existing datasets based on\nuser-specified partitions and directly leads to a variety of useful\nheterogeneous datasets that can be plugged into existing software frameworks.\nDataset Grouper offers three key advantages. First, it scales to settings where\neven a single group's dataset is too large to fit in memory. Second, it\nprovides flexibility, both in choosing the base (non-partitioned) dataset and\nin defining partitions. Finally, it is framework-agnostic. We empirically\ndemonstrate that Dataset Grouper enables large-scale federated language\nmodeling simulations on datasets that are orders of magnitude larger than in\nprevious work, allowing for federated training of language models with hundreds\nof millions, and even billions, of parameters. Our experimental results show\nthat algorithms like FedAvg operate more as meta-learning methods than as\nempirical risk minimization methods at this scale, suggesting their utility in\ndownstream personalization and task-specific adaptation. Dataset Grouper is\navailable at https://github.com/google-research/dataset_grouper.\n","authors":["Zachary Charles","Nicole Mitchell","Krishna Pillutla","Michael Reneer","Zachary Garrett"],"pdf_url":"https://arxiv.org/pdf/2307.09619v2.pdf","comment":"Dataset Grouper is available at\n https://github.com/google-research/dataset_grouper"},{"id":"http://arxiv.org/abs/2312.14380v1","updated":"2023-12-22T02:12:08Z","published":"2023-12-22T02:12:08Z","title":"Federated Learning with Projected Trajectory Regularization","summary":" Federated learning enables joint training of machine learning models from\ndistributed clients without sharing their local data. One key challenge in\nfederated learning is to handle non-identically distributed data across the\nclients, which leads to deteriorated model training performances. Prior works\nin this line of research mainly focus on utilizing last-step global model\nparameters/gradients or the linear combinations of the past model\nparameters/gradients, which do not fully exploit the potential of global\ninformation from the model training trajectory. In this paper, we propose a\nnovel federated learning framework with projected trajectory regularization\n(FedPTR) for tackling the data heterogeneity issue, which proposes a unique way\nto better extract the essential global information from the model training\ntrajectory. Specifically, FedPTR allows local clients or the server to optimize\nan auxiliary (synthetic) dataset that mimics the learning dynamics of the\nrecent model update and utilizes it to project the next-step model trajectory\nfor local training regularization. We conduct rigorous theoretical analysis for\nour proposed framework under nonconvex stochastic settings to verify its fast\nconvergence under heterogeneous data distributions. Experiments on various\nbenchmark datasets and non-i.i.d. settings validate the effectiveness of our\nproposed framework.\n","authors":["Tiejin Chen","Yuanpu Cao","Yujia Wang","Cho-Jui Hsieh","Jinghui Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14380v1.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2312.14378v1","updated":"2023-12-22T02:08:40Z","published":"2023-12-22T02:08:40Z","title":"Multimodal Attention Merging for Improved Speech Recognition and Audio\n Event Classification","summary":" Training large foundation models using self-supervised objectives on\nunlabeled data, followed by fine-tuning on downstream tasks, has emerged as a\nstandard procedure. Unfortunately, the efficacy of this approach is often\nconstrained by both limited fine-tuning compute and scarcity in labeled\ndownstream data. We introduce Multimodal Attention Merging (MAM), an attempt\nthat facilitates direct knowledge transfer from attention matrices of models\nrooted in high resource modalities, text and images, to those in\nresource-constrained domains, speech and audio, employing a zero-shot paradigm.\nMAM reduces the relative Word Error Rate (WER) of an Automatic Speech\nRecognition (ASR) model by up to 6.70%, and relative classification error of an\nAudio Event Classification (AEC) model by 10.63%. In cases where some\ndata/compute is available, we present Learnable-MAM, a data-driven approach to\nmerging attention matrices, resulting in a further 2.90% relative reduction in\nWER for ASR and 18.42% relative reduction in AEC compared to fine-tuning.\n","authors":["Anirudh S. Sundar","Chao-Han Huck Yang","David M. Chan","Shalini Ghosh","Venkatesh Ravichandran","Phani Sankar Nidadavolu"],"pdf_url":"https://arxiv.org/pdf/2312.14378v1.pdf","comment":"5 pages, 1 figure"},{"id":"http://arxiv.org/abs/2312.13091v2","updated":"2023-12-22T02:06:32Z","published":"2023-12-20T15:12:53Z","title":"MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using\n Differentiable Shading","summary":" Reconstructing an avatar from a portrait image has many applications in\nmultimedia, but remains a challenging research problem. Extracting reflectance\nmaps and geometry from one image is ill-posed: recovering geometry is a\none-to-many mapping problem and reflectance and light are difficult to\ndisentangle. Accurate geometry and reflectance can be captured under the\ncontrolled conditions of a light stage, but it is costly to acquire large\ndatasets in this fashion. Moreover, training solely with this type of data\nleads to poor generalization with in-the-wild images. This motivates the\nintroduction of MoSAR, a method for 3D avatar generation from monocular images.\nWe propose a semi-supervised training scheme that improves generalization by\nlearning from both light stage and in-the-wild datasets. This is achieved using\na novel differentiable shading formulation. We show that our approach\neffectively disentangles the intrinsic face parameters, producing relightable\navatars. As a result, MoSAR estimates a richer set of skin reflectance maps,\nand generates more realistic avatars than existing state-of-the-art methods. We\nalso introduce a new dataset, named FFHQ-UV-Intrinsics, the first public\ndataset providing intrinsic face attributes at scale (diffuse, specular,\nambient occlusion and translucency maps) for a total of 10k subjects. The\nproject website and the dataset are available on the following link:\nhttps://ubisoft-laforge.github.io/character/mosar/\n","authors":["Abdallah Dib","Luiz Gustavo Hafemann","Emeline Got","Trevor Anderson","Amin Fadaeinejad","Rafael M. O. Cruz","Marc-Andre Carbonneau"],"pdf_url":"https://arxiv.org/pdf/2312.13091v2.pdf","comment":"https://ubisoft-laforge.github.io/character/mosar/"},{"id":"http://arxiv.org/abs/2312.14372v1","updated":"2023-12-22T01:47:16Z","published":"2023-12-22T01:47:16Z","title":"Generative Models for Simulation of KamLAND-Zen","summary":" The next generation of searches for neutrinoless double beta decay\n(0{\\nu}\\b{eta}\\b{eta}) are poised to answer deep questions on the nature of\nneutrinos and the source of the Universe's matter-antimatter asymmetry. They\nwill be looking for event rates of less than one event per ton of instrumented\nisotope per year. To claim discovery, accurate and efficient simulations of\ndetector events that mimic 0{\\nu}\\b{eta}\\b{eta} is critical. Traditional Monte\nCarlo (MC) simulations can be supplemented by machine-learning-based generative\nmodels. In this work, we describe the performance of generative models designed\nfor monolithic liquid scintillator detectors like KamLAND to produce highly\naccurate simulation data without a predefined physics model. We demonstrate its\nability to recover low-level features and perform interpolation. In the future,\nthe results of these generative models can be used to improve event\nclassification and background rejection by providing high-quality abundant\ngenerated data.\n","authors":["Z. Fu","C. Grant","D. M. Krawiec","A. Li","L. Winslow"],"pdf_url":"https://arxiv.org/pdf/2312.14372v1.pdf","comment":"Submitted to EPJC"},{"id":"http://arxiv.org/abs/2312.14369v1","updated":"2023-12-22T01:43:27Z","published":"2023-12-22T01:43:27Z","title":"Quality-Diversity Generative Sampling for Learning with Synthetic Data","summary":" Generative models can serve as surrogates for some real data sources by\ncreating synthetic training datasets, but in doing so they may transfer biases\nto downstream tasks. We focus on protecting quality and diversity when\ngenerating synthetic training datasets. We propose quality-diversity generative\nsampling (QDGS), a framework for sampling data uniformly across a user-defined\nmeasure space, despite the data coming from a biased generator. QDGS is a\nmodel-agnostic framework that uses prompt guidance to optimize a quality\nobjective across measures of diversity for synthetically generated data,\nwithout fine-tuning the generative model. Using balanced synthetic datasets\ngenerated by QDGS, we first debias classifiers trained on color-biased shape\ndatasets as a proof-of-concept. By applying QDGS to facial data synthesis, we\nprompt for desired semantic concepts, such as skin tone and age, to create an\nintersectional dataset with a combined blend of visual features. Leveraging\nthis balanced data for training classifiers improves fairness while maintaining\naccuracy on facial recognition benchmarks. Code available at:\nhttps://github.com/Cylumn/qd-generative-sampling\n","authors":["Allen Chang","Matthew C. Fontaine","Serena Booth","Maja J. Matarić","Stefanos Nikolaidis"],"pdf_url":"https://arxiv.org/pdf/2312.14369v1.pdf","comment":"Accepted at AAAI 2024; 7 pages main, 12 pages total, 9 figures"},{"id":"http://arxiv.org/abs/2312.10303v2","updated":"2023-12-22T01:40:28Z","published":"2023-12-16T03:35:56Z","title":"Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints","summary":" Restless multi-armed bandits (RMAB) have been widely used to model sequential\ndecision making problems with constraints. The decision maker (DM) aims to\nmaximize the expected total reward over an infinite horizon under an\n\"instantaneous activation constraint\" that at most B arms can be activated at\nany decision epoch, where the state of each arm evolves stochastically\naccording to a Markov decision process (MDP). However, this basic model fails\nto provide any fairness guarantee among arms. In this paper, we introduce\nRMAB-F, a new RMAB model with \"long-term fairness constraints\", where the\nobjective now is to maximize the long term reward while a minimum long-term\nactivation fraction for each arm must be satisfied. For the online RMAB-F\nsetting (i.e., the underlying MDPs associated with each arm are unknown to the\nDM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL.\nWe prove that Fair-UCRL ensures probabilistic sublinear bounds on both the\nreward regret and the fairness violation regret. Compared with off-the-shelf RL\nmethods, our Fair-UCRL is much more computationally efficient since it contains\na novel exploitation that leverages a low-complexity index policy for making\ndecisions. Experimental results further demonstrate the effectiveness of our\nFair-UCRL.\n","authors":["Shufan Wang","Guojun Xiong","Jian Li"],"pdf_url":"https://arxiv.org/pdf/2312.10303v2.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2209.11899v2","updated":"2023-12-22T01:21:03Z","published":"2022-09-24T01:16:51Z","title":"Two Bicomplex and One Multicomplex Least Mean Square algorithms","summary":" We study and introduce new gradient operators in the complex and bicomplex\nsettings, inspired from the well-known Least Mean Square (LMS) algorithm\ninvented in 1960 by Widrow and Hoff for Adaptive Linear Neuron (ADALINE).\n These gradient operators will be used to formulate new learning rules for the\nBicomplex Least Mean Square (BLMS) algorithms and we will also formulate these\nlearning rules will for the case of multicomplex LMS algorithms (MLMS). This\napproach extends both the classical real and complex LMS algorithms.\n","authors":["Daniel Alpay","Kamal Diki","Mihaela Vajiac"],"pdf_url":"https://arxiv.org/pdf/2209.11899v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14359v1","updated":"2023-12-22T01:19:08Z","published":"2023-12-22T01:19:08Z","title":"Training Neural Networks with Internal State, Unconstrained\n Connectivity, and Discrete Activations","summary":" Today's most powerful machine learning approaches are typically designed to\ntrain stateless architectures with predefined layers and differentiable\nactivation functions. While these approaches have led to unprecedented\nsuccesses in areas such as natural language processing and image recognition,\nthe trained models are also susceptible to making mistakes that a human would\nnot. In this paper, we take the view that true intelligence may require the\nability of a machine learning model to manage internal state, but that we have\nnot yet discovered the most effective algorithms for training such models. We\nfurther postulate that such algorithms might not necessarily be based on\ngradient descent over a deep architecture, but rather, might work best with an\narchitecture that has discrete activations and few initial topological\nconstraints (such as multiple predefined layers). We present one attempt in our\nongoing efforts to design such a training algorithm, applied to an architecture\nwith binary activations and only a single matrix of weights, and show that it\nis able to form useful representations of natural language text, but is also\nlimited in its ability to leverage large quantities of training data. We then\nprovide ideas for improving the algorithm and for designing other training\nalgorithms for similar architectures. Finally, we discuss potential benefits\nthat could be gained if an effective training algorithm is found, and suggest\nexperiments for evaluating whether these benefits exist in practice.\n","authors":["Alexander Grushin"],"pdf_url":"https://arxiv.org/pdf/2312.14359v1.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2309.01108v3","updated":"2023-12-22T00:34:56Z","published":"2023-09-03T07:44:38Z","title":"Acoustic-to-articulatory inversion for dysarthric speech: Are\n pre-trained self-supervised representations favorable?","summary":" Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic\nto the articulatory space. Signal-processing features like the MFCCs, have been\nwidely used for the AAI task. For subjects with dysarthric speech, AAI is\nchallenging because of an imprecise and indistinct pronunciation. In this work,\nwe perform AAI for dysarthric speech using representations from pre-trained\nself-supervised learning (SSL) models. We demonstrate the impact of different\npre-trained features on this challenging AAI task, at low-resource conditions.\nIn addition, we also condition x-vectors to the extracted SSL features to train\na BLSTM network. In the seen case, we experiment with three AAI training\nschemes (subject-specific, pooled, and fine-tuned). The results, consistent\nacross training schemes, reveal that DeCoAR, in the fine-tuned scheme, achieves\na relative improvement of the Pearson Correlation Coefficient (CC) by ~1.81%\nand ~4.56% for healthy controls and patients, respectively, over MFCCs. We\nobserve similar average trends for different SSL features in the unseen case.\nOverall, SSL networks like wav2vec, APC, and DeCoAR, trained with feature\nreconstruction or future timestep prediction tasks, perform well in predicting\ndysarthric articulatory trajectories.\n","authors":["Sarthak Kumar Maharana","Krishna Kamal Adidam","Shoumik Nandi","Ajitesh Srivastava"],"pdf_url":"https://arxiv.org/pdf/2309.01108v3.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2307.16184v2","updated":"2023-12-22T18:07:41Z","published":"2023-07-30T09:48:36Z","title":"UnIVAL: Unified Model for Image, Video, Audio and Language Tasks","summary":" Large Language Models (LLMs) have made the ambitious quest for generalist\nagents significantly far from being a fantasy. A key hurdle for building such\ngeneral models is the diversity and heterogeneity of tasks and modalities. A\npromising solution is unification, allowing the support of a myriad of tasks\nand modalities within one unified framework. While few large models (e.g.,\nFlamingo (Alayrac et al., 2022), trained on massive datasets, can support more\nthan two modalities, current small to mid-scale unified models are still\nlimited to 2 modalities, usually image-text or video-text. The question that we\nask is: is it possible to build efficiently a unified model that can support\nall modalities? To answer this, we propose UnIVAL, a step further towards this\nambitious goal. Without relying on fancy datasets sizes or models with billions\nof parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities\nand unifies text, images, video, and audio into a single model. Our model is\nefficiently pretrained on many tasks, based on task balancing and multimodal\ncurriculum learning. UnIVAL shows competitive performance to existing\nstate-of-the-art approaches, across image and video-text tasks. The feature\nrepresentations learned from image and video-text modalities, allows the model\nto achieve competitive performance when finetuned on audio-text tasks, despite\nnot being pretrained on audio. Thanks to the unified model, we propose a novel\nstudy on multimodal model merging via weight interpolation of models trained on\ndifferent multimodal tasks, showing their benefits in particular for\nout-of-distribution generalization. Finally, we motivate unification by showing\nthe synergy between tasks. The model weights and code are released here:\nhttps://github.com/mshukor/UnIVAL.\n","authors":["Mustafa Shukor","Corentin Dancette","Alexandre Rame","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2307.16184v2.pdf","comment":"Accepted at TMLR 2023. 40 pages. Project page:\n https://unival-model.github.io/"},{"id":"http://arxiv.org/abs/2312.14867v1","updated":"2023-12-22T17:45:19Z","published":"2023-12-22T17:45:19Z","title":"VIEScore: Towards Explainable Metrics for Conditional Image Synthesis\n Evaluation","summary":" In the rapidly advancing field of conditional image generation research,\nchallenges such as limited explainability lie in effectively evaluating the\nperformance and capabilities of various models. This paper introduces VIESCORE,\na Visual Instruction-guided Explainable metric for evaluating any conditional\nimage generation tasks. VIESCORE leverages general knowledge from Multimodal\nLarge Language Models (MLLMs) as the backbone and does not require training or\nfine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image\ntasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of\n0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)\nVIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in\nevaluating synthetic images. (3) VIESCORE achieves a correlation on par with\nhuman ratings in the generation tasks but struggles in editing tasks. With\nthese results, we believe VIESCORE shows its great potential to replace human\njudges in evaluating image synthesis tasks.\n","authors":["Max Ku","Dongfu Jiang","Cong Wei","Xiang Yue","Wenhu Chen"],"pdf_url":"https://arxiv.org/pdf/2312.14867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.06978v4","updated":"2023-12-22T14:16:59Z","published":"2023-09-13T14:13:08Z","title":"Differentiable JPEG: The Devil is in the Details","summary":" JPEG remains one of the most widespread lossy image coding methods. However,\nthe non-differentiable nature of JPEG restricts the application in deep\nlearning pipelines. Several differentiable approximations of JPEG have recently\nbeen proposed to address this issue. This paper conducts a comprehensive review\nof existing diff. JPEG approaches and identifies critical details that have\nbeen missed by previous methods. To this end, we propose a novel diff. JPEG\napproach, overcoming previous limitations. Our approach is differentiable\nw.r.t. the input image, the JPEG quality, the quantization tables, and the\ncolor conversion parameters. We evaluate the forward and backward performance\nof our diff. JPEG approach against existing methods. Additionally, extensive\nablations are performed to evaluate crucial design choices. Our proposed diff.\nJPEG resembles the (non-diff.) reference implementation best, significantly\nsurpassing the recent-best diff. approach by $3.47$dB (PSNR) on average. For\nstrong compression rates, we can even improve PSNR by $9.51$dB. Strong\nadversarial attack results are yielded by our diff. JPEG, demonstrating the\neffective gradient approximation. Our code is available at\nhttps://github.com/necla-ml/Diff-JPEG.\n","authors":["Christoph Reich","Biplob Debnath","Deep Patel","Srimat Chakradhar"],"pdf_url":"https://arxiv.org/pdf/2309.06978v4.pdf","comment":"Accepted at WACV 2024. Project page:\n https://christophreich1996.github.io/differentiable_jpeg/ WACV paper:\n https://openaccess.thecvf.com/content/WACV2024/html/Reich_Differentiable_JPEG_The_Devil_Is_in_the_Details_WACV_2024_paper.html"},{"id":"http://arxiv.org/abs/2312.14667v1","updated":"2023-12-22T13:03:23Z","published":"2023-12-22T13:03:23Z","title":"Token-Level Contrastive Learning with Modality-Aware Prompting for\n Multimodal Intent Recognition","summary":" Multimodal intent recognition aims to leverage diverse modalities such as\nexpressions, body movements and tone of speech to comprehend user's intent,\nconstituting a critical task for understanding human language and behavior in\nreal-world multimodal scenarios. Nevertheless, the majority of existing methods\nignore potential correlations among different modalities and own limitations in\neffectively learning semantic features from nonverbal modalities. In this\npaper, we introduce a token-level contrastive learning method with\nmodality-aware prompting (TCL-MAP) to address the above challenges. To\nestablish an optimal multimodal semantic environment for text modality, we\ndevelop a modality-aware prompting module (MAP), which effectively aligns and\nfuses features from text, video and audio modalities with similarity-based\nmodality alignment and cross-modality attention mechanism. Based on the\nmodality-aware prompt and ground truth labels, the proposed token-level\ncontrastive learning framework (TCL) constructs augmented samples and employs\nNT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal\ntextual semantic insights derived from intent labels to guide the learning\nprocesses of other modalities in return. Extensive experiments show that our\nmethod achieves remarkable improvements compared to state-of-the-art methods.\nAdditionally, ablation analyses demonstrate the superiority of the\nmodality-aware prompt over the handcrafted prompt, which holds substantial\nsignificance for multimodal prompt learning. The codes are released at\nhttps://github.com/thuiar/TCL-MAP.\n","authors":["Qianrui Zhou","Hua Xu","Hao Li","Hanlei Zhang","Xiaohan Zhang","Yifan Wang","Kai Gao"],"pdf_url":"https://arxiv.org/pdf/2312.14667v1.pdf","comment":"Accepted by AAAI 2024 (Main Track, Long Paper)"},{"id":"http://arxiv.org/abs/2312.14433v1","updated":"2023-12-22T04:46:21Z","published":"2023-12-22T04:46:21Z","title":"Attribute-driven Disentangled Representation Learning for Multimodal\n Recommendation","summary":" Recommendation algorithms forecast user preferences by correlating user and\nitem representations derived from historical interaction patterns. In pursuit\nof enhanced performance, many methods focus on learning robust and independent\nrepresentations by disentangling the intricate factors within interaction data\nacross various modalities in an unsupervised manner. However, such an approach\nobfuscates the discernment of how specific factors (e.g., category or brand)\ninfluence the outcomes, making it challenging to regulate their effects. In\nresponse to this challenge, we introduce a novel method called Attribute-Driven\nDisentangled Representation Learning (short for AD-DRL), which explicitly\nincorporates attributes from different modalities into the disentangled\nrepresentation learning process. By assigning a specific attribute to each\nfactor in multimodal features, AD-DRL can disentangle the factors at both\nattribute and attribute-value levels. To obtain robust and independent\nrepresentations for each factor associated with a specific attribute, we first\ndisentangle the representations of features both within and across different\nmodalities. Moreover, we further enhance the robustness of the representations\nby fusing the multimodal features of the same factor. Empirical evaluations\nconducted on three public real-world datasets substantiate the effectiveness of\nAD-DRL, as well as its interpretability and controllability.\n","authors":["Zhenyang Li","Fan Liu","Yinwei Wei","Zhiyong Cheng","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2312.14433v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14385v1","updated":"2023-12-22T02:21:26Z","published":"2023-12-22T02:21:26Z","title":"Generative AI Beyond LLMs: System Implications of Multi-Modal Generation","summary":" As the development of large-scale Generative AI models evolve beyond text\n(1D) generation to include image (2D) and video (3D) generation, processing\nspatial and temporal information presents unique challenges to quality,\nperformance, and efficiency. We present the first work towards understanding\nthis new system design space for multi-modal text-to-image (TTI) and\ntext-to-video (TTV) generation models. Current model architecture designs are\nbifurcated into 2 categories: Diffusion- and Transformer-based models. Our\nsystematic performance characterization on a suite of eight representative\nTTI/TTV models shows that after state-of-the-art optimization techniques such\nas Flash Attention are applied, Convolution accounts for up to 44% of execution\ntime for Diffusion-based TTI models, while Linear layers consume up to 49% of\nexecution time for Transformer-based models. We additionally observe that\nDiffusion-based TTI models resemble the Prefill stage of LLM inference, and\nbenefit from 1.1-2.5x greater speedup from Flash Attention than\nTransformer-based TTI models that resemble the Decode phase. Since\noptimizations designed for LLMs do not map directly onto TTI/TTV models, we\nmust conduct a thorough characterization of these workloads to gain insights\nfor new optimization opportunities. In doing so, we define sequence length in\nthe context of TTI/TTV models and observe sequence length can vary up to 4x in\nDiffusion model inference. We additionally observe temporal aspects of TTV\nworkloads pose unique system bottlenecks, with Temporal Attention accounting\nfor over 60% of total Attention time. Overall, our in-depth system performance\ncharacterization is a critical first step towards designing efficient and\ndeployable systems for emerging TTI/TTV workloads.\n","authors":["Alicia Golden","Samuel Hsia","Fei Sun","Bilge Acun","Basil Hosmer","Yejin Lee","Zachary DeVito","Jeff Johnson","Gu-Yeon Wei","David Brooks","Carole-Jean Wu"],"pdf_url":"https://arxiv.org/pdf/2312.14385v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.14383v1","updated":"2023-12-22T02:19:23Z","published":"2023-12-22T02:19:23Z","title":"Removing Interference and Recovering Content Imaginatively for Visible\n Watermark Removal","summary":" Visible watermarks, while instrumental in protecting image copyrights,\nfrequently distort the underlying content, complicating tasks like scene\ninterpretation and image editing. Visible watermark removal aims to eliminate\nthe interference of watermarks and restore the background content. However,\nexisting methods often implement watermark component removal and background\nrestoration tasks within a singular branch, leading to residual watermarks in\nthe predictions and ignoring cases where watermarks heavily obscure the\nbackground. To address these limitations, this study introduces the Removing\nInterference and Recovering Content Imaginatively (RIRCI) framework. RIRCI\nembodies a two-stage approach: the initial phase centers on discerning and\nsegregating the watermark component, while the subsequent phase focuses on\nbackground content restoration. To achieve meticulous background restoration,\nour proposed model employs a dual-path network capable of fully exploring the\nintrinsic background information beneath semi-transparent watermarks and\nperipheral contextual information from unaffected regions. Moreover, a Global\nand Local Context Interaction module is built upon multi-layer perceptrons and\nbidirectional feature transformation for comprehensive representation modeling\nin the background restoration phase. The efficacy of our approach is\nempirically validated across two large-scale datasets, and our findings reveal\na marked enhancement over existing watermark removal techniques.\n","authors":["Yicheng Leng","Chaowei Fang","Gen Li","Yixiang Fang","Guanbin Li"],"pdf_url":"https://arxiv.org/pdf/2312.14383v1.pdf","comment":"Accepted by AAAI2024"}]}}
\ No newline at end of file
diff --git a/favicon.ico b/favicon.ico
new file mode 100644
index 00000000..7f5166c7
Binary files /dev/null and b/favicon.ico differ
diff --git a/index.css b/index.css
new file mode 100644
index 00000000..9ded9d94
--- /dev/null
+++ b/index.css
@@ -0,0 +1,355 @@
+:root {
+ /* Palette: Nord (https://www.nordtheme.com)*/
+ --nord00: #2e3440;
+ --nord01: #3b4252;
+ --nord02: #434c5e;
+ --nord03: #4c566a;
+ --nord04: #d8dee9;
+ --nord05: #e5e9f0;
+ --nord06: #eceff4;
+ --nord07: #8fbcbb;
+ --nord08: #88c0d0;
+ --nord09: #81a1c1;
+ --nord0A: #5e81ac;
+ --nord0B: #bf616a;
+ --nord0C: #d08770;
+ --nord0D: #ebcb8b;
+ --nord0E: #a3be8c;
+ --nord0F: #b48ead;
+
+
+ /* Typograph */
+ --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue",
+ sans-serif;
+ --font-size-scaler: 62.5%;
+ --font-size-m: 1.6rem;
+ --font-size-s: 1.4rem;
+
+ /* Components */
+ --body-color: var(--nord06);
+ --body-bg: var(--nord00);
+
+ --header-title: var(--nord06);
+ --header-container: var(--nord00);
+ --header-title-preffix: var(--nord0F);
+
+ --chip-font: var(--nord08);
+ --chip-color: var(--nord0B);
+
+ --icons: var(--nord06);
+ --icons-hover: var(--nord0F);
+
+ --day-container: var(--nord01);
+ --date: var(--nord09);
+
+ --summary: var(--nord0E);
+ --summary-hover: var(--nord0F);
+
+ --details-open: var(--nord02);
+ --details-content: var(--nord05);
+ --details-a: var(--nord07);
+ --details-a-hover: var(--nord0F);
+
+ --highlight-title: var(--nord0B);
+ --highlight-author: var(--nord0B);
+
+ --article-summary-hover-color: var(--nord0D);
+ --article-summary-color: var(--nord04);
+
+ --article-title-color: var(--nord05);
+ --article-title-hover-color: var(--nord0E);
+
+ --accordion-content-rail-color: var(--nord01);
+ --accordion-content-hover-rail-color: var(--nord0D);
+ --accordion-title-marker-color: var(--nord01);
+ --accordion-title-hover-marker-color: var(--nord0E);
+
+ --footer-color: var(--nord04);
+ --footer-link-hover-color: var(--nord0D);
+}
+
+[data-theme="light"] {
+ /* Theme design */
+
+ --color-primary: var(--nord07);
+ --color-primary-second: var(--nord00);
+ --color-info: var(--nord0A);
+ --color-success: var(--nord0E);
+ --color-warning: var(--nord0C);
+ --color-danger: var(--nord0B);
+
+ --color-text: var(--nord00);
+ --color-hover: var(--nord0D);
+ --color-shadow: var(--nord03);
+
+ --color-primary-h: var(--nord09);
+ --color-primary-s: var(--nord08);
+ --color-primary-l: var(--nord07);
+
+ --color-contrast-higher-h: var(--nord01);
+ --color-contrast-higher-l: var(--nord02);
+ --color-contrast-higher-s: var(--nord03);
+
+ --color-content: white;
+
+ --background: var(--nord06);
+ --background-content: var(--nord05);
+ --background-color: var(--nord04);
+
+ /* Components */
+
+ --chip-font: var(--nord06);
+ --chip-color: var(--nord09);
+
+ --body-color: var(--background-color);
+ --body-bg: var(--background);
+
+ --header-title: var(--color-shadow);
+ --header-container: var(--background);
+ --header-title-preffix: var(--color-primary-h);
+
+ --icons: var(--color-shadow);
+ --icons-hover: var(--color-hover);
+
+ --day-container: var(--background-content);
+ --date: var(--color-primary-l);
+
+ --summary: var(--color-info);
+ --summary-hover: var(--color-success);
+
+ --details-open: var(--color-content);
+ --details-content: var(--color-text);
+ --details-a: var(--color-primary-h);
+ --details-a-hover: var(--color-hover);
+
+ --highlight-title: var(--color-danger);
+ --highlight-author: var(--color-warning);
+
+ --article-summary-color: var(--color-text);
+ --article-summary-hover-color: var(--color-primary-s);
+
+ --article-title-color: var(--color-primary);
+ --article-title-hover-color: var(--color-success);
+
+ --accordion-content-rail-color: var(--color-warning);
+ --accordion-content-hover-rail-color: var(--color-warning);
+ --accordion-title-marker-color: var(--color-success);
+ --accordion-title-hover-marker-color: var(--color-success);
+
+ --footer-color: var(--color-text);
+ --footer-link-hover-color: var(--color-hover);
+}
+
+html {
+ font-size: var(--font-size-scaler);
+}
+
+body {
+ background-color: var(--body-bg);
+ font-family: var(--font-family-default);
+ color: var(--body-color);
+ margin: 0;
+ padding-top: 16px;
+ display: grid;
+}
+
+.header-container {
+ width: 90%;
+ max-width: 1200px;
+ background: var(--header-container);
+ margin: 0 auto;
+}
+
+.header-title {
+ font-size: 32px;
+ font-weight: bold;
+ color: var(--header-title);
+ margin: 0;
+ padding-bottom: 14px;
+}
+
+.header-title-preffix {
+ color: var(--header-title-preffix);
+}
+
+.icons {
+ color: var(--icons);
+ padding-bottom: 16px;
+}
+
+.icons a {
+ color: var(--icons);
+ text-decoration: none;
+}
+
+.icons a:hover {
+ color: var(--icons-hover);
+}
+
+.day-container {
+ padding: 16px 16px 16px 16px;
+ background: var(--day-container);
+ width: 90%;
+ max-width: 1200px;
+ margin: 0 auto;
+ margin-bottom: 8px;
+ border-radius: 10px;
+}
+
+.date {
+ font-size: 24px;
+ font-weight: 700;
+ margin: 0;
+ color: var(--date);
+}
+
+p {
+ margin: 0;
+}
+
+summary {
+ font-weight: 600;
+ color: var(--summary);
+}
+
+summary:hover {
+ text-decoration: underline;
+ cursor: pointer;
+ color: var(--summary-hover);
+}
+
+details {
+ --border-color: transparent;
+
+ padding: 2px 4px;
+ font-size: 20px;
+ border: 1px solid var(--border-color);
+ border-radius: 4px;
+}
+
+details[open] {
+ background-color: var(--details-open);
+ margin-bottom: 8px;
+}
+
+.details-content {
+ padding: 12px 3px;
+ gap: 16px;
+ color: var(--details-content);
+}
+
+details a {
+ color: var(--details-a);
+}
+
+details a:hover {
+ color: var(--details-a-hover);
+}
+
+footer {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ justify-content: space-between;
+}
+
+.description {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ text-align: center;
+}
+
+.highlight-author {
+ color: var(--highlight-author);
+ font-weight: bold;
+}
+
+.highlight-title {
+ color: var(--highlight-title);
+ font-weight: bold;
+}
+
+.channel-description {
+ text-align: center;
+ font-size: var(--font-size-scaler);
+}
+
+.article-summary-link {
+ color: var(--article-summary-color);
+ font-size: var(--font-size-s);
+ text-decoration: none;
+}
+
+.article-summary-link:hover {
+ color: var(--article-summary-hover-color);
+ --accordion-content-rail-color: var(--accordion-content-hover-rail-color);
+}
+
+.article-summary-box-outer {
+ display: block;
+ padding: 4px 8px 8px 4px;
+}
+
+.article-summary-box-inner {
+ padding-left: 8px;
+ border-left: 1px solid var(--accordion-content-rail-color);
+ font-size: var(--font-size-m);
+}
+
+.article-expander {
+ padding: 10px 4px;
+ border-radius: 4px;
+}
+
+.article-authors {
+ font-size: var(--font-size-m);
+ padding: 0.25em 1em;
+}
+
+.article-authors a {
+ text-decoration: none;
+}
+
+.article-expander-title {
+ font-size: var(--font-size-m);
+ font-weight: 600;
+}
+
+.article-expander-title:hover {
+ cursor: pointer;
+}
+
+.article-expander-title::marker {
+ color: var(--accordion-title-marker-color);
+}
+
+.article-expander-title:hover::marker {
+ color: var(--accordion-title-hover-marker-color);
+}
+
+/* for switcher */
+.theme-switch {
+ display: inline-block;
+ position: relative;
+}
+
+.theme-switch input {
+ display: none;
+}
+
+/* chip */
+.chip {
+ font-size: 90%;
+ align-items: center;
+ color: var(--chip-font);
+ background: var(--chip-color);
+ border-radius: 5rem;
+ display: inline-flex;
+ padding: .2rem .4rem;
+ vertical-align: middle;
+}
\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 00000000..237122cb
--- /dev/null
+++ b/index.html
@@ -0,0 +1,62722 @@
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 34
+
+
+
+
+
+ ☆ NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language
+ Models via Complexity Classes
+
+
+ Complex reasoning ability is one of the most important features of current
+LLMs, which has also been leveraged to play an integral role in complex
+decision-making tasks. Therefore, the investigation into the reasoning
+capabilities of Large Language Models (LLMs) is critical: numerous benchmarks
+have been established to assess the reasoning abilities of LLMs. However,
+current benchmarks are inadequate in offering a rigorous evaluation of the full
+extent of reasoning abilities that LLMs are capable of achieving. They are also
+prone to the risk of overfitting, as these benchmarks, being publicly
+accessible and static, allow models to potentially tailor their responses to
+specific benchmark metrics, thereby inflating their performance. Addressing
+these limitations, our research introduces a new benchmark, named NPHardEval.
+This benchmark is designed to evaluate the reasoning abilities of LLMs across a
+broad spectrum of 900 algorithmic questions, extending up to the NP-Hard
+complexity class. These questions are meticulously chosen to represent a wide
+range of complexity class below the NP-hard complexity class, offering a
+rigorous measure of the reasoning ability of LLMs. Through this study, we shed
+light on the current state of reasoning in LLMs, providing an objective and
+rigorous perspective through the comparison of LLMs' performance across complex
+classes. Moreover, this benchmark is designed with a dynamic update mechanism,
+where the datapoints are refreshed on a monthly basis. Such regular updates
+play a crucial role in mitigating the risk of LLMs overfitting to the
+benchmark, promoting a more accurate and reliable assessment of their reasoning
+capabilities. The benchmark dataset and code of NPHardEval are available at
+https://github.com/casmlab/NPHardEval.
+
+
+
+ comment: 22 pages, 6 figures, 2 tables
+
+
+
+
+
+
+ ☆ Robust Knowledge Extraction from Large Language Models using Social
+ Choice Theory AAMAS 2024
+
+
+ Large-language models (LLMs) have the potential to support a wide range of
+applications like conversational agents, creative writing, text improvement,
+and general query answering. However, they are ill-suited for query answering
+in high-stake domains like medicine because they generate answers at random and
+their answers are typically not robust - even the same query can result in
+different answers when prompted multiple times. In order to improve the
+robustness of LLM queries, we propose using ranking queries repeatedly and to
+aggregate the queries using methods from social choice theory. We study ranking
+queries in diagnostic settings like medical and fault diagnosis and discuss how
+the Partial Borda Choice function from the literature can be applied to merge
+multiple query results. We discuss some additional interesting properties in
+our setting and evaluate the robustness of our approach empirically.
+
+
+
+ comment: Accepted by AAMAS 2024 as a full paper
+
+ Financial reports offer critical insights into a company's operations, yet
+their extensive length typically spanning 30 40 pages poses challenges for
+swift decision making in dynamic markets. To address this, we leveraged
+finetuned Large Language Models (LLMs) to distill key indicators and
+operational metrics from these reports basis questions from the user. We
+devised a method to locate critical data, and leverage the FinQA dataset to
+fine-tune both Llama-2 7B and T5 models for customized question answering. We
+achieved results comparable to baseline on the final numerical answer, a
+competitive accuracy in numerical reasoning and calculation.
+
+
+
+ comment: 10 pages, 11 figures, 6 tables
+
+
+
+
+
+
+ ☆ VIEScore: Towards Explainable Metrics for Conditional Image Synthesis
+ Evaluation
+
+
+
+
+
+
+
+
+ Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
+
+
+ In the rapidly advancing field of conditional image generation research,
+challenges such as limited explainability lie in effectively evaluating the
+performance and capabilities of various models. This paper introduces VIESCORE,
+a Visual Instruction-guided Explainable metric for evaluating any conditional
+image generation tasks. VIESCORE leverages general knowledge from Multimodal
+Large Language Models (MLLMs) as the backbone and does not require training or
+fine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image
+tasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of
+0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
+VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
+evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
+human ratings in the generation tasks but struggles in editing tasks. With
+these results, we believe VIESCORE shows its great potential to replace human
+judges in evaluating image synthesis tasks.
+
+
+
+
+
+
+
+ ☆ YAYI 2: Multilingual Open-Source Large Language Models
+
+
+ As the latest advancements in natural language processing, large language
+models (LLMs) have achieved human-level language understanding and generation
+abilities in many real-world tasks, and even have been regarded as a potential
+path to the artificial general intelligence. To better facilitate research on
+LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been
+proposed and gained comparable performances to proprietary models. However,
+these models are primarily designed for English scenarios and exhibit poor
+performances in Chinese contexts. In this technical report, we propose YAYI 2,
+including both base and chat models, with 30 billion parameters. YAYI 2 is
+pre-trained from scratch on a multilingual corpus which contains 2.65 trillion
+tokens filtered by our pre-training data processing pipeline. The base model is
+aligned with human values through supervised fine-tuning with millions of
+instructions and reinforcement learning from human feedback. Extensive
+experiments on multiple benchmarks, such as MMLU and CMMLU, consistently
+demonstrate that the proposed YAYI 2 outperforms other similar sized
+open-source models.
+
+
+
+
+
+
+
+ ☆ On the Use of Metaphor Translation in Psychiatry
+
+
+ Providing mental healthcare to individuals with limited English proficiency
+(LEP) remains a pressing problem within psychiatry. Because the majority of
+individuals trained in providing psychiatric care are English speakers, the
+quality of mental healthcare given to LEP patients is significantly lower than
+that provided for English speakers. The provision of mental healthcare is
+contingent on communication and understanding between the patient and
+healthcare provider, much more so than in the realm of physical healthcare, and
+English speakers are often unable to comprehend figurative language such as
+metaphors used by LEPs. Hence, Figurative Language Translation is invaluable to
+providing equitable psychiatric care. Now, metaphor has been shown to be
+paramount in both identifying individuals struggling with mental problems and
+helping those individuals understand and communicate their experiences.
+Therefore, this paper aims to survey the potential of Machine Translation for
+providing equitable psychiatric healthcare and highlights the need for further
+research on the transferability of existing machine and metaphor translation
+research in the domain of psychiatry.
+
+
+
+
+
+
+
+ ☆ Semantic Parsing for Complex Data Retrieval: Targeting Query Plans vs.
+ SQL for No-Code Access to Relational Databases
+
+
+
+
+
+
+
+
+ Ben Eyal, Amir Bachar, Ophir Haroche, Michael Elhadad
+
+
+ Large Language Models (LLMs) have spurred progress in text-to-SQL, the task
+of generating SQL queries from natural language questions based on a given
+database schema. Despite the declarative nature of SQL, it continues to be a
+complex programming language. In this paper, we investigate the potential of an
+alternative query language with simpler syntax and modular specification of
+complex queries. The purpose is to create a query language that can be learned
+more easily by modern neural semantic parsing architectures while also enabling
+non-programmers to better assess the validity of the query plans produced by an
+interactive query plan assistant.
+ The proposed alternative query language is called Query Plan Language (QPL).
+It is designed to be modular and can be translated into a restricted form of
+SQL Common Table Expressions (CTEs). The aim of QPL is to make complex data
+retrieval accessible to non-programmers by allowing users to express their
+questions in natural language while also providing an easier-to-verify target
+language. The paper demonstrates how neural LLMs can benefit from QPL's
+modularity to generate complex query plans in a compositional manner. This
+involves a question decomposition strategy and a planning stage.
+ We conduct experiments on a version of the Spider text-to-SQL dataset that
+has been converted to QPL. The hierarchical structure of QPL programs enables
+us to measure query complexity naturally. Based on this assessment, we identify
+the low accuracy of existing text-to-SQL systems on complex compositional
+queries. We present ways to address the challenge of complex queries in an
+iterative, user-controlled manner, using fine-tuned LLMs and a variety of
+prompting strategies in a compositional manner.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2310.13575
+
+
+
+
+
+
+ ☆ Large Language Model (LLM) Bias Index -- LLMBI
+
+
+ The Large Language Model Bias Index (LLMBI) is a pioneering approach designed
+to quantify and address biases inherent in large language models (LLMs), such
+as GPT-4. We recognise the increasing prevalence and impact of LLMs across
+diverse sectors. This research introduces a novel metric, LLMBI, to
+systematically measure and mitigate biases potentially skewing model responses.
+We formulated LLMBI using a composite scoring system incorporating multiple
+dimensions of bias, including but not limited to age, gender, and racial
+biases.
+ To operationalise this metric, we engaged in a multi-step process involving
+collecting and annotating LLM responses, applying sophisticated Natural
+Language Processing (NLP) techniques for bias detection, and computing the
+LLMBI score through a specially crafted mathematical formula. The formula
+integrates weighted averages of various bias dimensions, a penalty for dataset
+diversity deficiencies, and a correction for sentiment biases. Our empirical
+analysis, conducted using responses from OpenAI's API, employs advanced
+sentiment analysis as a representative method for bias detection.
+ The research reveals LLMs, whilst demonstrating impressive capabilities in
+text generation, exhibit varying degrees of bias across different dimensions.
+LLMBI provides a quantifiable measure to compare biases across models and over
+time, offering a vital tool for systems engineers, researchers and regulators
+in enhancing the fairness and reliability of LLMs. It highlights the potential
+of LLMs in mimicking unbiased human-like responses. Additionally, it
+underscores the necessity of continuously monitoring and recalibrating such
+models to align with evolving societal norms and ethical standards.
+
+
+
+
+
+
+
+ ☆ Computational Semantics and Evaluation Benchmark for Interrogative
+ Sentences via Combinatory Categorial Grammar ACL
+
+
+ We present a compositional semantics for various types of polar questions and
+wh-questions within the framework of Combinatory Categorial Grammar (CCG). To
+assess the explanatory power of our proposed analysis, we introduce a
+question-answering dataset QSEM specifically designed to evaluate the semantics
+of interrogative sentences. We implement our analysis using existing CCG
+parsers and conduct evaluations using the dataset. Through the evaluation, we
+have obtained annotated data with CCG trees and semantic representations for
+about half of the samples included in QSEM. Furthermore, we discuss the
+discrepancy between the theoretical capacity of CCG and the capabilities of
+existing CCG parsers.
+
+
+
+ comment: 11 pages, to appear in the Proceedings of PACLIC37
+
+
+
+
+
+
+ ☆ Balancing the Style-Content Trade-Off in Sentiment Transfer Using
+ Polarity-Aware Denoising
+
+
+ Text sentiment transfer aims to flip the sentiment polarity of a sentence
+(positive to negative or vice versa) while preserving its sentiment-independent
+content. Although current models show good results at changing the sentiment,
+content preservation in transferred sentences is insufficient. In this paper,
+we present a sentiment transfer model based on polarity-aware denoising, which
+accurately controls the sentiment attributes in generated text, preserving the
+content to a great extent and helping to balance the style-content trade-off.
+Our proposed model is structured around two key stages in the sentiment
+transfer process: better representation learning using a shared encoder and
+sentiment-controlled generation using separate sentiment-specific decoders.
+Empirical results show that our methods outperforms state-of-the-art baselines
+in terms of content preservation while staying competitive in terms of style
+transfer accuracy and fluency.
+
+
+
+ comment: Published in 25th International Conference on Text, Speech and
+ Dialogue (TSD 2022)
+
+
+
+
+
+
+ ☆ Collaborative Synthesis of Patient Records through Multi-Visit Health
+ State Inference AAAI 2024
+
+
+ Electronic health records (EHRs) have become the foundation of machine
+learning applications in healthcare, while the utility of real patient records
+is often limited by privacy and security concerns. Synthetic EHR generation
+provides an additional perspective to compensate for this limitation. Most
+existing methods synthesize new records based on real EHR data, without
+consideration of different types of events in EHR data, which cannot control
+the event combinations in line with medical common sense. In this paper, we
+propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR
+synthesis to address these limitations. First, we formulate the synthetic EHR
+generation process as a probabilistic graphical model and tightly connect
+different types of events by modeling the latent health states. Then, we derive
+a health state inference method tailored for the multi-visit scenario to
+effectively utilize previous records to synthesize current and future records.
+Furthermore, we propose to generate medical reports to add textual descriptions
+for each medical event, providing broader applications for synthesized EHR
+data. For generating different paragraphs in each visit, we incorporate a
+multi-generator deliberation framework to collaborate the message passing of
+multiple generators and employ a two-phase decoding strategy to generate
+high-quality reports. Our extensive experiments on the widely used benchmarks,
+MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results
+on the quality of synthetic data while maintaining low privacy risks.
+
+
+ Confidence estimation, in which we estimate the reliability of each
+recognized token (e.g., word, sub-word, and character) in automatic speech
+recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an
+important function for developing ASR applications. In this study, we perform
+confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR
+systems show high performance (e.g., around 5% token error rates) for various
+ASR tasks. In such situations, confidence estimation becomes difficult since we
+need to detect infrequent incorrect tokens from mostly correct token sequences.
+To tackle this imbalanced dataset problem, we employ a bidirectional long
+short-term memory (BLSTM)-based model as a strong binary-class
+(correct/incorrect) sequence labeler that is trained with a class balancing
+objective. We experimentally confirmed that, by utilizing several types of ASR
+decoding scores as its auxiliary features, the model steadily shows high
+confidence estimation performance under highly imbalanced settings. We also
+confirmed that the BLSTM-based model outperforms Transformer-based confidence
+estimation models, which greatly underestimate incorrect tokens.
+
+
+
+ comment: Accepted to ICASSP 2021
+
+
+
+
+
+
+ ☆ Reasons to Reject? Aligning Language Models with Judgments
+
+
+
+
+
+
+
+
+ Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
+
+
+ As humans, we consistently engage in interactions with our peers and receive
+feedback in the form of natural language. This language feedback allows us to
+reflect on our actions, maintain appropriate behavior, and rectify our errors.
+The question arises naturally: can we use language feedback to align large
+language models (LLMs)? In contrast to previous research that aligns LLMs with
+reward or preference data, we present the first systematic exploration of
+alignment through the lens of language feedback (i.e., judgment). We commence
+with an in-depth investigation of potential methods that can be adapted for
+aligning LLMs with judgments, revealing that these methods are unable to fully
+capitalize on the judgments. To facilitate more effective utilization of
+judgments, we propose a novel framework, Contrastive Unlikelihood Training
+(CUT), that allows for fine-grained inappropriate content detection and
+correction based on judgments. Our offline alignment results show that, with
+merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B
+DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The
+online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b)
+in an iterative fashion using model-specific judgment data, with a steady
+performance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis
+further suggests that judgments exhibit greater potential than rewards for LLM
+alignment and warrant future research.
+
+
+
+ comment: Our source codes and models are publicly available at
+ https://github.com/wwxu21/CUT
+
+
+
+
+
+
+ ☆ SIG: Speaker Identification in Literature via Prompt-Based Generation AAAI 2024
+
+
+
+
+
+
+
+
+ Zhenlin Su, Liyan Xu, Jin Xu, Jiangnan Li, Mingdu Huangfu
+
+
+ Identifying speakers of quotations in narratives is an important task in
+literary analysis, with challenging scenarios including the out-of-domain
+inference for unseen speakers, and non-explicit cases where there are no
+speaker mentions in surrounding context. In this work, we propose a simple and
+effective approach SIG, a generation-based method that verbalizes the task and
+quotation input based on designed prompt templates, which also enables easy
+integration of other auxiliary tasks that further bolster the speaker
+identification performance. The prediction can either come from direct
+generation by the model, or be determined by the highest generation probability
+of each speaker candidate. Based on our approach design, SIG supports
+out-of-domain evaluation, and achieves open-world classification paradigm that
+is able to accept any forms of candidate input. We perform both cross-domain
+evaluation and in-domain evaluation on PDNC, the largest dataset of this task,
+where empirical results suggest that SIG outperforms previous baselines of
+complicated designs, as well as the zero-shot ChatGPT, especially excelling at
+those hard non-explicit scenarios by up to 17% improvement. Additional
+experiments on another dataset WP further corroborate the efficacy of SIG.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ Aurora:Activating Chinese chat capability for Mistral-8x7B sparse
+ Mixture-of-Experts through Instruction-Tuning
+
+
+
+
+
+
+
+
+ Rongsheng Wang, Haoming Chen, Ruizhe Zhou, Yaofei Duan, Kunyan Cai, Han Ma, Jiaxi Cui, Jian Li, Patrick Cheong-Iao Pang, Yapeng Wang, Tao Tan
+
+
+ Existing research has demonstrated that refining large language models (LLMs)
+through the utilization of machine-generated instruction-following data
+empowers these models to exhibit impressive zero-shot capabilities for novel
+tasks, without requiring human-authored instructions. In this paper, we
+systematically investigate, preprocess, and integrate three Chinese
+instruction-following datasets with the aim of enhancing the Chinese
+conversational capabilities of Mixtral-8x7B sparse Mixture-of-Experts model.
+Through instruction fine-tuning on this carefully processed dataset, we
+successfully construct the Mixtral-8x7B sparse Mixture-of-Experts model named
+"Aurora." To assess the performance of Aurora, we utilize three widely
+recognized benchmark tests: C-Eval, MMLU, and CMMLU. Empirical studies validate
+the effectiveness of instruction fine-tuning applied to Mixtral-8x7B sparse
+Mixture-of-Experts model. This work is pioneering in the execution of
+instruction fine-tuning on a sparse expert-mixed model, marking a significant
+breakthrough in enhancing the capabilities of this model architecture. Our
+code, data and model are publicly available at:
+https://github.com/WangRongsheng/Aurora
+
+
+
+ comment: 10 pages, 2 figures
+
+
+
+
+
+
+ ☆ Automatic Data Retrieval for Cross Lingual Summarization
+
+
+ Cross-lingual summarization involves the summarization of text written in one
+language to a different one. There is a body of research addressing
+cross-lingual summarization from English to other European languages. In this
+work, we aim to perform cross-lingual summarization from English to Hindi. We
+propose pairing up the coverage of newsworthy events in textual and video
+format can prove to be helpful for data acquisition for cross lingual
+summarization. We analyze the data and propose methods to match articles to
+video descriptions that serve as document and summary pairs. We also outline
+filtering methods over reasonable thresholds to ensure the correctness of the
+summaries. Further, we make available 28,583 mono and cross-lingual
+article-summary pairs https://github.com/tingc9/Cross-Sum-News-Aligned. We also
+build and analyze multiple baselines on the collected data and report error
+analysis.
+
+
+ Equivariance is an important feature in machine learning, including language
+models. It ensures that any sequences of phrases with the same meanings are
+interpreted consistently. For example, the sentence 'There is a cat on the
+table' should be interpreted by language models as it is, regardless of
+variations in its token-level expression. Building on this insight, I propose a
+new theory suggesting that insufficient equivariance in language models can
+lead to hallucinations. According to this theory, which is both intuitive and
+novel, language models trained on relatively small datasets tend to
+misinterpret input texts and/or generate incorrect texts (i.e.,
+hallucinations). To test this theory, I developed a toy model known as 'dancing
+men', which is a character-level substitution cipher. Additionally, I propose a
+novel technique based on the T5 (Text To Text Transfer Transformer) model to
+efficiently decipher these codes without relying on frequency analysis. I have
+found that this T5 model can almost completely solve the cipher, demonstrating
+its ability to acquire equivariance in this frame. This method could be scaled
+up to word-level and sentence-level substitution ciphers, analogous to large
+language models without tokenizers or dictionaries. This scalability makes it
+suitable for investigating the proposed link between inadequate equivariance
+acquisition and the emergence of hallucinations.
+
+
+
+
+
+
+
+ ☆ Language Model is a Branch Predictor for Simultaneous Machine
+ Translation ICASSP 2024
+
+
+ The primary objective of simultaneous machine translation (SiMT) is to
+minimize latency while preserving the quality of the final translation. Drawing
+inspiration from CPU branch prediction techniques, we propose incorporating
+branch prediction techniques in SiMT tasks to reduce translation latency.
+Specifically, we utilize a language model as a branch predictor to predict
+potential branch directions, namely, future source words. Subsequently, we
+utilize the predicted source words to decode the output in advance. When the
+actual source word deviates from the predicted source word, we use the real
+source word to decode the output again, replacing the predicted output. To
+further reduce computational costs, we share the parameters of the encoder and
+the branch predictor, and utilize a pre-trained language model for
+initialization. Our proposed method can be seamlessly integrated with any SiMT
+model. Extensive experimental results demonstrate that our approach can improve
+translation quality and latency at the same time. Our code is available at
+https://github.com/YinAoXiong/simt_branch_predictor .
+
+
+
+ comment: Accepted by IEEE ICASSP 2024
+
+
+
+
+
+
+ ☆ MetaAID 2.5: A Secure Framework for Developing Metaverse Applications
+ via Large Language Models
+
+
+ Large language models (LLMs) are increasingly being used in Metaverse
+environments to generate dynamic and realistic content and to control the
+behavior of non-player characters (NPCs). However, the cybersecurity concerns
+associated with LLMs have become increasingly prominent. Previous research has
+primarily focused on patching system vulnerabilities to enhance cybersecurity,
+but these approaches are not well-suited to the Metaverse, where the virtual
+space is more complex, LLMs are vulnerable, and ethical user interaction is
+critical. Moreover, the scope of cybersecurity in the Metaverse is expected to
+expand significantly. This paper proposes a method for enhancing cybersecurity
+through the simulation of user interaction with LLMs. Our goal is to educate
+users and strengthen their defense capabilities through exposure to a
+comprehensive simulation system. This system includes extensive Metaverse
+cybersecurity Q&A and attack simulation scenarios. By engaging with these,
+users will improve their ability to recognize and withstand risks.
+Additionally, to address the ethical implications of user input, we propose
+using LLMs as evaluators to assess user content across five dimensions. We
+further adapt the models through vocabulary expansion training to better
+understand personalized inputs and emoticons. We conduct experiments on
+multiple LLMs and find that our approach is effective.
+
+
+ Large "instruction-tuned" language models (i.e., finetuned to respond to
+instructions) have demonstrated a remarkable ability to generalize zero-shot to
+new tasks. Nevertheless, they depend heavily on human-written instruction data
+that is often limited in quantity, diversity, and creativity, therefore
+hindering the generality of the tuned model. We conducted a quantitative study
+to figure out the efficacy of machine-generated annotations, where we compare
+the results of a fine-tuned BERT model with human v/s machine-generated
+annotations. Applying our methods to the vanilla GPT-3 model, we saw that
+machine generated annotations were 78.54% correct and the fine-tuned model
+achieved a 96.01% model performance compared to the performance with
+human-labelled annotations. This result shows that machine-generated
+annotations are a resource and cost effective way to fine-tune down-stream
+models.
+
+
+
+
+
+
+
+ ☆ Don't Believe Everything You Read: Enhancing Summarization
+ Interpretability through Automatic Identification of Hallucinations in Large
+ Language Models
+
+
+ Large Language Models (LLMs) are adept at text manipulation -- tasks such as
+machine translation and text summarization. However, these models can also be
+prone to hallucination, which can be detrimental to the faithfulness of any
+answers that the model provides. Recent works in combating hallucinations in
+LLMs deal with identifying hallucinated sentences and categorizing the
+different ways in which models hallucinate. This paper takes a deep dive into
+LLM behavior with respect to hallucinations, defines a token-level approach to
+identifying different kinds of hallucinations, and further utilizes this
+token-level tagging to improve the interpretability and faithfulness of LLMs in
+dialogue summarization tasks. Through this, the paper presents a new, enhanced
+dataset and a new training paradigm.
+
+
+
+ comment: All authors contributed equally to this work
+
+ The unique capabilities of Large Language Models (LLMs), such as the natural
+language text generation ability, position them as strong candidates for
+providing explanation for recommendations. However, despite the size of the
+LLM, most existing models struggle to produce zero-shot explanations reliably.
+To address this issue, we propose a framework called Logic-Scaffolding, that
+combines the ideas of aspect-based explanation and chain-of-thought prompting
+to generate explanations through intermediate reasoning steps. In this paper,
+we share our experience in building the framework and present an interactive
+demonstration for exploring our results.
+
+
+
+ comment: The 17th ACM International Conference on Web Search and Data Mining
+ (WSDM 2024)
+
+
+
+
+
+
+ ♻ ☆ Next Steps for Human-Centered Generative AI: A Technical Perspective
+
+
+
+
+
+
+
+
+ Xiang 'Anthony' Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, Bolei Zhou
+
+
+ Through iterative, cross-disciplinary discussions, we define and propose
+next-steps for Human-centered Generative AI (HGAI). We contribute a
+comprehensive research agenda that lays out future directions of Generative AI
+spanning three levels: aligning with human values; assimilating human intents;
+and augmenting human abilities. By identifying these next-steps, we intend to
+draw interdisciplinary research teams to pursue a coherent set of emergent
+ideas in HGAI, focusing on their interested topics while maintaining a coherent
+big picture of the future work landscape.
+
+
+
+
+
+
+
+ ♻ ☆ Are Structural Concepts Universal in Transformer Language Models?
+ Towards Interpretable Cross-Lingual Generalization EMNLP 2023
+
+
+ Large language models (LLMs) have exhibited considerable cross-lingual
+generalization abilities, whereby they implicitly transfer knowledge across
+languages. However, the transfer is not equally successful for all languages,
+especially for low-resource ones, which poses an ongoing challenge. It is
+unclear whether we have reached the limits of implicit cross-lingual
+generalization and if explicit knowledge transfer is viable. In this paper, we
+investigate the potential for explicitly aligning conceptual correspondence
+between languages to enhance cross-lingual generalization. Using the syntactic
+aspect of language as a testbed, our analyses of 43 languages reveal a high
+degree of alignability among the spaces of structural concepts within each
+language for both encoder-only and decoder-only LLMs. We then propose a
+meta-learning-based method to learn to align conceptual spaces of different
+languages, which facilitates zero-shot and few-shot generalization in concept
+classification and also offers insights into the cross-lingual in-context
+learning phenomenon. Experiments on syntactic analysis tasks show that our
+approach achieves competitive results with state-of-the-art methods and narrows
+the performance gap between languages, particularly benefiting those with
+limited resources.
+
+
+
+ comment: Findings of EMNLP 2023 (Camera-Ready)
+
+ Automatic melody-to-lyric generation is a task in which song lyrics are
+generated to go with a given melody. It is of significant practical interest
+and more challenging than unconstrained lyric generation as the music imposes
+additional constraints onto the lyrics. The training data is limited as most
+songs are copyrighted, resulting in models that underfit the complicated
+cross-modal relationship between melody and lyrics. In this work, we propose a
+method for generating high-quality lyrics without training on any aligned
+melody-lyric data. Specifically, we design a hierarchical lyric generation
+framework that first generates a song outline and second the complete lyrics.
+The framework enables disentanglement of training (based purely on text) from
+inference (melody-guided text generation) to circumvent the shortage of
+parallel data.
+ We leverage the segmentation and rhythm alignment between melody and lyrics
+to compile the given melody into decoding constraints as guidance during
+inference. The two-step hierarchical design also enables content control via
+the lyric outline, a much-desired feature for democratizing collaborative song
+creation. Experimental results show that our model can generate high-quality
+lyrics that are more on-topic, singable, intelligible, and coherent than strong
+baselines, for example SongMASS, a SOTA model trained on a parallel dataset,
+with a 24% relative overall quality improvement based on human ratings.
+
+
+
+ comment: ACL 2023. arXiv admin note: substantial text overlap with
+ arXiv:2305.07760
+
+
+
+
+
+
+ ♻ ☆ How Far Have We Gone in Vulnerability Detection Using Large Language
+ Models
+
+
+ As software becomes increasingly complex and prone to vulnerabilities,
+automated vulnerability detection is critically important, yet challenging.
+Given the significant successes of large language models (LLMs) in various
+tasks, there is growing anticipation of their efficacy in vulnerability
+detection. However, a quantitative understanding of their potential in
+vulnerability detection is still missing. To bridge this gap, we introduce a
+comprehensive vulnerability benchmark VulBench. This benchmark aggregates
+high-quality data from a wide range of CTF (Capture-the-Flag) challenges and
+real-world applications, with annotations for each vulnerable function
+detailing the vulnerability type and its root cause. Through our experiments
+encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models
+and static analyzers, we find that several LLMs outperform traditional deep
+learning approaches in vulnerability detection, revealing an untapped potential
+in LLMs. This work contributes to the understanding and utilization of LLMs for
+enhanced software security.
+
+
+
+
+
+
+
+ ♻ ☆ In-Context Probing: Toward Building Robust Classifiers via Probing Large
+ Language Models
+
+
+ Large language models are able to learn new tasks in context, where they are
+provided with instructions and a few annotated examples. However, the
+effectiveness of in-context learning is dependent on the provided context, and
+the performance on a downstream task can vary considerably, depending on the
+instruction. Importantly, such dependency on the context can surface in
+unpredictable ways, e.g., a seemingly more informative instruction might lead
+to a worse performance. In this paper, we propose an alternative approach,
+which we term In-Context Probing (ICP). Similar to in-context learning, we
+contextualize the representation of the input with an instruction, but instead
+of decoding the output prediction, we probe the contextualized representation
+to predict the label. Through a series of experiments on a diverse set of
+classification tasks, we show that in-context probing is significantly more
+robust to changes in instructions. We further show that ICP performs
+competitive or superior to finetuning and can be particularly helpful to build
+classifiers on top of smaller models, with less than a hundred training
+examples.
+
+
+
+
+
+
+
+ ♻ ☆ Aligning Language Models with Human Preferences via a Bayesian Approach NeurIPS 2023
+
+
+ In the quest to advance human-centric natural language generation (NLG)
+systems, ensuring alignment between NLG models and human preferences is
+crucial. For this alignment, current popular methods leverage a reinforcement
+learning (RL) approach with a reward model trained on feedback from humans.
+However, inherent disagreements due to the subjective nature of human
+preferences pose a significant challenge for training the reward model,
+resulting in a deterioration of the NLG performance. To tackle this issue,
+previous approaches typically rely on majority voting or averaging to
+consolidate multiple inconsistent preferences into a merged one. Although
+straightforward to understand and execute, such methods suffer from an
+inability to capture the nuanced degrees of disaggregation among humans and may
+only represent a specialized subset of individuals, thereby lacking the ability
+to quantitatively disclose the universality of human preferences. To address
+this challenge, this paper proposes a novel approach, which employs a Bayesian
+framework to account for the distribution of disagreements among human
+preferences as training a preference model, and names it as d-PM. Besides,
+considering the RL strategy's inefficient and complex training process over the
+training efficiency, we further propose utilizing the contrastive learning
+strategy to train the NLG model with the preference scores derived from the
+d-PM model. Extensive experiments on two human-centric NLG tasks, i.e.,
+emotional support conversation and integrity "Rule-of-Thumb" generation, show
+that our method consistently exceeds previous SOTA models in both automatic and
+human evaluations.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Text normalization for low-resource languages: the case of Ligurian
+
+
+ Text normalization is a crucial technology for low-resource languages which
+lack rigid spelling conventions or that have undergone multiple spelling
+reforms. Low-resource text normalization has so far relied upon hand-crafted
+rules, which are perceived to be more data efficient than neural methods. In
+this paper we examine the case of text normalization for Ligurian, an
+endangered Romance language. We collect 4,394 Ligurian sentences paired with
+their normalized versions, as well as the first open source monolingual corpus
+for Ligurian. We show that, in spite of the small amounts of data available, a
+compact transformer-based model can be trained to achieve very low error rates
+by the use of backtranslation and appropriate tokenization.
+
+
+
+
+
+
+
+ ♻ ☆ Prompt-Based Editing for Text Style Transfer EMNLP
+
+
+ Prompting approaches have been recently explored in text style transfer,
+where a textual prompt is used to query a pretrained language model to generate
+style-transferred texts word by word in an autoregressive manner. However, such
+a generation process is less controllable and early prediction errors may
+affect future word predictions. In this paper, we present a prompt-based
+editing approach for text style transfer. Specifically, we prompt a pretrained
+language model for style classification and use the classification probability
+to compute a style score. Then, we perform discrete search with word-level
+editing to maximize a comprehensive scoring function for the style-transfer
+task. In this way, we transform a prompt-based generation problem into a
+classification one, which is a training-free process and more controllable than
+the autoregressive generation of sentences. In our experiments, we performed
+both automatic and human evaluation on three style-transfer benchmark datasets,
+and show that our approach largely outperforms the state-of-the-art systems
+that have 20 times more parameters. Additional empirical analyses further
+demonstrate the effectiveness of our approach.
+
+
+
+ comment: Accepted by EMNLP Findings 2023
+
+
+
+
+
+
+ ♻ ☆ Is ChatGPT A Good Keyphrase Generator? A Preliminary Study
+
+
+ The emergence of ChatGPT has recently garnered significant attention from the
+computational linguistics community. To demonstrate its capabilities as a
+keyphrase generator, we conduct a preliminary evaluation of ChatGPT for the
+keyphrase generation task. We evaluate its performance in various aspects,
+including keyphrase generation prompts, keyphrase generation diversity, and
+long document understanding. Our evaluation is based on six benchmark datasets,
+and we adopt the prompt suggested by OpenAI while extending it to six candidate
+prompts. We find that ChatGPT performs exceptionally well on all six candidate
+prompts, with minor performance differences observed across the datasets. Based
+on our findings, we conclude that ChatGPT has great potential for keyphrase
+generation. Moreover, we discover that ChatGPT still faces challenges when it
+comes to generating absent keyphrases. Meanwhile, in the final section, we also
+present some limitations and future expansions of this report.
+
+
+
+ comment: Technical Report, 6 pages
+
+
+
+
+
+
+ ♻ ☆ Guiding Language Model Reasoning with Planning Tokens
+
+
+ Large language models (LLMs) have recently attracted considerable interest
+for their ability to perform complex reasoning tasks, such as chain-of-thought
+reasoning. However, most of the existing approaches to enhance this ability
+rely heavily on data-driven methods, while neglecting the structural aspects of
+the model's reasoning capacity. We find that while LLMs can manage individual
+reasoning steps well, they struggle with maintaining consistency across an
+entire reasoning chain. To solve this, we introduce 'planning tokens' at the
+start of each reasoning step, serving as a guide for the model. These token
+embeddings are then fine-tuned along with the rest of the model parameters. Our
+approach requires a negligible increase in trainable parameters (just 0.001%)
+and can be applied through either full fine-tuning or a more
+parameter-efficient scheme. We demonstrate our method's effectiveness by
+applying it to three different LLMs, showing notable accuracy improvements
+across three math word problem datasets w.r.t. plain chain-of-thought
+fine-tuning baselines.
+
+
+
+ comment: 10 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Developing Interactive Tourism Planning: A Dialogue Robot System Powered
+ by a Large Language Model
+
+
+ In recent years, large language models (LLMs) have rapidly proliferated and
+have been utilized in various tasks, including research in dialogue systems. We
+aimed to construct a system that not only leverages the flexible conversational
+abilities of LLMs but also their advanced planning capabilities to reduce the
+speaking load on human interlocutors and efficiently plan trips. Furthermore,
+we propose a method that divides the complex task of a travel agency into
+multiple subtasks, managing each as a separate phase to effectively accomplish
+the task. Our proposed system confirmed a certain level of success by achieving
+fourth place in the Dialogue Robot Competition 2023 preliminaries rounds. We
+report on the challenges identified through the competition.
+
+
+
+ comment: This paper is part of the proceedings of the Dialogue Robot
+ Competition 2023
+
+
+
+
+
+
+ ♻ ☆ NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional,
+ and Explainable Reasoning
+
+
+
+
+
+
+
+
+ Nathaniel Weir, Peter Clark, Benjamin Van Durme
+
+
+ Our goal is a modern approach to answering questions via systematic reasoning
+where answers are supported by human interpretable proof trees grounded in an
+NL corpus of authoritative facts. Such a system would help alleviate the
+challenges of interpretability and hallucination with modern LMs, and the lack
+of grounding of current explanation methods (e.g., Chain-of-Thought). This
+paper proposes a new take on Prolog-based inference engines, where we replace
+handcrafted rules with a combination of neural language modeling, guided
+generation, and semiparametric dense retrieval. Our implementation, NELLIE, is
+the first system to demonstrate fully interpretable, end-to-end grounded QA as
+entailment tree proof search, going beyond earlier work explaining
+known-to-be-true facts from text. In experiments, NELLIE outperforms a
+similar-sized state-of-the-art reasoner [Tafjord et al., 2022] while producing
+knowledge-grounded explanations. We also find NELLIE can exploit both
+semi-structured and NL text corpora to guide reasoning. Together these suggest
+a new way to jointly reap the benefits of both modern neural methods and
+traditional symbolic reasoning.
+
+
+
+
+
+ ☆ MACS: Mass Conditioned 3D Hand and Object Motion Synthesis
+
+
+
+
+
+
+
+
+ Soshi Shimada, Franziska Mueller, Jan Bednarik, Bardia Doosti, Bernd Bickel, Danhang Tang, Vladislav Golyanik, Jonathan Taylor, Christian Theobalt, Thabo Beeler
+
+
+ The physical properties of an object, such as mass, significantly affect how
+we manipulate it with our hands. Surprisingly, this aspect has so far been
+neglected in prior work on 3D motion synthesis. To improve the naturalness of
+the synthesized 3D hand object motions, this work proposes MACS the first MAss
+Conditioned 3D hand and object motion Synthesis approach. Our approach is based
+on cascaded diffusion models and generates interactions that plausibly adjust
+based on the object mass and interaction type. MACS also accepts a manually
+drawn 3D object trajectory as input and synthesizes the natural 3D hand motions
+conditioned by the object mass. This flexibility enables MACS to be used for
+various downstream applications, such as generating synthetic training data for
+ML tasks, fast animation of hands for graphics workflows, and generating
+character interactions for computer games. We show experimentally that a
+small-scale dataset is sufficient for MACS to reasonably generalize across
+interpolated and extrapolated object masses unseen during the training.
+Furthermore, MACS shows moderate generalization to unseen objects, thanks to
+the mass-conditioned contact labels generated by our surface contact synthesis
+model ConNet. Our comprehensive user study confirms that the synthesized 3D
+hand-object interactions are highly plausible and realistic.
+
+
+
+
+
+
+
+ ☆ Training Convolutional Neural Networks with the Forward-Forward
+ algorithm
+
+
+ The recent successes in analyzing images with deep neural networks are almost
+exclusively achieved with Convolutional Neural Networks (CNNs). The training of
+these CNNs, and in fact of all deep neural network architectures, uses the
+backpropagation algorithm where the output of the network is compared with the
+desired result and the difference is then used to tune the weights of the
+network towards the desired outcome. In a 2022 preprint, Geoffrey Hinton
+suggested an alternative way of training which passes the desired results
+together with the images at the input of the network. This so called Forward
+Forward (FF) algorithm has up to now only been used in fully connected
+networks. In this paper, we show how the FF paradigm can be extended to CNNs.
+Our FF-trained CNN, featuring a novel spatially-extended labeling technique,
+achieves a classification accuracy of 99.0% on the MNIST hand-written digits
+dataset. We show how different hyperparameters affect the performance of the
+proposed algorithm and compare the results with CNN trained with the standard
+backpropagation approach. Furthermore, we use Class Activation Maps to
+investigate which type of features are learnt by the FF algorithm.
+
+
+
+
+
+
+
+
+ James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller
+
+
+ Combining complementary sensor modalities is crucial to providing robust
+perception for safety-critical robotics applications such as autonomous driving
+(AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on
+monocular depth estimation which is a notoriously difficult task compared to
+using depth information from the lidar directly. Here, we find that this
+approach does not leverage depth as expected and show that naively improving
+depth estimation does not lead to improvements in object detection performance
+and that, strikingly, removing depth estimation altogether does not degrade
+object detection performance. This suggests that relying on monocular depth
+could be an unnecessary architectural bottleneck during camera-lidar fusion. In
+this work, we introduce a novel fusion method that bypasses monocular depth
+estimation altogether and instead selects and fuses camera and lidar features
+in a bird's-eye-view grid using a simple attention mechanism. We show that our
+model can modulate its use of camera features based on the availability of
+lidar features and that it yields better 3D object detection on the nuScenes
+dataset than baselines relying on monocular depth estimation.
+
+
+
+
+
+
+
+ ☆ PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF
+
+
+
+
+
+
+
+
+ Mohsen Gholami, Rabab Ward, Z. Jane Wang
+
+
+ This paper proposes an end-to-end framework for generating 3D human pose
+datasets using Neural Radiance Fields (NeRF). Public datasets generally have
+limited diversity in terms of human poses and camera viewpoints, largely due to
+the resource-intensive nature of collecting 3D human pose data. As a result,
+pose estimators trained on public datasets significantly underperform when
+applied to unseen out-of-distribution samples. Previous works proposed
+augmenting public datasets by generating 2D-3D pose pairs or rendering a large
+amount of random data. Such approaches either overlook image rendering or
+result in suboptimal datasets for pre-trained models. Here we propose PoseGen,
+which learns to generate a dataset (human 3D poses and images) with a feedback
+loss from a given pre-trained pose estimator. In contrast to prior art, our
+generated data is optimized to improve the robustness of the pre-trained model.
+The objective of PoseGen is to learn a distribution of data that maximizes the
+prediction error of a given pre-trained model. As the learned data distribution
+contains OOD samples of the pre-trained model, sampling data from such a
+distribution for further fine-tuning a pre-trained model improves the
+generalizability of the model. This is the first work that proposes NeRFs for
+3D human data generation. NeRFs are data-driven and do not require 3D scans of
+humans. Therefore, using NeRF for data generation is a new direction for
+convenient user-specific data generation. Our extensive experiments show that
+the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four
+datasets with an average 6% relative improvement.
+
+
+
+
+
+
+
+ ☆ DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus
+ Images
+
+
+
+
+
+
+
+
+ Yevgeniy Men, Jonathan Fhima, Leo Anthony Celi, Lucas Zago Ribeiro, Luis Filipe Nakayama, Joachim A. Behar
+
+
+ Diabetic retinopathy (DR) is a prevalent complication of diabetes associated
+with a significant risk of vision loss. Timely identification is critical to
+curb vision impairment. Algorithms for DR staging from digital fundus images
+(DFIs) have been recently proposed. However, models often fail to generalize
+due to distribution shifts between the source domain on which the model was
+trained and the target domain where it is deployed. A common and particularly
+challenging shift is often encountered when the source- and target-domain
+supports do not fully overlap. In this research, we introduce DRStageNet, a
+deep learning model designed to mitigate this challenge. We used seven publicly
+available datasets, comprising a total of 93,534 DFIs that cover a variety of
+patient demographics, ethnicities, geographic origins and comorbidities. We
+fine-tune DINOv2, a pretrained model of self-supervised vision transformer, and
+implement a multi-source domain fine-tuning strategy to enhance generalization
+performance. We benchmark and demonstrate the superiority of our method to two
+state-of-the-art benchmarks, including a recently published foundation model.
+We adapted the grad-rollout method to our regression task in order to provide
+high-resolution explainability heatmaps. The error analysis showed that 59\% of
+the main errors had incorrect reference labels. DRStageNet is accessible at URL
+[upon acceptance of the manuscript].
+
+
+
+
+
+
+
+ ☆ BrainVis: Exploring the Bridge between Brain and Visual Signals via
+ Image Reconstruction
+
+
+
+
+
+
+
+
+ Honghao Fu, Zhiqi Shen, Jing Jih Chin, Hao Wang
+
+
+ Analyzing and reconstructing visual stimuli from brain signals effectively
+advances understanding of the human visual system. However, the EEG signals are
+complex and contain a amount of noise. This leads to substantial limitations in
+existing works of visual stimuli reconstruction from EEG, such as difficulties
+in aligning EEG embeddings with the fine-grained semantic information and a
+heavy reliance on additional large self-collected dataset for training. To
+address these challenges, we propose a novel approach called BrainVis. Firstly,
+we divide the EEG signals into various units and apply a self-supervised
+approach on them to obtain EEG time-domain features, in an attempt to ease the
+training difficulty. Additionally, we also propose to utilize the
+frequency-domain features to enhance the EEG representations. Then, we
+simultaneously align EEG time-frequency embeddings with the interpolation of
+the coarse and fine-grained semantics in the CLIP space, to highlight the
+primary visual components and reduce the cross-modal alignment difficulty.
+Finally, we adopt the cascaded diffusion models to reconstruct images. Our
+proposed BrainVis outperforms state of the arts in both semantic fidelity
+reconstruction and generation quality. Notably, we reduce the training data
+scale to 10% of the previous work.
+
+
+
+
+
+
+
+ ☆ VIEScore: Towards Explainable Metrics for Conditional Image Synthesis
+ Evaluation
+
+
+
+
+
+
+
+
+ Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
+
+
+ In the rapidly advancing field of conditional image generation research,
+challenges such as limited explainability lie in effectively evaluating the
+performance and capabilities of various models. This paper introduces VIESCORE,
+a Visual Instruction-guided Explainable metric for evaluating any conditional
+image generation tasks. VIESCORE leverages general knowledge from Multimodal
+Large Language Models (MLLMs) as the backbone and does not require training or
+fine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image
+tasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of
+0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
+VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
+evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
+human ratings in the generation tasks but struggles in editing tasks. With
+these results, we believe VIESCORE shows its great potential to replace human
+judges in evaluating image synthesis tasks.
+
+
+
+
+
+
+
+ ☆ Prototype-Guided Text-based Person Search based on Rich Chinese
+ Descriptions
+
+
+ Text-based person search aims to simultaneously localize and identify the
+target person based on query text from uncropped scene images, which can be
+regarded as the unified task of person detection and text-based person
+retrieval task. In this work, we propose a large-scale benchmark dataset named
+PRW-TPS-CN based on the widely used person search dataset PRW. Our dataset
+contains 47,102 sentences, which means there is quite more information than
+existing dataset. These texts precisely describe the person images from top to
+bottom, which in line with the natural description order. We also provide both
+Chinese and English descriptions in our dataset for more comprehensive
+evaluation. These characteristics make our dataset more applicable. To
+alleviate the inconsistency between person detection and text-based person
+retrieval, we take advantage of the rich texts in PRW-TPS-CN dataset. We
+propose to aggregate multiple texts as text prototypes to maintain the
+prominent text features of a person, which can better reflect the whole
+character of a person. The overall prototypes lead to generating the image
+attention map to eliminate the detection misalignment causing the decrease of
+text-based person retrieval. Thus, the inconsistency between person detection
+and text-based person retrieval is largely alleviated. We conduct extensive
+experiments on the PRW-TPS-CN dataset. The experimental results show the
+PRW-TPS-CN dataset's effectiveness and the state-of-the-art performance of our
+approach.
+
+
+
+ comment: 11 pages, 5 figures
+
+
+
+
+
+
+ ☆ Dreaming of Electrical Waves: Generative Modeling of Cardiac Excitation
+ Waves using Diffusion Models
+
+
+
+
+
+
+
+
+ Tanish Baranwal, Jan Lebert, Jan Christoph
+
+
+ Electrical waves in the heart form rotating spiral or scroll waves during
+life-threatening arrhythmias such as atrial or ventricular fibrillation. The
+wave dynamics are typically modeled using coupled partial differential
+equations, which describe reaction-diffusion dynamics in excitable media. More
+recently, data-driven generative modeling has emerged as an alternative to
+generate spatio-temporal patterns in physical and biological systems. Here, we
+explore denoising diffusion probabilistic models for the generative modeling of
+electrical wave patterns in cardiac tissue. We trained diffusion models with
+simulated electrical wave patterns to be able to generate such wave patterns in
+unconditional and conditional generation tasks. For instance, we explored
+inpainting tasks, such as reconstructing three-dimensional wave dynamics from
+superficial two-dimensional measurements, and evolving and generating
+parameter-specific dynamics. We characterized and compared the
+diffusion-generated solutions to solutions obtained with biophysical models and
+found that diffusion models learn to replicate spiral and scroll waves dynamics
+so well that they could serve as an alternative data-driven approach for the
+modeling of excitation waves in cardiac tissue. For instance, we found that it
+is possible to initiate ventricular fibrillation (VF) dynamics instantaneously
+without having to apply pacing protocols in order to induce wavebreak. The VF
+dynamics can be created in arbitrary ventricular geometries and can be evolved
+over time. However, we also found that diffusion models `hallucinate' wave
+patterns when given insufficient constraints. Regardless of these limitations,
+diffusion models are an interesting and powerful tool with many potential
+applications in cardiac arrhythmia research and diagnostics.
+
+
+
+
+
+
+
+ ☆ Plan, Posture and Go: Towards Open-World Text-to-Motion Generation
+
+
+ Conventional text-to-motion generation methods are usually trained on limited
+text-motion pairs, making them hard to generalize to open-world scenarios. Some
+works use the CLIP model to align the motion space and the text space, aiming
+to enable motion generation from natural language motion descriptions. However,
+they are still constrained to generate limited and unrealistic in-place
+motions. To address these issues, we present a divide-and-conquer framework
+named PRO-Motion, which consists of three modules as motion planner,
+posture-diffuser and go-diffuser. The motion planner instructs Large Language
+Models (LLMs) to generate a sequence of scripts describing the key postures in
+the target motion. Differing from natural languages, the scripts can describe
+all possible postures following very simple text templates. This significantly
+reduces the complexity of posture-diffuser, which transforms a script to a
+posture, paving the way for open-world generation. Finally, go-diffuser,
+implemented as another diffusion model, estimates whole-body translations and
+rotations for all postures, resulting in realistic motions. Experimental
+results have shown the superiority of our method with other counterparts, and
+demonstrated its capability of generating diverse and realistic motions from
+complex open-world prompts such as "Experiencing a profound sense of joy". The
+project page is available at https://moonsliu.github.io/Pro-Motion.
+
+
+
+
+
+
+
+ ☆ PARDINUS: Weakly supervised discarding of photo-trapping empty images
+ based on autoencoders
+
+
+
+
+
+
+
+
+ David de la Rosa, Antonio J Rivera, María J del Jesus, Francisco Charte
+
+
+ Photo-trapping cameras are widely employed for wildlife monitoring. Those
+cameras take photographs when motion is detected to capture images where
+animals appear. A significant portion of these images are empty - no wildlife
+appears in the image. Filtering out those images is not a trivial task since it
+requires hours of manual work from biologists. Therefore, there is a notable
+interest in automating this task. Automatic discarding of empty photo-trapping
+images is still an open field in the area of Machine Learning. Existing
+solutions often rely on state-of-the-art supervised convolutional neural
+networks that require the annotation of the images in the training phase.
+PARDINUS (Weakly suPervised discARDINg of photo-trapping empty images based on
+aUtoencoderS) is constructed on the foundation of weakly supervised learning
+and proves that this approach equals or even surpasses other fully supervised
+methods that require further labeling work.
+
+
+
+
+
+
+
+ ☆ The Rate-Distortion-Perception-Classification Tradeoff: Joint Source
+ Coding and Modulation via Inverse-Domain GANs
+
+
+
+
+
+
+
+
+ Junli Fang, João F. C. Mota, Baoshan Lu, Weicheng Zhang, Xuemin Hong
+
+
+ The joint source coding and modulation (JSCM) framework was enabled by recent
+developments in deep learning, which allows to automatically learn from data,
+and in an end-to-end fashion, the best compression codes and modulation
+schemes. In this paper, we show the existence of a strict tradeoff between
+channel rate, distortion, perception, and classification accuracy in a JSCM
+scenario. We then propose two image compression methods to navigate that
+tradeoff: an inverse-domain generative adversarial network (ID-GAN), which
+achieves extreme compression, and a simpler, heuristic method that reveals
+insights about the performance of ID-GAN. Experiment results not only
+corroborate the theoretical findings, but also demonstrate that the proposed
+ID-GAN algorithm significantly improves system performance compared to
+traditional separation-based methods and recent deep JSCM architectures.
+
+
+
+
+
+
+
+ ☆ Compressing Image-to-Image Translation GANs Using Local Density
+ Structures on Their Learned Manifold AAAI
+
+
+ Generative Adversarial Networks (GANs) have shown remarkable success in
+modeling complex data distributions for image-to-image translation. Still,
+their high computational demands prohibit their deployment in practical
+scenarios like edge devices. Existing GAN compression methods mainly rely on
+knowledge distillation or convolutional classifiers' pruning techniques. Thus,
+they neglect the critical characteristic of GANs: their local density structure
+over their learned manifold. Accordingly, we approach GAN compression from a
+new perspective by explicitly encouraging the pruned model to preserve the
+density structure of the original parameter-heavy model on its learned
+manifold. We facilitate this objective for the pruned model by partitioning the
+learned manifold of the original generator into local neighborhoods around its
+generated samples. Then, we propose a novel pruning objective to regularize the
+pruned model to preserve the local density structure over each neighborhood,
+resembling the kernel density estimation method. Also, we develop a
+collaborative pruning scheme in which the discriminator and generator are
+pruned by two pruning agents. We design the agents to capture interactions
+between the generator and discriminator by exchanging their peer's feedback
+when determining corresponding models' architectures. Thanks to such a design,
+our pruning method can efficiently find performant sub-networks and can
+maintain the balance between the generator and discriminator more effectively
+compared to baselines during pruning, thereby showing more stable pruning
+dynamics. Our experiments on image translation GAN models, Pix2Pix and
+CycleGAN, with various benchmark datasets and architectures demonstrate our
+method's effectiveness.
+
+
+
+ comment: The 38th Annual AAAI Conference on Artificial Intelligence, AAAI 2024
+
+
+
+
+
+
+ ☆ Cross-Age and Cross-Site Domain Shift Impacts on Deep Learning-Based
+ White Matter Fiber Estimation in Newborn and Baby Brains
+
+
+ Deep learning models have shown great promise in estimating tissue
+microstructure from limited diffusion magnetic resonance imaging data. However,
+these models face domain shift challenges when test and train data are from
+different scanners and protocols, or when the models are applied to data with
+inherent variations such as the developing brains of infants and children
+scanned at various ages. Several techniques have been proposed to address some
+of these challenges, such as data harmonization or domain adaptation in the
+adult brain. However, those techniques remain unexplored for the estimation of
+fiber orientation distribution functions in the rapidly developing brains of
+infants. In this work, we extensively investigate the age effect and domain
+shift within and across two different cohorts of 201 newborns and 165 babies
+using the Method of Moments and fine-tuning strategies. Our results show that
+reduced variations in the microstructural development of babies in comparison
+to newborns directly impact the deep learning models' cross-age performance. We
+also demonstrate that a small number of target domain samples can significantly
+mitigate domain shift problems.
+
+
+ The issue of generative pretraining for vision models has persisted as a
+long-standing conundrum. At present, the text-to-image (T2I) diffusion model
+demonstrates remarkable proficiency in generating high-definition images
+matching textual inputs, a feat made possible through its pre-training on
+large-scale image-text pairs. This leads to a natural inquiry: can diffusion
+models be utilized to tackle visual perception tasks? In this paper, we propose
+a simple yet effective scheme to harness a diffusion model for visual
+perception tasks. Our key insight is to introduce learnable embeddings (meta
+prompts) to the pre-trained diffusion models to extract proper features for
+perception. The effect of meta prompts are two-fold. First, as a direct
+replacement of the text embeddings in the T2I models, it can activate
+task-relevant features during feature extraction. Second, it will be used to
+re-arrange the extracted features to ensures that the model focuses on the most
+pertinent features for the task on hand. Additionally, we design a recurrent
+refinement training strategy that fully leverages the property of diffusion
+models, thereby yielding stronger visual features. Extensive experiments across
+various benchmarks validate the effectiveness of our approach. Our approach
+achieves new performance records in depth estimation tasks on NYU depth V2 and
+KITTI, and in semantic segmentation task on CityScapes. Concurrently, the
+proposed method attains results comparable to the current state-of-the-art in
+semantic segmentation on ADE20K and pose estimation on COCO datasets, further
+exemplifying its robustness and versatility.
+
+
+
+
+
+
+
+ ☆ Images in Discrete Choice Modeling: Addressing Data Isomorphism in
+ Multi-Modality Inputs
+
+
+ This paper explores the intersection of Discrete Choice Modeling (DCM) and
+machine learning, focusing on the integration of image data into DCM's utility
+functions and its impact on model interpretability. We investigate the
+consequences of embedding high-dimensional image data that shares isomorphic
+information with traditional tabular inputs within a DCM framework. Our study
+reveals that neural network (NN) components learn and replicate tabular
+variable representations from images when co-occurrences exist, thereby
+compromising the interpretability of DCM parameters. We propose and benchmark
+two methodologies to address this challenge: architectural design adjustments
+to segregate redundant information, and isomorphic information mitigation
+through source information masking and inpainting. Our experiments, conducted
+on a semi-synthetic dataset, demonstrate that while architectural modifications
+prove inconclusive, direct mitigation at the data source shows to be a more
+effective strategy in maintaining the integrity of DCM's interpretable
+parameters. The paper concludes with insights into the applicability of our
+findings in real-world settings and discusses the implications for future
+research in hybrid modeling that combines complex data modalities. Full control
+of tabular and image data congruence is attained by using the MIT moral machine
+dataset, and both inputs are merged into a choice model by deploying the
+Learning Multinomial Logit (L-MNL) framework.
+
+
+
+ comment: 17 pages, 7 figures, 3 tables
+
+
+
+
+
+
+ ☆ BonnBeetClouds3D: A Dataset Towards Point Cloud-based Organ-level
+ Phenotyping of Sugar Beet Plants under Field Conditions
+
+
+
+
+
+
+
+
+ Elias Marks, Jonas Bömer, Federico Magistri, Anurag Sah, Jens Behley, Cyrill Stachniss
+
+
+ Agricultural production is facing severe challenges in the next decades
+induced by climate change and the need for sustainability, reducing its impact
+on the environment. Advancements in field management through non-chemical
+weeding by robots in combination with monitoring of crops by autonomous
+unmanned aerial vehicles (UAVs) and breeding of novel and more resilient crop
+varieties are helpful to address these challenges. The analysis of plant
+traits, called phenotyping, is an essential activity in plant breeding, it
+however involves a great amount of manual labor. With this paper, we address
+the problem of automatic fine-grained organ-level geometric analysis needed for
+precision phenotyping. As the availability of real-world data in this domain is
+relatively scarce, we propose a novel dataset that was acquired using UAVs
+capturing high-resolution images of a real breeding trial containing 48 plant
+varieties and therefore covering great morphological and appearance diversity.
+This enables the development of approaches for autonomous phenotyping that
+generalize well to different varieties. Based on overlapping high-resolution
+images from multiple viewing angles, we compute photogrammetric dense point
+clouds and provide detailed and accurate point-wise labels for plants, leaves,
+and salient points as the tip and the base. Additionally, we include
+measurements of phenotypic traits performed by experts from the German Federal
+Plant Variety Office on the real plants, allowing the evaluation of new
+approaches not only on segmentation and keypoint detection but also directly on
+the downstream tasks. The provided labeled point clouds enable fine-grained
+plant analysis and support further progress in the development of automatic
+phenotyping approaches, but also enable further research in surface
+reconstruction, point cloud completion, and semantic interpretation of point
+clouds.
+
+
+ Pulmonary embolism (PE) is a prevalent lung disease that can lead to right
+ventricular hypertrophy and failure in severe cases, ranking second in severity
+only to myocardial infarction and sudden death. Pulmonary artery CT angiography
+(CTPA) is a widely used diagnostic method for PE. However, PE detection
+presents challenges in clinical practice due to limitations in imaging
+technology. CTPA can produce noises similar to PE, making confirmation of its
+presence time-consuming and prone to overdiagnosis. Nevertheless, the
+traditional segmentation method of PE can not fully consider the hierarchical
+structure of features, local and global spatial features of PE CT images. In
+this paper, we propose an automatic PE segmentation method called SCUNet++
+(Swin Conv UNet++). This method incorporates multiple fusion dense skip
+connections between the encoder and decoder, utilizing the Swin Transformer as
+the encoder. And fuses features of different scales in the decoder subnetwork
+to compensate for spatial information loss caused by the inevitable
+downsampling in Swin-UNet or other state-of-the-art methods, effectively
+solving the above problem. We provide a theoretical analysis of this method in
+detail and validate it on publicly available PE CT image datasets FUMPE and
+CAD-PE. The experimental results indicate that our proposed method achieved a
+Dice similarity coefficient (DSC) of 83.47% and a Hausdorff distance 95th
+percentile (HD95) of 3.83 on the FUMPE dataset, as well as a DSC of 83.42% and
+an HD95 of 5.10 on the CAD-PE dataset. These findings demonstrate that our
+method exhibits strong performance in PE segmentation tasks, potentially
+enhancing the accuracy of automatic segmentation of PE and providing a powerful
+diagnostic tool for clinical physicians. Our source code and new FUMPE dataset
+are available at https://github.com/JustlfC03/SCUNet-plusplus.
+
+
+
+ comment: 10 pages, 7 figures, accept wacv2024
+
+
+
+
+
+
+ ☆ Pola4All: survey of polarimetric applications and an open-source toolkit
+ to analyze polarization
+
+
+ Polarization information of the light can provide rich cues for computer
+vision and scene understanding tasks, such as the type of material, pose, and
+shape of the objects. With the advent of new and cheap polarimetric sensors,
+this imaging modality is becoming accessible to a wider public for solving
+problems such as pose estimation, 3D reconstruction, underwater navigation, and
+depth estimation. However, we observe several limitations regarding the usage
+of this sensorial modality, as well as a lack of standards and publicly
+available tools to analyze polarization images. Furthermore, although
+polarization camera manufacturers usually provide acquisition tools to
+interface with their cameras, they rarely include processing algorithms that
+make use of the polarization information. In this paper, we review recent
+advances in applications that involve polarization imaging, including a
+comprehensive survey of recent advances on polarization for vision and robotics
+perception tasks. We also introduce a complete software toolkit that provides
+common standards to communicate with and process information from most of the
+existing micro-grid polarization cameras on the market. The toolkit also
+implements several image processing algorithms for this modality, and it is
+publicly available on GitHub: https://github.com/vibot-lab/Pola4all_JEI_2023.
+
+
+
+
+
+
+
+ ☆ Density Uncertainty Quantification with NeRF-Ensembles: Impact of Data
+ and Scene Constraints
+
+
+
+
+
+
+
+
+ Miriam Jäger, Steven Landgraf, Boris Jutzi
+
+
+ In the fields of computer graphics, computer vision and photogrammetry,
+Neural Radiance Fields (NeRFs) are a major topic driving current research and
+development. However, the quality of NeRF-generated 3D scene reconstructions
+and subsequent surface reconstructions, heavily relies on the network output,
+particularly the density. Regarding this critical aspect, we propose to utilize
+NeRF-Ensembles that provide a density uncertainty estimate alongside the mean
+density. We demonstrate that data constraints such as low-quality images and
+poses lead to a degradation of the training process, increased density
+uncertainty and decreased predicted density. Even with high-quality input data,
+the density uncertainty varies based on scene constraints such as acquisition
+constellations, occlusions and material properties. NeRF-Ensembles not only
+provide a tool for quantifying the uncertainty but exhibit two promising
+advantages: Enhanced robustness and artifact removal. Through the utilization
+of NeRF-Ensembles instead of single NeRFs, small outliers are removed, yielding
+a smoother output with improved completeness of structures. Furthermore,
+applying percentile-based thresholds on density uncertainty outliers proves to
+be effective for the removal of large (foggy) artifacts in post-processing. We
+conduct our methodology on 3 different datasets: (i) synthetic benchmark
+dataset, (ii) real benchmark dataset, (iii) real data under realistic recording
+conditions and sensors.
+
+
+
+ comment: 21 pages, 12 figures, 5 tables
+
+
+
+
+
+
+ ☆ Global Occlusion-Aware Transformer for Robust Stereo Matching
+
+
+ Despite the remarkable progress facilitated by learning-based stereo-matching
+algorithms, the performance in the ill-conditioned regions, such as the
+occluded regions, remains a bottleneck. Due to the limited receptive field,
+existing CNN-based methods struggle to handle these ill-conditioned regions
+effectively. To address this issue, this paper introduces a novel
+attention-based stereo-matching network called Global Occlusion-Aware
+Transformer (GOAT) to exploit long-range dependency and occlusion-awareness
+global context for disparity estimation. In the GOAT architecture, a parallel
+disparity and occlusion estimation module PDO is proposed to estimate the
+initial disparity map and the occlusion mask using a parallel attention
+mechanism. To further enhance the disparity estimates in the occluded regions,
+an occlusion-aware global aggregation module (OGA) is proposed. This module
+aims to refine the disparity in the occluded regions by leveraging restricted
+global correlation within the focus scope of the occluded areas. Extensive
+experiments were conducted on several public benchmark datasets including
+SceneFlow, KITTI 2015, and Middlebury. The results show that the proposed GOAT
+demonstrates outstanding performance among all benchmarks, particularly in the
+occluded regions.
+
+
+ We introduce Neural Flow Maps, a novel simulation method bridging the
+emerging paradigm of implicit neural representations with fluid simulation
+based on the theory of flow maps, to achieve state-of-the-art simulation of
+inviscid fluid phenomena. We devise a novel hybrid neural field representation,
+Spatially Sparse Neural Fields (SSNF), which fuses small neural networks with a
+pyramid of overlapping, multi-resolution, and spatially sparse grids, to
+compactly represent long-term spatiotemporal velocity fields at high accuracy.
+With this neural velocity buffer in hand, we compute long-term, bidirectional
+flow maps and their Jacobians in a mechanistically symmetric manner, to
+facilitate drastic accuracy improvement over existing solutions. These
+long-range, bidirectional flow maps enable high advection accuracy with low
+dissipation, which in turn facilitates high-fidelity incompressible flow
+simulations that manifest intricate vortical structures. We demonstrate the
+efficacy of our neural fluid simulation in a variety of challenging simulation
+scenarios, including leapfrogging vortices, colliding vortices, vortex
+reconnections, as well as vortex generation from moving obstacles and density
+differences. Our examples show increased performance over existing methods in
+terms of energy conservation, visual complexity, adherence to experimental
+observations, and preservation of detailed vortical structures.
+
+
+
+
+
+
+
+ ☆ A Language-based solution to enable Metaverse Retrieval
+
+
+
+
+
+
+
+
+ Ali Abdari, Alex Falcon, Giuseppe Serra
+
+
+ Recently, the Metaverse is becoming increasingly attractive, with millions of
+users accessing the many available virtual worlds. However, how do users find
+the one Metaverse which best fits their current interests? So far, the search
+process is mostly done by word of mouth, or by advertisement on
+technology-oriented websites. However, the lack of search engines similar to
+those available for other multimedia formats (e.g., YouTube for videos) is
+showing its limitations, since it is often cumbersome to find a Metaverse based
+on some specific interests using the available methods, while also making it
+difficult to discover user-created ones which lack strong advertisement. To
+address this limitation, we propose to use language to naturally describe the
+desired contents of the Metaverse a user wishes to find. Second, we highlight
+that, differently from more conventional 3D scenes, Metaverse scenarios
+represent a more complex data format since they often contain one or more types
+of multimedia which influence the relevance of the scenario itself to a user
+query. Therefore, in this work, we create a novel task, called
+Text-to-Metaverse retrieval, which aims at modeling these aspects while also
+taking the cross-modal relations with the textual data into account. Since we
+are the first ones to tackle this problem, we also collect a dataset of 33000
+Metaverses, each of which consists of a 3D scene enriched with multimedia
+content. Finally, we design and implement a deep learning framework based on
+contrastive learning, resulting in a thorough experimental setup.
+
+
+
+ comment: Accepted at 30th International Conference on Multimedia Modeling-
+ MMM2024
+
+
+
+
+
+
+ ☆ DSAP: Analyzing Bias Through Demographic Comparison of Datasets
+
+
+
+
+
+
+
+
+ Iris Dominguez-Catena, Daniel Paternain, Mikel Galar
+
+
+ In the last few years, Artificial Intelligence systems have become
+increasingly widespread. Unfortunately, these systems can share many biases
+with human decision-making, including demographic biases. Often, these biases
+can be traced back to the data used for training, where large uncurated
+datasets have become the norm. Despite our knowledge of these biases, we still
+lack general tools to detect and quantify them, as well as to compare the
+biases in different datasets. Thus, in this work, we propose DSAP (Demographic
+Similarity from Auxiliary Profiles), a two-step methodology for comparing the
+demographic composition of two datasets. DSAP can be deployed in three key
+applications: to detect and characterize demographic blind spots and bias
+issues across datasets, to measure dataset demographic bias in single datasets,
+and to measure dataset demographic shift in deployment scenarios. An essential
+feature of DSAP is its ability to robustly analyze datasets without explicit
+demographic labels, offering simplicity and interpretability for a wide range
+of situations. To show the usefulness of the proposed methodology, we consider
+the Facial Expression Recognition task, where demographic bias has previously
+been found. The three applications are studied over a set of twenty datasets
+with varying properties. The code is available at
+https://github.com/irisdominguez/DSAP.
+
+
+
+ comment: 18 pages, 11 figures
+
+
+
+
+
+
+ ☆ Towards Loose-Fitting Garment Animation via Generative Model of
+ Deformation Decomposition
+
+
+ Existing data-driven methods for garment animation, usually driven by linear
+skinning, although effective on tight garments, do not handle loose-fitting
+garments with complex deformations well. To address these limitations, we
+develop a garment generative model based on deformation decomposition to
+efficiently simulate loose garment deformation without directly using linear
+skinning. Specifically, we learn a garment generative space with the proposed
+generative model, where we decouple the latent representation into unposed
+deformed garments and dynamic offsets during the decoding stage. With explicit
+garment deformations decomposition, our generative model is able to generate
+complex pose-driven deformations on canonical garment shapes. Furthermore, we
+learn to transfer the body motions and previous state of the garment to the
+latent space to regenerate dynamic results. In addition, we introduce a detail
+enhancement module in an adversarial training setup to learn high-frequency
+wrinkles. We demonstrate our method outperforms state-of-the-art data-driven
+alternatives through extensive experiments and show qualitative and
+quantitative analysis of results.
+
+
+
+
+
+
+
+ ☆ Tuning-Free Inversion-Enhanced Control for Consistent Image Editing
+
+
+ Consistent editing of real images is a challenging task, as it requires
+performing non-rigid edits (e.g., changing postures) to the main objects in the
+input image without changing their identity or attributes. To guarantee
+consistent attributes, some existing methods fine-tune the entire model or the
+textual embedding for structural consistency, but they are time-consuming and
+fail to perform non-rigid edits. Other works are tuning-free, but their
+performances are weakened by the quality of Denoising Diffusion Implicit Model
+(DDIM) reconstruction, which often fails in real-world scenarios. In this
+paper, we present a novel approach called Tuning-free Inversion-enhanced
+Control (TIC), which directly correlates features from the inversion process
+with those from the sampling process to mitigate the inconsistency in DDIM
+reconstruction. Specifically, our method effectively obtains inversion features
+from the key and value features in the self-attention layers, and enhances the
+sampling process by these inversion features, thus achieving accurate
+reconstruction and content-consistent editing. To extend the applicability of
+our method to general editing scenarios, we also propose a mask-guided
+attention concatenation strategy that combines contents from both the inversion
+and the naive DDIM editing processes. Experiments show that the proposed method
+outperforms previous works in reconstruction and consistent editing, and
+produces impressive results in various settings.
+
+
+
+
+
+
+
+ ☆ Explainable Multi-Camera 3D Object Detection with Transformer-Based
+ Saliency Maps
+
+
+ Vision Transformers (ViTs) have achieved state-of-the-art results on various
+computer vision tasks, including 3D object detection. However, their end-to-end
+implementation also makes ViTs less explainable, which can be a challenge for
+deploying them in safety-critical applications, such as autonomous driving,
+where it is important for authorities, developers, and users to understand the
+model's reasoning behind its predictions. In this paper, we propose a novel
+method for generating saliency maps for a DetR-like ViT with multiple camera
+inputs used for 3D object detection. Our method is based on the raw attention
+and is more efficient than gradient-based methods. We evaluate the proposed
+method on the nuScenes dataset using extensive perturbation tests and show that
+it outperforms other explainability methods in terms of visual quality and
+quantitative metrics. We also demonstrate the importance of aggregating
+attention across different layers of the transformer. Our work contributes to
+the development of explainable AI for ViTs, which can help increase trust in AI
+applications by establishing more transparency regarding the inner workings of
+AI models.
+
+
+
+
+
+
+
+ ☆ Environment-Specific People
+
+
+
+
+
+
+
+
+ Mirela Ostrek, Soubhik Sanyal, Carol O'Sullivan, Michael J. Black, Justus Thies
+
+
+ Despite significant progress in generative image synthesis and full-body
+generation in particular, state-of-the-art methods are either
+context-independent, overly reliant to text prompts, or bound to the curated
+training datasets, such as fashion images with monotonous backgrounds. Here,
+our goal is to generate people in clothing that is semantically appropriate for
+a given scene. To this end, we present ESP, a novel method for context-aware
+full-body generation, that enables photo-realistic inpainting of people into
+existing "in-the-wild" photographs. ESP is conditioned on a 2D pose and
+contextual cues that are extracted from the environment photograph and
+integrated into the generation process. Our models are trained on a dataset
+containing a set of in-the-wild photographs of people covering a wide range of
+different environments. The method is analyzed quantitatively and
+qualitatively, and we show that ESP outperforms state-of-the-art on the task of
+contextual full-body generation.
+
+
+ Driver distraction is a principal cause of traffic accidents. In a study
+conducted by the National Highway Traffic Safety Administration, engaging in
+activities such as interacting with in-car menus, consuming food or beverages,
+or engaging in telephonic conversations while operating a vehicle can be
+significant sources of driver distraction. From this viewpoint, this paper
+introduces a novel method for detection of driver distraction using multi-view
+driver action images. The proposed method is a vision transformer-based
+framework with pose estimation and action inference, namely PoseViNet. The
+motivation for adding posture information is to enable the transformer to focus
+more on key features. As a result, the framework is more adept at identifying
+critical actions. The proposed framework is compared with various
+state-of-the-art models using SFD3 dataset representing 10 behaviors of
+drivers. It is found from the comparison that the PoseViNet outperforms these
+models. The proposed framework is also evaluated with the SynDD1 dataset
+representing 16 behaviors of driver. As a result, the PoseViNet achieves 97.55%
+validation accuracy and 90.92% testing accuracy with the challenging dataset.
+
+
+
+ comment: This is revised draft submitted to IEEE Sensors Journal
+
+
+
+
+
+
+ ☆ MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning
+
+
+ Prompt learning has demonstrated impressive efficacy in the fine-tuning of
+multimodal large models to a wide range of downstream tasks. Nonetheless,
+applying existing prompt learning methods for the diagnosis of neurological
+disorder still suffers from two issues: (i) existing methods typically treat
+all patches equally, despite the fact that only a small number of patches in
+neuroimaging are relevant to the disease, and (ii) they ignore the structural
+information inherent in the brain connection network which is crucial for
+understanding and diagnosing neurological disorders. To tackle these issues, we
+introduce a novel prompt learning model by learning graph prompts during the
+fine-tuning process of multimodal large models for diagnosing neurological
+disorders. Specifically, we first leverage GPT-4 to obtain relevant disease
+concepts and compute semantic similarity between these concepts and all
+patches. Secondly, we reduce the weight of irrelevant patches according to the
+semantic similarity between each patch and disease-related concepts. Moreover,
+we construct a graph among tokens based on these concepts and employ a graph
+convolutional network layer to extract the structural information of the graph,
+which is used to prompt the pre-trained multimodal large models for diagnosing
+neurological disorders. Extensive experiments demonstrate that our method
+achieves superior performance for neurological disorder diagnosis compared with
+state-of-the-art methods and validated by clinicians.
+
+
+
+
+
+
+
+ ☆ BSS-Bench: Towards Reproducible and Effective Band Selection Search
+
+
+ The key technology to overcome the drawbacks of hyperspectral imaging
+(expensive, high capture delay, and low spatial resolution) and make it widely
+applicable is to select only a few representative bands from hundreds of bands.
+However, current band selection (BS) methods face challenges in fair
+comparisons due to inconsistent train/validation settings, including the number
+of bands, dataset splits, and retraining settings. To make BS methods easy and
+reproducible, this paper presents the first band selection search benchmark
+(BSS-Bench) containing 52k training and evaluation records of numerous band
+combinations (BC) with different backbones for various hyperspectral analysis
+tasks. The creation of BSS-Bench required a significant computational effort of
+1.26k GPU days. By querying BSS-Bench, BS experiments can be performed easily
+and reproducibly, and the gap between the searched result and the best
+achievable performance can be measured. Based on BSS-Bench, we further discuss
+the impact of various factors on BS, such as the number of bands, unsupervised
+statistics, and different backbones. In addition to BSS-Bench, we present an
+effective one-shot BS method called Single Combination One Shot (SCOS), which
+learns the priority of any BCs through one-time training, eliminating the need
+for repetitive retraining on different BCs. Furthermore, the search process of
+SCOS is flexible and does not require training, making it efficient and
+effective. Our extensive evaluations demonstrate that SCOS outperforms current
+BS methods on multiple tasks, even with much fewer bands. Our BSS-Bench and
+codes are available in the supplementary material and will be publicly
+available.
+
+
+
+ comment: 11 pages,6 figures
+
+
+
+
+
+
+ ☆ CaptainCook4D: A dataset for understanding errors in procedural
+ activities ICML
+
+
+ Following step-by-step procedures is an essential component of various
+activities carried out by individuals in their daily lives. These procedures
+serve as a guiding framework that helps to achieve goals efficiently, whether
+it is assembling furniture or preparing a recipe. However, the complexity and
+duration of procedural activities inherently increase the likelihood of making
+errors. Understanding such procedural activities from a sequence of frames is a
+challenging task that demands an accurate interpretation of visual information
+and the ability to reason about the structure of the activity. To this end, we
+collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings
+(94.5 hours) of people performing recipes in real kitchen environments. This
+dataset consists of two distinct types of activity: one in which participants
+adhere to the provided recipe instructions and another in which they deviate
+and induce errors. We provide 5.3K step annotations and 10K fine-grained action
+annotations and benchmark the dataset for the following tasks: supervised error
+recognition, multistep localization, and procedure learning
+
+
+
+ comment: Accepted to the 2023 International Conference on Machine
+ Learning(ICML) workshop on Data-centric Machine Learning Research(DMLR),
+ Project Page: https://captaincook4d.github.io/captain-cook/
+
+
+
+
+
+
+ ☆ Inclusive normalization of face images to passport format
+
+
+ Face recognition has been used more and more in real world applications in
+recent years. However, when the skin color bias is coupled with intra-personal
+variations like harsh illumination, the face recognition task is more likely to
+fail, even during human inspection. Face normalization methods try to deal with
+such challenges by removing intra-personal variations from an input image while
+keeping the identity the same. However, most face normalization methods can
+only remove one or two variations and ignore dataset biases such as skin color
+bias. The outputs of many face normalization methods are also not realistic to
+human observers. In this work, a style based face normalization model
+(StyleFNM) is proposed to remove most intra-personal variations including large
+changes in pose, bad or harsh illumination, low resolution, blur, facial
+expressions, and accessories like sunglasses among others. The dataset bias is
+also dealt with in this paper by controlling a pretrained GAN to generate a
+balanced dataset of passport-like images. The experimental results show that
+StyleFNM can generate more realistic outputs and can improve significantly the
+accuracy and fairness of face recognition systems.
+
+
+
+
+
+
+
+ ☆ Joint Learning Neuronal Skeleton and Brain Circuit Topology with
+ Permutation Invariant Encoders for Neuron Classification
+
+
+ Determining the types of neurons within a nervous system plays a significant
+role in the analysis of brain connectomics and the investigation of
+neurological diseases. However, the efficiency of utilizing anatomical,
+physiological, or molecular characteristics of neurons is relatively low and
+costly. With the advancements in electron microscopy imaging and analysis
+techniques for brain tissue, we are able to obtain whole-brain connectome
+consisting neuronal high-resolution morphology and connectivity information.
+However, few models are built based on such data for automated neuron
+classification. In this paper, we propose NeuNet, a framework that combines
+morphological information of neurons obtained from skeleton and topological
+information between neurons obtained from neural circuit. Specifically, NeuNet
+consists of three components, namely Skeleton Encoder, Connectome Encoder, and
+Readout Layer. Skeleton Encoder integrates the local information of neurons in
+a bottom-up manner, with a one-dimensional convolution in neural skeleton's
+point data; Connectome Encoder uses a graph neural network to capture the
+topological information of neural circuit; finally, Readout Layer fuses the
+above two information and outputs classification results. We reprocess and
+release two new datasets for neuron classification task from volume electron
+microscopy(VEM) images of human brain cortex and Drosophila brain. Experiments
+on these two datasets demonstrated the effectiveness of our model with accuracy
+of 0.9169 and 0.9363, respectively. Code and data are available at:
+https://github.com/WHUminghui/NeuNet.
+
+
+
+ comment: 18 pages,8 figures,
+
+
+
+
+
+
+ ☆ ViStripformer: A Token-Efficient Transformer for Versatile Video
+ Restoration
+
+
+ Video restoration is a low-level vision task that seeks to restore clean,
+sharp videos from quality-degraded frames. One would use the temporal
+information from adjacent frames to make video restoration successful.
+Recently, the success of the Transformer has raised awareness in the
+computer-vision community. However, its self-attention mechanism requires much
+memory, which is unsuitable for high-resolution vision tasks like video
+restoration. In this paper, we propose ViStripformer (Video Stripformer), which
+utilizes spatio-temporal strip attention to catch long-range data correlations,
+consisting of intra-frame strip attention (Intra-SA) and inter-frame strip
+attention (Inter-SA) for extracting spatial and temporal information. It
+decomposes video frames into strip-shaped features in horizontal and vertical
+directions for Intra-SA and Inter-SA to address degradation patterns with
+various orientations and magnitudes. Besides, ViStripformer is an effective and
+efficient transformer architecture with much lower memory usage than the
+vanilla transformer. Extensive experiments show that the proposed model
+achieves superior results with fast inference time on video restoration tasks,
+including video deblurring, demoireing, and deraining.
+
+
+
+
+
+
+
+
+ Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan
+
+
+ Few-shot object detection (FSOD) benchmarks have advanced techniques for
+detecting new categories with limited annotations. Existing benchmarks
+repurpose well-established datasets like COCO by partitioning categories into
+base and novel classes for pre-training and fine-tuning respectively. However,
+these benchmarks do not reflect how FSOD is deployed in practice. Rather than
+only pre-training on a small number of base categories, we argue that it is
+more practical to fine-tune a foundation model (e.g., a vision-language model
+(VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find
+that zero-shot inference from VLMs like GroundingDINO significantly outperforms
+the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models
+can still be misaligned to target concepts of interest. For example, trailers
+on the web may be different from trailers in the context of autonomous
+vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol
+that evaluates detectors pre-trained on any external datasets and fine-tuned on
+K-shots per target class. Further, we note that current FSOD benchmarks are
+actually federated datasets containing exhaustive annotations for each category
+on a subset of the data. We leverage this insight to propose simple strategies
+for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of
+our approach on LVIS and nuImages, improving over prior work by 5.9 AP.
+
+
+
+
+
+
+
+ ☆ Context Enhanced Transformer for Single Image Object Detection
+
+
+
+
+
+
+
+
+ Seungjun An, Seonghoon Park, Gyeongnyeon Kim, Jeongyeol Baek, Byeongwon Lee, Seungryong Kim
+
+
+ With the increasing importance of video data in real-world applications,
+there is a rising need for efficient object detection methods that utilize
+temporal information. While existing video object detection (VOD) techniques
+employ various strategies to address this challenge, they typically depend on
+locally adjacent frames or randomly sampled images within a clip. Although
+recent Transformer-based VOD methods have shown promising results, their
+reliance on multiple inputs and additional network complexity to incorporate
+temporal information limits their practical applicability. In this paper, we
+propose a novel approach to single image object detection, called Context
+Enhanced TRansformer (CETR), by incorporating temporal context into DETR using
+a newly designed memory module. To efficiently store temporal information, we
+construct a class-wise memory that collects contextual information across data.
+Additionally, we present a classification-based sampling technique to
+selectively utilize the relevant memory for the current image. In the testing,
+We introduce a test-time memory adaptation method that updates individual
+memory functions by considering the test distribution. Experiments with CityCam
+and ImageNet VID datasets exhibit the efficiency of the framework on various
+video systems. The project page and code will be made available at:
+https://ku-cvlab.github.io/CETR.
+
+
+
+ comment: The project page and code will be made available at:
+ https://ku-cvlab.github.io/CETR
+
+
+
+
+
+
+ ☆ Part to Whole: Collaborative Prompting for Surgical Instrument
+ Segmentation
+
+
+ Foundation models like the Segment Anything Model (SAM) have demonstrated
+promise in generic object segmentation. However, directly applying SAM to
+surgical instrument segmentation presents key challenges. First, SAM relies on
+per-frame point-or-box prompts which complicate surgeon-computer interaction.
+Also, SAM yields suboptimal performance on segmenting surgical instruments,
+owing to insufficient surgical data in its pre-training as well as the complex
+structure and fine-grained details of various surgical instruments. To address
+these challenges, in this paper, we investigate text promptable surgical
+instrument segmentation and propose SP-SAM (SurgicalPart-SAM), a novel
+efficient-tuning approach that integrates surgical instrument structure
+knowledge with the generic segmentation knowledge of SAM. Specifically, we
+achieve this by proposing (1) collaborative prompts in the text form "[part
+name] of [instrument category name]" that decompose instruments into
+fine-grained parts; (2) a Cross-Modal Prompt Encoder that encodes text prompts
+jointly with visual embeddings into discriminative part-level representations;
+and (3) a Part-to-Whole Selective Fusion and a Hierarchical Decoding strategy
+that selectively assemble the part-level representations into a whole for
+accurate instrument segmentation. Built upon them, SP-SAM acquires a better
+capability to comprehend surgical instrument structures and distinguish between
+various categories. Extensive experiments on both the EndoVis2018 and
+EndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with
+minimal tunable parameters. Code is at
+https://github.com/wenxi-yue/SurgicalPart-SAM.
+
+
+
+ comment: Technical Report. The source code will be released at
+ https://github.com/wenxi-yue/SurgicalPart-SAM
+
+
+
+
+
+
+ ☆ MonoLSS: Learnable Sample Selection For Monocular 3D Detection
+
+
+ In the field of autonomous driving, monocular 3D detection is a critical task
+which estimates 3D properties (depth, dimension, and orientation) of objects in
+a single RGB image. Previous works have used features in a heuristic way to
+learn 3D properties, without considering that inappropriate features could have
+adverse effects. In this paper, sample selection is introduced that only
+suitable samples should be trained to regress the 3D properties. To select
+samples adaptively, we propose a Learnable Sample Selection (LSS) module, which
+is based on Gumbel-Softmax and a relative-distance sample divider. The LSS
+module works under a warm-up strategy leading to an improvement in training
+stability. Additionally, since the LSS module dedicated to 3D property sample
+selection relies on object-level features, we further develop a data
+augmentation method named MixUp3D to enrich 3D property samples which conforms
+to imaging principles without introducing ambiguity. As two orthogonal methods,
+the LSS module and MixUp3D can be utilized independently or in conjunction.
+Sufficient experiments have shown that their combined use can lead to
+synergistic effects, yielding improvements that transcend the mere sum of their
+individual applications. Leveraging the LSS module and the MixUp3D, without any
+extra data, our method named MonoLSS ranks 1st in all three categories (Car,
+Cyclist, and Pedestrian) on KITTI 3D object detection benchmark, and achieves
+competitive results on both the Waymo dataset and KITTI-nuScenes cross-dataset
+evaluation. The code is included in the supplementary material and will be
+released to facilitate related academic and industrial studies.
+
+
+
+
+
+
+
+
+ Lei Liu, Chenglong Li, Futian Wang, Longfeng Shen, Jin Tang
+
+
+ Cross-modal object tracking is an important research topic in the field of
+information fusion, and it aims to address imaging limitations in challenging
+scenarios by integrating switchable visible and near-infrared modalities.
+However, existing tracking methods face some difficulties in adapting to
+significant target appearance variations in the presence of modality switch.
+For instance, model update based tracking methods struggle to maintain stable
+tracking results during modality switching, leading to error accumulation and
+model drift. Template based tracking methods solely rely on the template
+information from first frame and/or last frame, which lacks sufficient
+representation ability and poses challenges in handling significant target
+appearance changes. To address this problem, we propose a prototype-based
+cross-modal object tracker called ProtoTrack, which introduces a novel
+prototype learning scheme to adapt to significant target appearance variations,
+for cross-modal object tracking. In particular, we design a multi-modal
+prototype to represent target information by multi-kind samples, including a
+fixed sample from the first frame and two representative samples from different
+modalities. Moreover, we develop a prototype generation algorithm based on two
+new modules to ensure the prototype representative in different
+challenges......
+
+
+
+ comment: In Peer Review
+
+
+
+
+
+
+ ☆ FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for
+ Open-Vocabulary 3D Detection AAAI 2024
+
+
+ The superior performances of pre-trained foundation models in various visual
+tasks underscore their potential to enhance the 2D models' open-vocabulary
+ability. Existing methods explore analogous applications in the 3D space.
+However, most of them only center around knowledge extraction from singular
+foundation models, which limits the open-vocabulary ability of 3D models. We
+hypothesize that leveraging complementary pre-trained knowledge from various
+foundation models can improve knowledge transfer from 2D pre-trained visual
+language models to the 3D space. In this work, we propose FM-OV3D, a method of
+Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D
+Detection, which improves the open-vocabulary localization and recognition
+abilities of 3D model by blending knowledge from multiple pre-trained
+foundation models, achieving true open-vocabulary without facing constraints
+from original 3D datasets. Specifically, to learn the open-vocabulary 3D
+localization ability, we adopt the open-vocabulary localization knowledge of
+the Grounded-Segment-Anything model. For open-vocabulary 3D recognition
+ability, We leverage the knowledge of generative foundation models, including
+GPT-3 and Stable Diffusion models, and cross-modal discriminative models like
+CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D
+object detection show that our model efficiently learns knowledge from multiple
+foundation models to enhance the open-vocabulary ability of the 3D model and
+successfully achieves state-of-the-art performance in open-vocabulary 3D object
+detection tasks. Code is released at
+https://github.com/dmzhang0425/FM-OV3D.git.
+
+
+
+ comment: Accepted by AAAI 2024. Code will be released at
+ https://github.com/dmzhang0425/FM-OV3D.git
+
+
+
+
+
+
+ ☆ QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
+
+
+
+
+
+
+
+
+ Pengxiang Ding, Han Zhao, Zhitao Wang, Zhenyu Wei, Shangke Lyu, Donglin Wang
+
+
+ The important manifestation of robot intelligence is the ability to naturally
+interact and autonomously make decisions. Traditional approaches to robot
+control often compartmentalize perception, planning, and decision-making,
+simplifying system design but limiting the synergy between different
+information streams. This compartmentalization poses challenges in achieving
+seamless autonomous reasoning, decision-making, and action execution. To
+address these limitations, a novel paradigm, named Vision-Language-Action tasks
+for QUAdruped Robots (QUAR-VLA), has been introduced in this paper. This
+approach tightly integrates visual information and instructions to generate
+executable actions, effectively merging perception, planning, and
+decision-making. The central idea is to elevate the overall intelligence of the
+robot. Within this framework, a notable challenge lies in aligning fine-grained
+instructions with visual perception information. This emphasizes the complexity
+involved in ensuring that the robot accurately interprets and acts upon
+detailed instructions in harmony with its visual observations. Consequently, we
+propose QUAdruped Robotic Transformer (QUART), a family of VLA models to
+integrate visual information and instructions from diverse modalities as input
+and generates executable actions for real-world robots and present QUAdruped
+Robot Dataset (QUARD), a large-scale multi-task dataset including navigation,
+complex terrain locomotion, and whole-body manipulation tasks for training
+QUART models. Our extensive evaluation (4000 evaluation trials) shows that our
+approach leads to performant robotic policies and enables QUART to obtain a
+range of emergent capabilities.
+
+
+
+
+
+
+
+ ☆ Cross-Modal Object Tracking via Modality-Aware Fusion Network and A
+ Large-Scale Dataset
+
+
+
+
+
+
+
+
+ Lei Liu, Mengya Zhang, Cheng Li, Chenglong Li, Jin Tang
+
+
+ Visual tracking often faces challenges such as invalid targets and decreased
+performance in low-light conditions when relying solely on RGB image sequences.
+While incorporating additional modalities like depth and infrared data has
+proven effective, existing multi-modal imaging platforms are complex and lack
+real-world applicability. In contrast, near-infrared (NIR) imaging, commonly
+used in surveillance cameras, can switch between RGB and NIR based on light
+intensity. However, tracking objects across these heterogeneous modalities
+poses significant challenges, particularly due to the absence of modality
+switch signals during tracking. To address these challenges, we propose an
+adaptive cross-modal object tracking algorithm called Modality-Aware Fusion
+Network (MAFNet). MAFNet efficiently integrates information from both RGB and
+NIR modalities using an adaptive weighting mechanism, effectively bridging the
+appearance gap and enabling a modality-aware target representation. It consists
+of two key components: an adaptive weighting module and a modality-specific
+representation module......
+
+
+
+ comment: In Peer Review
+
+
+
+
+
+
+ ☆ Scalable 3D Reconstruction From Single Particle X-Ray Diffraction Images
+ Based on Online Machine Learning
+
+
+
+
+
+
+
+
+ Jay Shenoy, Axel Levy, Frédéric Poitevin, Gordon Wetzstein
+
+
+ X-ray free-electron lasers (XFELs) offer unique capabilities for measuring
+the structure and dynamics of biomolecules, helping us understand the basic
+building blocks of life. Notably, high-repetition-rate XFELs enable single
+particle imaging (X-ray SPI) where individual, weakly scattering biomolecules
+are imaged under near-physiological conditions with the opportunity to access
+fleeting states that cannot be captured in cryogenic or crystallized
+conditions. Existing X-ray SPI reconstruction algorithms, which estimate the
+unknown orientation of a particle in each captured image as well as its shared
+3D structure, are inadequate in handling the massive datasets generated by
+these emerging XFELs. Here, we introduce X-RAI, an online reconstruction
+framework that estimates the structure of a 3D macromolecule from large X-ray
+SPI datasets. X-RAI consists of a convolutional encoder, which amortizes pose
+estimation over large datasets, as well as a physics-based decoder, which
+employs an implicit neural representation to enable high-quality 3D
+reconstruction in an end-to-end, self-supervised manner. We demonstrate that
+X-RAI achieves state-of-the-art performance for small-scale datasets in
+simulation and challenging experimental settings and demonstrate its
+unprecedented ability to process large datasets containing millions of
+diffraction images in an online fashion. These abilities signify a paradigm
+shift in X-ray SPI towards real-time capture and reconstruction.
+
+
+ Deep neural networks (DNNs) often fail silently with over-confident
+predictions on out-of-distribution (OOD) samples, posing risks in real-world
+deployments. Existing techniques predominantly emphasize either the feature
+representation space or the gradient norms computed with respect to DNN
+parameters, yet they overlook the intricate gradient distribution and the
+topology of classification regions. To address this gap, we introduce
+GRadient-aware Out-Of-Distribution detection in interpolated manifolds (GROOD),
+a novel framework that relies on the discriminative power of gradient space to
+distinguish between in-distribution (ID) and OOD samples. To build this space,
+GROOD relies on class prototypes together with a prototype that specifically
+captures OOD characteristics. Uniquely, our approach incorporates a targeted
+mix-up operation at an early intermediate layer of the DNN to refine the
+separation of gradient spaces between ID and OOD samples. We quantify OOD
+detection efficacy using the distance to the nearest neighbor gradients derived
+from the training set, yielding a robust OOD score. Experimental evaluations
+substantiate that the introduction of targeted input mix-upamplifies the
+separation between ID and OOD in the gradient space, yielding impressive
+results across diverse datasets. Notably, when benchmarked against ImageNet-1k,
+GROOD surpasses the established robustness of state-of-the-art baselines.
+Through this work, we establish the utility of leveraging gradient spaces and
+class prototypes for enhanced OOD detection for DNN in image classification.
+
+
+
+ comment: 11 pages, 5 figures, preprint under review
+
+ Gait recognition is a biometric technology that has received extensive
+attention. Most existing gait recognition algorithms are unimodal, and a few
+multimodal gait recognition algorithms perform multimodal fusion only once.
+None of these algorithms may fully exploit the complementary advantages of the
+multiple modalities. In this paper, by considering the temporal and spatial
+characteristics of gait data, we propose a multi-stage feature fusion strategy
+(MSFFS), which performs multimodal fusions at different stages in the feature
+extraction process. Also, we propose an adaptive feature fusion module (AFFM)
+that considers the semantic association between silhouettes and skeletons. The
+fusion process fuses different silhouette areas with their more related
+skeleton joints. Since visual appearance changes and time passage co-occur in a
+gait period, we propose a multiscale spatial-temporal feature extractor
+(MSSTFE) to learn the spatial-temporal linkage features thoroughly.
+Specifically, MSSTFE extracts and aggregates spatial-temporal linkages
+information at different spatial scales. Combining the strategy and modules
+mentioned above, we propose a multi-stage adaptive feature fusion (MSAFF)
+neural network, which shows state-of-the-art performance in many experiments on
+three datasets. Besides, MSAFF is equipped with feature dimensional pooling (FD
+Pooling), which can significantly reduce the dimension of the gait
+representations without hindering the accuracy.
+https://github.com/ShinanZou/MSAFF
+
+
+
+ comment: This paper has been accepted by IJCB2023
+
+ With extensive face images being shared on social media, there has been a
+notable escalation in privacy concerns. In this paper, we propose AdvCloak, an
+innovative framework for privacy protection using generative models. AdvCloak
+is designed to automatically customize class-wise adversarial masks that can
+maintain superior image-level naturalness while providing enhanced
+feature-level generalization ability. Specifically, AdvCloak sequentially
+optimizes the generative adversarial networks by employing a two-stage training
+strategy. This strategy initially focuses on adapting the masks to the unique
+individual faces via image-specific training and then enhances their
+feature-level generalization ability to diverse facial variations of
+individuals via person-specific training. To fully utilize the limited training
+data, we combine AdvCloak with several general geometric modeling methods, to
+better describe the feature subspace of source identities. Extensive
+quantitative and qualitative evaluations on both common and celebrity datasets
+demonstrate that AdvCloak outperforms existing state-of-the-art methods in
+terms of efficiency and effectiveness.
+
+
+ Gait datasets are essential for gait research. However, this paper observes
+that present benchmarks, whether conventional constrained or emerging
+real-world datasets, fall short regarding covariate diversity. To bridge this
+gap, we undertake an arduous 20-month effort to collect a cross-covariate gait
+recognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6
+million sequences; almost every subject has 33 views and 53 different
+covariates. Compared to existing datasets, CCGR has both population and
+individual-level diversity. In addition, the views and covariates are well
+labeled, enabling the analysis of the effects of different factors. CCGR
+provides multiple types of gait data, including RGB, parsing, silhouette, and
+pose, offering researchers a comprehensive resource for exploration. In order
+to delve deeper into addressing cross-covariate gait recognition, we propose
+parsing-based gait recognition (ParsingGait) by utilizing the newly proposed
+parsing data. We have conducted extensive experiments. Our main results show:
+1) Cross-covariate emerges as a pivotal challenge for practical applications of
+gait recognition. 2) ParsingGait demonstrates remarkable potential for further
+advancement. 3) Alarmingly, existing SOTA methods achieve less than 43%
+accuracy on the CCGR, highlighting the urgency of exploring cross-covariate
+gait recognition. Link: https://github.com/ShinanZou/CCGR.
+
+
+
+ comment: This paper has been accepted by AAAI2024
+
+
+
+
+
+
+ ☆ Unveiling Backbone Effects in CLIP: Exploring Representational Synergies
+ and Variances
+
+
+
+
+
+
+
+
+ Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Ehsan Abbasnejad, Hamed Damirchi, Ignacio M. Jara, Felipe Bravo-Marquez, Anton van den Hengel
+
+
+ Contrastive Language-Image Pretraining (CLIP) stands out as a prominent
+method for image representation learning. Various neural architectures,
+spanning Transformer-based models like Vision Transformers (ViTs) to
+Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve
+as universal backbones across diverse vision tasks. Despite utilizing the same
+data and training objectives, the effectiveness of representations learned by
+these architectures raises a critical question. Our investigation explores the
+differences in CLIP performance among these backbone architectures, revealing
+significant disparities in their classifications. Notably, normalizing these
+representations results in substantial performance variations. Our findings
+showcase a remarkable possible synergy between backbone predictions that could
+reach an improvement of over 20% through informed selection of the appropriate
+backbone. Moreover, we propose a simple, yet effective approach to combine
+predictions from multiple backbones, leading to a notable performance boost of
+up to 6.34\%. We will release the code for reproducing the results.
+
+
+ Although deep learning are commonly employed for image recognition, usually
+huge amount of labeled training data is required, which may not always be
+readily available. This leads to a noticeable performance disparity when
+compared to state-of-the-art unsupervised face verification techniques. In this
+work, we propose a method to narrow this gap by leveraging an autoencoder to
+convert the face image vector into a novel representation. Notably, the
+autoencoder is trained to reconstruct neighboring face image vectors rather
+than the original input image vectors. These neighbor face image vectors are
+chosen through an unsupervised process based on the highest cosine scores with
+the training face image vectors. The proposed method achieves a relative
+improvement of 56\% in terms of EER over the baseline system on Labeled Faces
+in the Wild (LFW) dataset. This has successfully narrowed down the performance
+gap between cosine and PLDA scoring systems.
+
+
+
+
+
+
+
+ ☆ StyleRetoucher: Generalized Portrait Image Retouching with GAN Priors
+
+
+
+
+
+
+
+
+ Wanchao Su, Can Wang, Chen Liu, Hangzhou Han, Hongbo Fu, Jing Liao
+
+
+ Creating fine-retouched portrait images is tedious and time-consuming even
+for professional artists. There exist automatic retouching methods, but they
+either suffer from over-smoothing artifacts or lack generalization ability. To
+address such issues, we present StyleRetoucher, a novel automatic portrait
+image retouching framework, leveraging StyleGAN's generation and generalization
+ability to improve an input portrait image's skin condition while preserving
+its facial details. Harnessing the priors of pretrained StyleGAN, our method
+shows superior robustness: a). performing stably with fewer training samples
+and b). generalizing well on the out-domain data. Moreover, by blending the
+spatial features of the input image and intermediate features of the StyleGAN
+layers, our method preserves the input characteristics to the largest extent.
+We further propose a novel blemish-aware feature selection mechanism to
+effectively identify and remove the skin blemishes, improving the image skin
+condition. Qualitative and quantitative evaluations validate the great
+generalization capability of our method. Further experiments show
+StyleRetoucher's superior performance to the alternative solutions in the image
+retouching task. We also conduct a user perceptive study to confirm the
+superior retouching performance of our method over the existing
+state-of-the-art alternatives.
+
+
+
+ comment: 13 pages, 15 figures
+
+
+
+
+
+
+ ☆ Variance-insensitive and Target-preserving Mask Refinement for
+ Interactive Image Segmentation AAAI2024
+
+
+
+
+
+
+
+
+ Chaowei Fang, Ziyin Zhou, Junye Chen, Hanjing Su, Qingyao Wu, Guanbin Li
+
+
+ Point-based interactive image segmentation can ease the burden of mask
+annotation in applications such as semantic segmentation and image editing.
+However, fully extracting the target mask with limited user inputs remains
+challenging. We introduce a novel method, Variance-Insensitive and
+Target-Preserving Mask Refinement to enhance segmentation quality with fewer
+user inputs. Regarding the last segmentation result as the initial mask, an
+iterative refinement process is commonly employed to continually enhance the
+initial mask. Nevertheless, conventional techniques suffer from sensitivity to
+the variance in the initial mask. To circumvent this problem, our proposed
+method incorporates a mask matching algorithm for ensuring consistent
+inferences from different types of initial masks. We also introduce a
+target-aware zooming algorithm to preserve object information during
+downsampling, balancing efficiency and accuracy. Experiments on GrabCut,
+Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art
+performance in interactive image segmentation.
+
+
+
+
+
+
+
+
+ Yicheng Leng, Chaowei Fang, Gen Li, Yixiang Fang, Guanbin Li
+
+
+ Visible watermarks, while instrumental in protecting image copyrights,
+frequently distort the underlying content, complicating tasks like scene
+interpretation and image editing. Visible watermark removal aims to eliminate
+the interference of watermarks and restore the background content. However,
+existing methods often implement watermark component removal and background
+restoration tasks within a singular branch, leading to residual watermarks in
+the predictions and ignoring cases where watermarks heavily obscure the
+background. To address these limitations, this study introduces the Removing
+Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI
+embodies a two-stage approach: the initial phase centers on discerning and
+segregating the watermark component, while the subsequent phase focuses on
+background content restoration. To achieve meticulous background restoration,
+our proposed model employs a dual-path network capable of fully exploring the
+intrinsic background information beneath semi-transparent watermarks and
+peripheral contextual information from unaffected regions. Moreover, a Global
+and Local Context Interaction module is built upon multi-layer perceptrons and
+bidirectional feature transformation for comprehensive representation modeling
+in the background restoration phase. The efficacy of our approach is
+empirically validated across two large-scale datasets, and our findings reveal
+a marked enhancement over existing watermark removal techniques.
+
+
+ In order to predict a pedestrian's trajectory in a crowd accurately, one has
+to take into account her/his underlying socio-temporal interactions with other
+pedestrians consistently. Unlike existing work that represents the relevant
+information separately, partially, or implicitly, we propose a complete
+representation for it to be fully and explicitly captured and analyzed. In
+particular, we introduce a Directed Acyclic Graph-based structure, which we
+term Socio-Temporal Graph (STG), to explicitly capture pair-wise socio-temporal
+interactions among a group of people across both space and time. Our model is
+built on a time-varying generative process, whose latent variables determine
+the structure of the STGs. We design an attention-based model named STGformer
+that affords an end-to-end pipeline to learn the structure of the STGs for
+trajectory prediction. Our solution achieves overall state-of-the-art
+prediction accuracy in two large-scale benchmark datasets. Our analysis shows
+that a person's past trajectory is critical for predicting another person's
+future path. Our model learns this relationship with a strong notion of
+socio-temporal localities. Statistics show that utilizing this information
+explicitly for prediction yields a noticeable performance gain with respect to
+the trajectory-only approaches.
+
+
+
+
+
+
+
+
+ Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool
+
+
+ Adaptation of semantic segmentation networks to different visual conditions
+is vital for robust perception in autonomous cars and robots. However, previous
+work has shown that most feature-level adaptation methods, which employ
+adversarial training and are validated on synthetic-to-real adaptation, provide
+marginal gains in condition-level adaptation, being outperformed by simple
+pixel-level adaptation via stylization. Motivated by these findings, we propose
+to leverage stylization in performing feature-level adaptation by aligning the
+internal network features extracted by the encoder of the network from the
+original and the stylized view of each input image with a novel feature
+invariance loss. In this way, we encourage the encoder to extract features that
+are already invariant to the style of the input, allowing the decoder to focus
+on parsing these features and not on further abstracting from the specific
+style of the input. We implement our method, named Condition-Invariant Semantic
+Segmentation (CISS), on the current state-of-the-art domain adaptation
+architecture and achieve outstanding results on condition-level adaptation. In
+particular, CISS sets the new state of the art in the popular
+daytime-to-nighttime Cityscapes$\to$Dark Zurich benchmark. Furthermore, our
+method achieves the second-best performance on the normal-to-adverse
+Cityscapes$\to$ACDC benchmark. CISS is shown to generalize well to domains
+unseen during training, such as BDD100K-night. Code is publicly available at
+https://github.com/SysCV/CISS .
+
+
+
+ comment: Submitted for review to IEEE T-PAMI
+
+
+
+
+
+
+ ♻ ☆ UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
+
+
+ Large Language Models (LLMs) have made the ambitious quest for generalist
+agents significantly far from being a fantasy. A key hurdle for building such
+general models is the diversity and heterogeneity of tasks and modalities. A
+promising solution is unification, allowing the support of a myriad of tasks
+and modalities within one unified framework. While few large models (e.g.,
+Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more
+than two modalities, current small to mid-scale unified models are still
+limited to 2 modalities, usually image-text or video-text. The question that we
+ask is: is it possible to build efficiently a unified model that can support
+all modalities? To answer this, we propose UnIVAL, a step further towards this
+ambitious goal. Without relying on fancy datasets sizes or models with billions
+of parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities
+and unifies text, images, video, and audio into a single model. Our model is
+efficiently pretrained on many tasks, based on task balancing and multimodal
+curriculum learning. UnIVAL shows competitive performance to existing
+state-of-the-art approaches, across image and video-text tasks. The feature
+representations learned from image and video-text modalities, allows the model
+to achieve competitive performance when finetuned on audio-text tasks, despite
+not being pretrained on audio. Thanks to the unified model, we propose a novel
+study on multimodal model merging via weight interpolation of models trained on
+different multimodal tasks, showing their benefits in particular for
+out-of-distribution generalization. Finally, we motivate unification by showing
+the synergy between tasks. The model weights and code are released here:
+https://github.com/mshukor/UnIVAL.
+
+
+
+
+
+
+
+ ♻ ☆ Next Steps for Human-Centered Generative AI: A Technical Perspective
+
+
+
+
+
+
+
+
+ Xiang 'Anthony' Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, Bolei Zhou
+
+
+ Through iterative, cross-disciplinary discussions, we define and propose
+next-steps for Human-centered Generative AI (HGAI). We contribute a
+comprehensive research agenda that lays out future directions of Generative AI
+spanning three levels: aligning with human values; assimilating human intents;
+and augmenting human abilities. By identifying these next-steps, we intend to
+draw interdisciplinary research teams to pursue a coherent set of emergent
+ideas in HGAI, focusing on their interested topics while maintaining a coherent
+big picture of the future work landscape.
+
+
+
+
+
+
+
+
+ Yuming Gu, You Xie, Hongyi Xu, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Linjie Luo
+
+
+ We present DiffPortrait3D, a conditional diffusion model that is capable of
+synthesizing 3D-consistent photo-realistic novel views from as few as a single
+in-the-wild portrait. Specifically, given a single RGB input, we aim to
+synthesize plausible but consistent facial details rendered from novel camera
+views with retained both identity and facial expression. In lieu of
+time-consuming optimization and fine-tuning, our zero-shot method generalizes
+well to arbitrary face portraits with unposed camera views, extreme facial
+expressions, and diverse artistic depictions. At its core, we leverage the
+generative prior of 2D diffusion models pre-trained on large-scale image
+datasets as our rendering backbone, while the denoising is guided with
+disentangled attentive control of appearance and camera pose. To achieve this,
+we first inject the appearance context from the reference image into the
+self-attention layers of the frozen UNets. The rendering view is then
+manipulated with a novel conditional control module that interprets the camera
+pose by watching a condition image of a crossed subject from the same view.
+Furthermore, we insert a trainable cross-view attention module to enhance view
+consistency, which is further strengthened with a novel 3D-aware noise
+generation process during inference. We demonstrate state-of-the-art results
+both qualitatively and quantitatively on our challenging in-the-wild and
+multi-view benchmarks.
+
+
+
+
+
+
+
+ ♻ ☆ OsmLocator: locating overlapping scatter marks with a non-training
+ generative perspective
+
+
+
+
+
+
+
+
+ Yuming Qiu, Aleksandra Pizurica, Qi Ming, Nicolas Nadisic
+
+
+ Automated mark localization in scatter images, greatly helpful for
+discovering knowledge and understanding enormous document images and reasoning
+in visual question answering AI systems, is a highly challenging problem
+because of the ubiquity of overlapping marks. Locating overlapping marks faces
+many difficulties such as no texture, less contextual information, hallow shape
+and tiny size. Here, we formulate it as a combinatorial optimization problem on
+clustering-based re-visualization from a non-training generative perspective,
+to locate scatter marks by finding the status of multi-variables when an
+objective function reaches a minimum. The objective function is constructed on
+difference between binarized scatter images and corresponding generated
+re-visualization based on their clustering. Fundamentally, re-visualization
+tries to generate a new scatter graph only taking a rasterized scatter image as
+an input, and clustering is employed to provide the information for such
+re-visualization. This method could stably locate severely-overlapping,
+variable-size and variable-shape marks in scatter images without dependence of
+any training dataset or reference. Meanwhile, we propose an adaptive variant of
+simulated annealing which can works on various connected regions. In addition,
+we especially built a dataset named SML2023 containing hundreds of scatter
+images with different markers and various levels of overlapping severity, and
+tested the proposed method and compared it to existing methods. The results
+show that it can accurately locate most marks in scatter images with different
+overlapping severity and marker types, with about 0.3 absolute increase on an
+assignment-cost-based metric in comparison with state-of-the-art methods. This
+work is of value to data mining on massive web pages and literatures, and
+shedding new light on image measurement such as bubble counting.
+
+
+
+ comment: 22pages
+
+
+
+
+
+
+ ♻ ☆ Differentiable JPEG: The Devil is in the Details WACV 2024
+
+
+
+
+
+
+
+
+ Christoph Reich, Biplob Debnath, Deep Patel, Srimat Chakradhar
+
+
+ JPEG remains one of the most widespread lossy image coding methods. However,
+the non-differentiable nature of JPEG restricts the application in deep
+learning pipelines. Several differentiable approximations of JPEG have recently
+been proposed to address this issue. This paper conducts a comprehensive review
+of existing diff. JPEG approaches and identifies critical details that have
+been missed by previous methods. To this end, we propose a novel diff. JPEG
+approach, overcoming previous limitations. Our approach is differentiable
+w.r.t. the input image, the JPEG quality, the quantization tables, and the
+color conversion parameters. We evaluate the forward and backward performance
+of our diff. JPEG approach against existing methods. Additionally, extensive
+ablations are performed to evaluate crucial design choices. Our proposed diff.
+JPEG resembles the (non-diff.) reference implementation best, significantly
+surpassing the recent-best diff. approach by $3.47$dB (PSNR) on average. For
+strong compression rates, we can even improve PSNR by $9.51$dB. Strong
+adversarial attack results are yielded by our diff. JPEG, demonstrating the
+effective gradient approximation. Our code is available at
+https://github.com/necla-ml/Diff-JPEG.
+
+
+
+
+
+
+
+ ♻ ☆ Q-Segment: Segmenting Images In-Sensor for Vessel-Based Medical
+ Diagnosis
+
+
+
+
+
+
+
+
+ Pietro Bonazzi, Julian Moosmann, Yawei Li, Sizhen Bian, Michele Magno
+
+
+ This paper addresses the growing interest in deploying deep learning models
+directly in-sensor. We present "Q-Segment", a quantized real-time segmentation
+algorithm, and conduct a comprehensive evaluation on a low-power edge vision
+platform with an in-sensors processor, the Sony IMX500. One of the main goals
+of the model is to achieve end-to-end image segmentation for vessel-based
+medical diagnosis. Deployed on the IMX500 platform, Q-Segment achieves
+ultra-low inference time in-sensor only 0.23 ms and power consumption of only
+72mW. We compare the proposed network with state-of-the-art models, both float
+and quantized, demonstrating that the proposed solution outperforms existing
+networks on various platforms in computing efficiency, e.g., by a factor of 75x
+compared to ERFNet. The network employs an encoder-decoder structure with skip
+connections, and results in a binary accuracy of 97.25% and an Area Under the
+Receiver Operating Characteristic Curve (AUC) of 96.97% on the CHASE dataset.
+We also present a comparison of the IMX500 processing core with the Sony
+Spresense, a low-power multi-core ARM Cortex-M microcontroller, and a
+single-core ARM Cortex-M4 showing that it can achieve in-sensor processing with
+end-to-end low latency (17 ms) and power concumption (254mW). This research
+contributes valuable insights into edge-based image segmentation, laying the
+foundation for efficient algorithms tailored to low-power environments.
+
+
+
+
+
+
+
+ ♻ ☆ AutoNeRF: Training Implicit Scene Representations with Autonomous Agents
+
+
+ Implicit representations such as Neural Radiance Fields (NeRF) have been
+shown to be very effective at novel view synthesis. However, these models
+typically require manual and careful human data collection for training. In
+this paper, we present AutoNeRF, a method to collect data required to train
+NeRFs using autonomous embodied agents. Our method allows an agent to explore
+an unseen environment efficiently and use the experience to build an implicit
+map representation autonomously. We compare the impact of different exploration
+strategies including handcrafted frontier-based exploration, end-to-end and
+modular approaches composed of trained high-level planners and classical
+low-level path followers. We train these models with different reward functions
+tailored to this problem and evaluate the quality of the learned
+representations on four different downstream tasks: classical viewpoint
+rendering, map reconstruction, planning, and pose refinement. Empirical results
+show that NeRFs can be trained on actively collected data using just a single
+episode of experience in an unseen environment, and can be used for several
+downstream robotic tasks, and that modular trained exploration models
+outperform other classical and end-to-end baselines. Finally, we show that
+AutoNeRF can reconstruct large-scale scenes, and is thus a useful tool to
+perform scene-specific adaptation as the produced 3D environment models can be
+loaded into a simulator to fine-tune a policy of interest.
+
+
+ Cross-modal Retrieval methods build similarity relations between vision and
+language modalities by jointly learning a common representation space. However,
+the predictions are often unreliable due to the Aleatoric uncertainty, which is
+induced by low-quality data, e.g., corrupt images, fast-paced videos, and
+non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric
+Uncertainty Quantification (PAU) framework to provide trustworthy predictions
+by quantifying the uncertainty arisen from the inherent data ambiguity.
+Concretely, we first construct a set of various learnable prototypes for each
+modality to represent the entire semantics subspace. Then Dempster-Shafer
+Theory and Subjective Logic Theory are utilized to build an evidential
+theoretical framework by associating evidence with Dirichlet Distribution
+parameters. The PAU model induces accurate uncertainty and reliable predictions
+for cross-modal retrieval. Extensive experiments are performed on four major
+benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the
+effectiveness of our method. The code is accessible at
+https://github.com/leolee99/PAU.
+
+
+
+ comment: Accepted to NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ DG-TTA: Out-of-domain medical image segmentation through Domain
+ Generalization and Test-Time Adaptation
+
+
+
+
+
+
+
+
+ Christian Weihsbach, Christian N. Kruse, Alexander Bigalke, Mattias P. Heinrich
+
+
+ Applying pre-trained medical segmentation models on out-of-domain images
+often yields predictions of insufficient quality. Several strategies have been
+proposed to maintain model performance, such as finetuning or unsupervised- and
+source-free domain adaptation. These strategies set restrictive requirements
+for data availability. In this study, we propose to combine domain
+generalization and test-time adaptation to create a highly effective approach
+for reusing pre-trained models in unseen target domains. Domain-generalized
+pre-training on source data is used to obtain the best initial performance in
+the target domain. We introduce the MIND descriptor previously used in image
+registration tasks as a further technique to achieve generalization and present
+superior performance for small-scale datasets compared to existing approaches.
+At test-time, high-quality segmentation for every single unseen scan is ensured
+by optimizing the model weights for consistency given different image
+augmentations. That way, our method enables separate use of source and target
+data and thus removes current data availability barriers. Moreover, the
+presented method is highly modular as it does not require specific model
+architectures or prior knowledge of involved domains and labels. We demonstrate
+this by integrating it into the nnUNet, which is currently the most popular and
+accurate framework for medical image segmentation. We employ multiple datasets
+covering abdominal, cardiac, and lumbar spine scans and compose several
+out-of-domain scenarios in this study. We demonstrate that our method, combined
+with pre-trained whole-body CT models, can effectively segment MR images with
+high accuracy in all of the aforementioned scenarios. Open-source code can be
+found here: https://github.com/multimodallearning/DG-TTA
+
+
+
+ comment: This work has been submitted to the IEEE for possible publication.
+ Copyright may be transferred without notice, after which this version may no
+ longer be accessible
+
+
+
+
+
+
+ ♻ ☆ Sketch Beautification: Learning Part Beautification and Structure
+ Refinement for Sketches of Man-made Objects
+
+
+
+
+
+
+
+
+ Deng Yu, Manfred Lau, Lin Gao, Hongbo Fu
+
+
+ We present a novel freehand sketch beautification method, which takes as
+input a freely drawn sketch of a man-made object and automatically beautifies
+it both geometrically and structurally. Beautifying a sketch is challenging
+because of its highly abstract and heavily diverse drawing manner. Existing
+methods are usually confined to the distribution of their limited training
+samples and thus cannot beautify freely drawn sketches with rich variations. To
+address this challenge, we adopt a divide-and-combine strategy. Specifically,
+we first parse an input sketch into semantic components, beautify individual
+components by a learned part beautification module based on part-level implicit
+manifolds, and then reassemble the beautified components through a structure
+beautification module. With this strategy, our method can go beyond the
+training samples and handle novel freehand sketches. We demonstrate the
+effectiveness of our system with extensive experiments and a perceptive study.
+
+
+
+ comment: Accepted by IEEE Transactions on Visualization and Computer Graphics
+
+
+
+
+
+
+ ♻ ☆ Self-SupervisedPre-Training Boosts Semantic Scene Segmentation on LiDAR
+ Data
+
+
+ Airborne LiDAR systems have the capability to capture the Earth's surface by
+generating extensive point cloud data comprised of points mainly defined by 3D
+coordinates. However, labeling such points for supervised learning tasks is
+time-consuming. As a result, there is a need to investigate techniques that can
+learn from unlabeled data to significantly reduce the number of annotated
+samples. In this work, we propose to train a self-supervised encoder with
+Barlow Twins and use it as a pre-trained network in the task of semantic scene
+segmentation. The experimental results demonstrate that our unsupervised
+pre-training boosts performance once fine-tuned on the supervised task,
+especially for under-represented categories.
+
+
+
+ comment: International conference Machine Vision Applications 2023
+
+
+
+
+
+
+ ♻ ☆ Investigating the Corruption Robustness of Image Classifiers with Random
+ Lp-norm Corruptions
+
+
+ Robustness is a fundamental property of machine learning classifiers required
+to achieve safety and reliability. In the field of adversarial robustness of
+image classifiers, robustness is commonly defined as the stability of a model
+to all input changes within a p-norm distance. However, in the field of random
+corruption robustness, variations observed in the real world are used, while
+p-norm corruptions are rarely considered. This study investigates the use of
+random p-norm corruptions to augment the training and test data of image
+classifiers. We evaluate the model robustness against imperceptible random
+p-norm corruptions and propose a novel robustness metric. We empirically
+investigate whether robustness transfers across different p-norms and derive
+conclusions on which p-norm corruptions a model should be trained and
+evaluated. We find that training data augmentation with a combination of p-norm
+corruptions significantly improves corruption robustness, even on top of
+state-of-the-art data augmentation schemes.
+
+
+
+ comment: Camera-ready version submitted to VISAPP 2024
+
+
+
+
+
+
+ ♻ ☆ S.T.A.R.-Track: Latent Motion Models for End-to-End 3D Object Tracking
+ with Adaptive Spatio-Temporal Appearance Representations
+
+
+
+
+
+
+
+
+ Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Markus Enzweiler, Hendrik P. A. Lensch
+
+
+ Following the tracking-by-attention paradigm, this paper introduces an
+object-centric, transformer-based framework for tracking in 3D. Traditional
+model-based tracking approaches incorporate the geometric effect of object- and
+ego motion between frames with a geometric motion model. Inspired by this, we
+propose S.T.A.R.-Track, which uses a novel latent motion model (LMM) to
+additionally adjust object queries to account for changes in viewing direction
+and lighting conditions directly in the latent space, while still modeling the
+geometric motion explicitly. Combined with a novel learnable track embedding
+that aids in modeling the existence probability of tracks, this results in a
+generic tracking framework that can be integrated with any query-based
+detector. Extensive experiments on the nuScenes benchmark demonstrate the
+benefits of our approach, showing \ac{sota} performance for DETR3D-based
+trackers while drastically reducing the number of identity switches of tracks
+at the same time.
+
+
+
+ comment: \c{opyright} 2023 IEEE. Personal use of this material is permitted.
+ Permission from IEEE must be obtained for all other uses, in any current or
+ future media, including reprinting/republishing this material for advertising
+ or promotional purposes, creating new collective works, for resale or
+ redistribution to servers or lists, or reuse of any copyrighted component of
+ this work in other works
+
+
+
+
+
+
+ ♻ ☆ On-the-Fly Guidance Training for Medical Image Registration
+
+
+
+
+
+
+
+
+ Yicheng Chen, Shengxiang Ji, Yuelin Xin, Kun Han, Xiaohui Xie
+
+
+ This research explores a novel approach in the realm of learning-based image
+registration, addressing the limitations inherent in weakly-supervised and
+unsupervised methods. Weakly-supervised techniques depend heavily on scarce
+labeled data, while unsupervised strategies rely on indirect measures of
+accuracy through image similarity. Notably, traditional supervised learning is
+not utilized due to the lack of precise deformation ground-truth in medical
+imaging. Our study introduces a unique training framework with On-the-Fly
+Guidance (OFG) to enhance existing models. This framework, during training,
+generates pseudo-ground truth a few steps ahead by refining the current
+deformation prediction with our custom optimizer. This pseudo-ground truth then
+serves to directly supervise the model in a supervised learning context. The
+process involves optimizing the predicted deformation with a limited number of
+steps, ensuring training efficiency and setting achievable goals for each
+training phase. OFG notably boosts the precision of existing image registration
+techniques while maintaining the speed of learning-based methods. We assessed
+our approach using various pseudo-ground truth generation strategies, including
+predictions and optimized outputs from established registration models. Our
+experiments spanned three benchmark datasets and three cutting-edge models,
+with OFG demonstrating significant and consistent enhancements, surpassing
+previous state-of-the-arts in the field. OFG offers an easily integrable
+plug-and-play solution to enhance the training effectiveness of learning-based
+image registration models. Code at
+https://github.com/miraclefactory/on-the-fly-guidance.
+
+
+ Analyzing keystroke dynamics (KD) for biometric verification has several
+advantages: it is among the most discriminative behavioral traits; keyboards
+are among the most common human-computer interfaces, being the primary means
+for users to enter textual data; its acquisition does not require additional
+hardware, and its processing is relatively lightweight; and it allows for
+transparently recognizing subjects. However, the heterogeneity of experimental
+protocols and metrics, and the limited size of the databases adopted in the
+literature impede direct comparisons between different systems, thus
+representing an obstacle in the advancement of keystroke biometrics. To
+alleviate this aspect, we present a new experimental framework to benchmark
+KD-based biometric verification performance and fairness based on tweet-long
+sequences of variable transcript text from over 185,000 subjects, acquired
+through desktop and mobile keyboards, extracted from the Aalto Keystroke
+Databases. The framework runs on CodaLab in the form of the Keystroke
+Verification Challenge (KVC). Moreover, we also introduce a novel fairness
+metric, the Skewed Impostor Ratio (SIR), to capture inter- and
+intra-demographic group bias patterns in the verification scores. We
+demonstrate the usefulness of the proposed framework by employing two
+state-of-the-art keystroke verification systems, TypeNet and TypeFormer, to
+compare different sets of input features, achieving a less privacy-invasive
+system, by discarding the analysis of text content (ASCII codes of the keys
+pressed) in favor of extended features in the time domain. Our experiments show
+that this approach allows to maintain satisfactory performance.
+
+
+
+ comment: 13 pages, 4 figure, 5 pages
+
+
+
+
+
+
+ ♻ ☆ Scene Text Image Super-resolution based on Text-conditional Diffusion
+ Models WACV 2024
+
+
+ Scene Text Image Super-resolution (STISR) has recently achieved great success
+as a preprocessing method for scene text recognition. STISR aims to transform
+blurred and noisy low-resolution (LR) text images in real-world settings into
+clear high-resolution (HR) text images suitable for scene text recognition. In
+this study, we leverage text-conditional diffusion models (DMs), known for
+their impressive text-to-image synthesis capabilities, for STISR tasks. Our
+experimental results revealed that text-conditional DMs notably surpass
+existing STISR methods. Especially when texts from LR text images are given as
+input, the text-conditional DMs are able to produce superior quality
+super-resolution text images. Utilizing this capability, we propose a novel
+framework for synthesizing LR-HR paired text image datasets. This framework
+consists of three specialized text-conditional DMs, each dedicated to text
+image synthesis, super-resolution, and image degradation. These three modules
+are vital for synthesizing distinct LR and HR paired images, which are more
+suitable for training STISR methods. Our experiments confirmed that these
+synthesized image pairs significantly enhance the performance of STISR methods
+in the TextZoom evaluation.
+
+
+
+ comment: WACV 2024
+
+
+
+
+
+
+ ♻ ☆ SeasFire as a Multivariate Earth System Datacube for Wildfire Dynamics
+
+
+
+
+
+
+
+
+ Ilektra Karasante, Lazaro Alonso, Ioannis Prapas, Akanksha Ahuja, Nuno Carvalhais, Ioannis Papoutsis
+
+
+ The global occurrence, scale, and frequency of wildfires pose significant
+threats to ecosystem services and human livelihoods. To effectively quantify
+and attribute the antecedent conditions for wildfires, a thorough understanding
+of Earth system dynamics is imperative. In response, we introduce the SeasFire
+datacube, a meticulously curated spatiotemporal dataset tailored for global
+sub-seasonal to seasonal wildfire modeling via Earth observation. The SeasFire
+datacube comprises of 59 variables encompassing climate, vegetation, oceanic
+indices, and human factors, has an 8-day temporal resolution and a spatial
+resolution of 0.25$^{\circ}$, and spans from 2001 to 2021. We showcase the
+versatility of SeasFire for exploring the variability and seasonality of
+wildfire drivers, modeling causal links between ocean-climate teleconnections
+and wildfires, and predicting sub-seasonal wildfire patterns across multiple
+timescales with a Deep Learning model. We publicly release the SeasFire
+datacube and appeal to Earth system scientists and Machine Learning
+practitioners to use it for an improved understanding and anticipation of
+wildfires.
+
+
+ Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of
+neural networks to capture the intricacies of 3D objects. By encoding the shape
+and color information within neural network weights, NeRFs excel at producing
+strikingly sharp novel views of 3D objects. Recently, numerous generalizations
+of NeRFs utilizing generative models have emerged, expanding its versatility.
+In contrast, Gaussian Splatting (GS) offers a similar renders quality with
+faster training and inference as it does not need neural networks to work. We
+encode information about the 3D objects in the set of Gaussian distributions
+that can be rendered in 3D similarly to classical meshes. Unfortunately, GS are
+difficult to condition since they usually require circa hundred thousand
+Gaussian components. To mitigate the caveats of both models, we propose a
+hybrid model that uses GS representation of the 3D object's shape and
+NeRF-based encoding of color and opacity. Our model uses Gaussian distributions
+with trainable positions (i.e. means of Gaussian), shape (i.e. covariance of
+Gaussian), color and opacity, and neural network, which takes parameters of
+Gaussian and viewing direction to produce changes in color and opacity.
+Consequently, our model better describes shadows, light reflections, and
+transparency of 3D objects.
+
+
+
+
+
+
+
+
+ Zimin Xia, Olaf Booij, Julian F. P. Kooij
+
+
+ We propose a novel end-to-end method for cross-view pose estimation. Given a
+ground-level query image and an aerial image that covers the query's local
+neighborhood, the 3 Degrees-of-Freedom camera pose of the query is estimated by
+matching its image descriptor to descriptors of local regions within the aerial
+image. The orientation-aware descriptors are obtained by using a
+translationally equivariant convolutional ground image encoder and contrastive
+learning. The Localization Decoder produces a dense probability distribution in
+a coarse-to-fine manner with a novel Localization Matching Upsampling module. A
+smaller Orientation Decoder produces a vector field to condition the
+orientation estimate on the localization. Our method is validated on the VIGOR
+and KITTI datasets, where it surpasses the state-of-the-art baseline by 72% and
+36% in median localization error for comparable orientation estimation
+accuracy. The predicted probability distribution can represent localization
+ambiguity, and enables rejecting possible erroneous predictions. Without
+re-training, the model can infer on ground images with different field of views
+and utilize orientation priors if available. On the Oxford RobotCar dataset,
+our method can reliably estimate the ego-vehicle's pose over time, achieving a
+median localization error under 1 meter and a median orientation error of
+around 1 degree at 14 FPS.
+
+
+
+
+
+
+
+ ♻ ☆ Self-distillation Regularized Connectionist Temporal Classification Loss
+ for Text Recognition: A Simple Yet Effective Approach AAAI2024
+
+
+ Text recognition methods are gaining rapid development. Some advanced
+techniques, e.g., powerful modules, language models, and un- and
+semi-supervised learning schemes, consecutively push the performance on public
+benchmarks forward. However, the problem of how to better optimize a text
+recognition model from the perspective of loss functions is largely overlooked.
+CTC-based methods, widely used in practice due to their good balance between
+performance and inference speed, still grapple with accuracy degradation. This
+is because CTC loss emphasizes the optimization of the entire sequence target
+while neglecting to learn individual characters. We propose a self-distillation
+scheme for CTC-based model to address this issue. It incorporates a framewise
+regularization term in CTC loss to emphasize individual supervision, and
+leverages the maximizing-a-posteriori of latent alignment to solve the
+inconsistency problem that arises in distillation between CTC-based models. We
+refer to the regularized CTC loss as Distillation Connectionist Temporal
+Classification (DCTC) loss. DCTC loss is module-free, requiring no extra
+parameters, longer inference lag, or additional training data or phases.
+Extensive experiments on public benchmarks demonstrate that DCTC can boost text
+recognition model accuracy by up to 2.6%, without any of these drawbacks.
+
+
+
+ comment: Ziyin Zhang and Ning Lu are co-first authors. Accepted by AAAI2024.
+ Repo: https://github.com/zzyhlyoko/DCTC
+
+
+
+
+
+
+ ♻ ☆ Review of AlexNet for Medical Image Classification
+
+
+ In recent years, the rapid development of deep learning has led to a wide
+range of applications in the field of medical image classification. The
+variants of neural network models with ever-increasing performance share some
+commonalities: to try to mitigate overfitting, improve generalization, avoid
+gradient vanishing and exploding, etc. AlexNet first utilizes the dropout
+technique to mitigate overfitting and the ReLU activation function to avoid
+gradient vanishing. Therefore, we focus our discussion on AlexNet, which has
+contributed greatly to the development of CNNs in 2012. After reviewing over 40
+papers, including journal papers and conference papers, we give a narrative on
+the technical details, advantages, and application areas of AlexNet.
+
+
+ State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS)
+using image-level labels exhibit severe performance degradation on driving
+scene datasets such as Cityscapes. To address this challenge, we develop a new
+WSSS framework tailored to driving scene datasets. Based on extensive analysis
+of dataset characteristics, we employ Contrastive Language-Image Pre-training
+(CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key
+challenges: (1) pseudo-masks from CLIP lack in representing small object
+classes, and (2) these masks contain notable noise. We propose solutions for
+each issue as follows. (1) We devise Global-Local View Training that seamlessly
+incorporates small-scale patches during model training, thereby enhancing the
+model's capability to handle small-sized yet critical objects in driving scenes
+(e.g., traffic light). (2) We introduce Consistency-Aware Region Balancing
+(CARB), a novel technique that discerns reliable and noisy regions through
+evaluating the consistency between CLIP masks and segmentation predictions. It
+prioritizes reliable pixels over noisy pixels via adaptive loss weighting.
+Notably, the proposed method achieves 51.8\% mIoU on the Cityscapes test
+dataset, showcasing its potential as a strong WSSS baseline on driving scene
+datasets. Experimental results on CamVid and WildDash2 demonstrate the
+effectiveness of our method across diverse datasets, even with small-scale
+datasets or visually challenging conditions. The code is available at
+https://github.com/k0u-id/CARB.
+
+
+
+ comment: AAAI 2024 accepted. First two authors contributed equally
+
+
+
+
+
+
+ ♻ ☆ Backdoor Attack with Sparse and Invisible Trigger
+
+
+ Deep neural networks (DNNs) are vulnerable to backdoor attacks, where the
+adversary manipulates a small portion of training data such that the victim
+model predicts normally on the benign samples but classifies the triggered
+samples as the target class. The backdoor attack is an emerging yet threatening
+training-phase threat, leading to serious risks in DNN-based applications. In
+this paper, we revisit the trigger patterns of existing backdoor attacks. We
+reveal that they are either visible or not sparse and therefore are not
+stealthy enough. More importantly, it is not feasible to simply combine
+existing methods to design an effective sparse and invisible backdoor attack.
+To address this problem, we formulate the trigger generation as a bi-level
+optimization problem with sparsity and invisibility constraints and propose an
+effective method to solve it. The proposed method is dubbed sparse and
+invisible backdoor attack (SIBA). We conduct extensive experiments on benchmark
+datasets under different settings, which verify the effectiveness of our attack
+and its resistance to existing backdoor defenses. The codes for reproducing
+main experiments are available at \url{https://github.com/YinghuaGao/SIBA}.
+
+
+
+ comment: The first two authors contributed equally to this work. 13 pages
+
+
+
+
+
+
+ ♻ ☆ Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models
+
+
+ This paper presents Paint3D, a novel coarse-to-fine generative framework that
+is capable of producing high-resolution, lighting-less, and diverse 2K UV
+texture maps for untextured 3D meshes conditioned on text or image inputs. The
+key challenge addressed is generating high-quality textures without embedded
+illumination information, which allows the textures to be re-lighted or
+re-edited within modern graphics pipelines. To achieve this, our method first
+leverages a pre-trained depth-aware 2D diffusion model to generate
+view-conditional images and perform multi-view texture fusion, producing an
+initial coarse texture map. However, as 2D models cannot fully represent 3D
+shapes and disable lighting effects, the coarse texture map exhibits incomplete
+areas and illumination artifacts. To resolve this, we train separate UV
+Inpainting and UVHD diffusion models specialized for the shape-aware refinement
+of incomplete areas and the removal of illumination artifacts. Through this
+coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that
+maintain semantic consistency while being lighting-less, significantly
+advancing the state-of-the-art in texturing 3D objects.
+
+
+
+
+
+
+
+
+ Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, Lan Xu
+
+
+ We have recently seen tremendous progress in diffusion advances for
+generating realistic human motions. Yet, they largely disregard the multi-human
+interactions. In this paper, we present InterGen, an effective diffusion-based
+approach that incorporates human-to-human interactions into the motion
+diffusion process, which enables layman users to customize high-quality
+two-person interaction motions, with only text guidance. We first contribute a
+multimodal dataset, named InterHuman. It consists of about 107M frames for
+diverse two-person interactions, with accurate skeletal motions and 23,337
+natural language descriptions. For the algorithm side, we carefully tailor the
+motion diffusion model to our two-person interaction setting. To handle the
+symmetry of human identities during interactions, we propose two cooperative
+transformer-based denoisers that explicitly share weights, with a mutual
+attention mechanism to further connect the two denoising processes. Then, we
+propose a novel representation for motion input in our interaction diffusion
+model, which explicitly formulates the global relations between the two
+performers in the world frame. We further introduce two novel regularization
+terms to encode spatial relations, equipped with a corresponding damping scheme
+during the training of our interaction diffusion model. Extensive experiments
+validate the effectiveness and generalizability of InterGen. Notably, it can
+generate more diverse and compelling two-person motions than previous methods
+and enables various downstream applications for human interactions.
+
+
+ In semi-supervised domain adaptation (SSDA), a few labeled target samples of
+each class help the model to transfer knowledge representation from the fully
+labeled source domain to the target domain. Many existing methods ignore the
+benefits of making full use of the labeled target samples from multi-level. To
+make better use of this additional data, we propose a novel Prototype-based
+Multi-level Learning (ProML) framework to better tap the potential of labeled
+target samples. To achieve intra-domain adaptation, we first introduce a
+pseudo-label aggregation based on the intra-domain optimal transport to help
+the model align the feature distribution of unlabeled target samples and the
+prototype. At the inter-domain level, we propose a cross-domain alignment loss
+to help the model use the target prototype for cross-domain knowledge transfer.
+We further propose a dual consistency based on prototype similarity and linear
+classifier to promote discriminative learning of compact target feature
+representation at the batch level. Extensive experiments on three datasets,
+including DomainNet, VisDA2017, and Office-Home demonstrate that our proposed
+method achieves state-of-the-art performance in SSDA.
+
+
+
+ comment: IJCAI 2023. To avoid confusion, update to a more complete version
+
+
+
+
+
+
+
+ Han Huang, Yulun Wu, Junsheng Zhou, Ge Gao, Ming Gu, Yu-Shen Liu
+
+
+ Recently, neural implicit functions have demonstrated remarkable results in
+the field of multi-view reconstruction. However, most existing methods are
+tailored for dense views and exhibit unsatisfactory performance when dealing
+with sparse views. Several latest methods have been proposed for generalizing
+implicit reconstruction to address the sparse view reconstruction task, but
+they still suffer from high training costs and are merely valid under carefully
+selected perspectives. In this paper, we propose a novel sparse view
+reconstruction framework that leverages on-surface priors to achieve highly
+faithful surface reconstruction. Specifically, we design several constraints on
+global geometry alignment and local geometry refinement for jointly optimizing
+coarse shapes and fine details. To achieve this, we train a neural network to
+learn a global implicit field from the on-surface points obtained from SfM and
+then leverage it as a coarse geometric constraint. To exploit local geometric
+consistency, we project on-surface points onto seen and unseen views, treating
+the consistent loss of projected features as a fine geometric constraint. The
+experimental results with DTU and BlendedMVS datasets in two prevalent sparse
+settings demonstrate significant improvements over the state-of-the-art
+methods.
+
+
+
+
+
+
+
+
+ Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
+
+
+ Recent advancements in large language models (LLMs) have led to the creation
+of intelligent agents capable of performing complex tasks. This paper
+introduces a novel LLM-based multimodal agent framework designed to operate
+smartphone applications. Our framework enables the agent to operate smartphone
+applications through a simplified action space, mimicking human-like
+interactions such as tapping and swiping. This novel approach bypasses the need
+for system back-end access, thereby broadening its applicability across diverse
+apps. Central to our agent's functionality is its innovative learning method.
+The agent learns to navigate and use new apps either through autonomous
+exploration or by observing human demonstrations. This process generates a
+knowledge base that the agent refers to for executing complex tasks across
+different applications. To demonstrate the practicality of our agent, we
+conducted extensive testing over 50 tasks in 10 different applications,
+including social media, email, maps, shopping, and sophisticated image editing
+tools. The results affirm our agent's proficiency in handling a diverse array
+of high-level tasks.
+
+
+
+ comment: Project Page is https://appagent-official.github.io/
+
+
+
+
+
+
+ ♻ ☆ MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using
+ Differentiable Shading
+
+
+
+
+
+
+
+
+ Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amin Fadaeinejad, Rafael M. O. Cruz, Marc-Andre Carbonneau
+
+
+ Reconstructing an avatar from a portrait image has many applications in
+multimedia, but remains a challenging research problem. Extracting reflectance
+maps and geometry from one image is ill-posed: recovering geometry is a
+one-to-many mapping problem and reflectance and light are difficult to
+disentangle. Accurate geometry and reflectance can be captured under the
+controlled conditions of a light stage, but it is costly to acquire large
+datasets in this fashion. Moreover, training solely with this type of data
+leads to poor generalization with in-the-wild images. This motivates the
+introduction of MoSAR, a method for 3D avatar generation from monocular images.
+We propose a semi-supervised training scheme that improves generalization by
+learning from both light stage and in-the-wild datasets. This is achieved using
+a novel differentiable shading formulation. We show that our approach
+effectively disentangles the intrinsic face parameters, producing relightable
+avatars. As a result, MoSAR estimates a richer set of skin reflectance maps,
+and generates more realistic avatars than existing state-of-the-art methods. We
+also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public
+dataset providing intrinsic face attributes at scale (diffuse, specular,
+ambient occlusion and translucency maps) for a total of 10k subjects. The
+project website and the dataset are available on the following link:
+https://ubisoft-laforge.github.io/character/mosar/
+
+
+ Nighttime unmanned aerial vehicle (UAV) tracking has been facilitated with
+indispensable plug-and-play low-light enhancers. However, the introduction of
+low-light enhancers increases the extra computational burden for the UAV,
+significantly hindering the development of real-time UAV applications.
+Meanwhile, these state-of-the-art (SOTA) enhancers lack tight coupling with the
+advanced daytime UAV tracking approach. To solve the above issues, this work
+proposes a novel mutual-learning knowledge distillation framework for nighttime
+UAV tracking, i.e., MLKD. This framework is constructed to learn a compact and
+fast nighttime tracker via knowledge transferring from the teacher and
+knowledge sharing among various students. Specifically, an advanced teacher
+based on a SOTA enhancer and a superior tracking backbone is adopted for
+guiding the student based only on the tight coupling-aware tracking backbone to
+directly extract nighttime object features. To address the biased learning of a
+single student, diverse lightweight students with different distillation
+methods are constructed to focus on various aspects of the teacher's knowledge.
+Moreover, an innovative mutual-learning room is designed to elect the superior
+student candidate to assist the remaining students frame-by-frame in the
+training phase. Furthermore, the final best student, i.e., MLKD-Track, is
+selected through the testing dataset. Extensive experiments demonstrate the
+effectiveness and superiority of MLKD and MLKD-Track. The practicality of the
+MLKD-Track is verified in real-world tests with different challenging
+situations. The code is available at https://github.com/lyfeng001/MLKD.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 8
+
+
+
+
+
+ ☆ Multi-view user representation learning for user matching without
+ personal information
+
+
+
+
+
+
+
+
+ Hongliu Cao, Ilias El Baamrani, Eoin Thomas
+
+
+ As the digitization of travel industry accelerates, analyzing and
+understanding travelers' behaviors becomes increasingly important. However,
+traveler data frequently exhibit high data sparsity due to the relatively low
+frequency of user interactions with travel providers. Compounding this effect
+the multiplication of devices, accounts and platforms while browsing travel
+products online also leads to data dispersion. To deal with these challenges,
+probabilistic traveler matching can be used. Most existing solutions for user
+matching are not suitable for traveler matching as a traveler's browsing
+history is typically short and URLs in the travel industry are very
+heterogeneous with many tokens. To deal with these challenges, we propose the
+similarity based multi-view information fusion to learn a better user
+representation from URLs by treating the URLs as multi-view data. The
+experimental results show that the proposed multi-view user representation
+learning can take advantage of the complementary information from different
+views, highlight the key information in URLs and perform significantly better
+than other representation learning solutions for the user matching task.
+
+
+
+
+
+
+
+ ☆ On the Effectiveness of Unlearning in Session-Based Recommendation
+
+
+ Session-based recommendation predicts users' future interests from previous
+interactions in a session. Despite the memorizing of historical samples, the
+request of unlearning, i.e., to remove the effect of certain training samples,
+also occurs for reasons such as user privacy or model fidelity. However,
+existing studies on unlearning are not tailored for the session-based
+recommendation. On the one hand, these approaches cannot achieve satisfying
+unlearning effects due to the collaborative correlations and sequential
+connections between the unlearning item and the remaining items in the session.
+On the other hand, seldom work has conducted the research to verify the
+unlearning effectiveness in the session-based recommendation scenario. In this
+paper, we propose SRU, a session-based recommendation unlearning framework,
+which enables high unlearning efficiency, accurate recommendation performance,
+and improved unlearning effectiveness in session-based recommendation.
+Specifically, we first partition the training sessions into separate sub-models
+according to the similarity across the sessions, then we utilize an
+attention-based aggregation layer to fuse the hidden states according to the
+correlations between the session and the centroid of the data in the sub-model.
+To improve the unlearning effectiveness, we further propose three extra data
+deletion strategies, including collaborative extra deletion (CED), neighbor
+extra deletion (NED), and random extra deletion (RED). Besides, we propose an
+evaluation metric that measures whether the unlearning sample can be inferred
+after the data deletion to verify the unlearning effectiveness. We implement
+SRU with three representative session-based recommendation models and conduct
+experiments on three benchmark datasets. Experimental results demonstrate the
+effectiveness of our methods.
+
+
+
+
+
+
+
+
+ Zhenyang Li, Fan Liu, Yinwei Wei, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli
+
+
+ Recommendation algorithms forecast user preferences by correlating user and
+item representations derived from historical interaction patterns. In pursuit
+of enhanced performance, many methods focus on learning robust and independent
+representations by disentangling the intricate factors within interaction data
+across various modalities in an unsupervised manner. However, such an approach
+obfuscates the discernment of how specific factors (e.g., category or brand)
+influence the outcomes, making it challenging to regulate their effects. In
+response to this challenge, we introduce a novel method called Attribute-Driven
+Disentangled Representation Learning (short for AD-DRL), which explicitly
+incorporates attributes from different modalities into the disentangled
+representation learning process. By assigning a specific attribute to each
+factor in multimodal features, AD-DRL can disentangle the factors at both
+attribute and attribute-value levels. To obtain robust and independent
+representations for each factor associated with a specific attribute, we first
+disentangle the representations of features both within and across different
+modalities. Moreover, we further enhance the robustness of the representations
+by fusing the multimodal features of the same factor. Empirical evaluations
+conducted on three public real-world datasets substantiate the effectiveness of
+AD-DRL, as well as its interpretability and controllability.
+
+
+
+
+
+
+
+ ♻ ☆ Zero-1-to-3: Domain-level Zero-shot Cognitive Diagnosis via One Batch of
+ Early-bird Students towards Three Diagnostic Objectives AAAI2024
+
+
+ Cognitive diagnosis seeks to estimate the cognitive states of students by
+exploring their logged practice quiz data. It plays a pivotal role in
+personalized learning guidance within intelligent education systems. In this
+paper, we focus on an important, practical, yet often underexplored task:
+domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the
+absence of student practice logs in newly launched domains. Recent cross-domain
+diagnostic models have been demonstrated to be a promising strategy for DZCD.
+These methods primarily focus on how to transfer student states across domains.
+However, they might inadvertently incorporate non-transferable information into
+student representations, thereby limiting the efficacy of knowledge transfer.
+To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive
+diagnosis framework via one batch of early-bird students towards three
+diagnostic objectives. Our approach initiates with pre-training a diagnosis
+model with dual regularizers, which decouples student states into domain-shared
+and domain-specific parts. The shared cognitive signals can be transferred to
+the target domain, enriching the cognitive priors for the new domain, which
+ensures the cognitive state propagation objective. Subsequently, we devise a
+strategy to generate simulated practice logs for cold-start students through
+analyzing the behavioral patterns from early-bird students, fulfilling the
+domain-adaption goal. Consequently, we refine the cognitive states of
+cold-start students as diagnostic outcomes via virtual data, aligning with the
+diagnosis-oriented goal. Finally, extensive experiments on six real-world
+datasets highlight the efficacy of our model for DZCD and its practical
+application in question recommendation.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ Temporally and Distributionally Robust Optimization for Cold-start
+ Recommendation AAAI'24
+
+
+ Collaborative Filtering (CF) recommender models highly depend on user-item
+interactions to learn CF representations, thus falling short of recommending
+cold-start items. To address this issue, prior studies mainly introduce item
+features (e.g., thumbnails) for cold-start item recommendation. They learn a
+feature extractor on warm-start items to align feature representations with
+interactions, and then leverage the feature extractor to extract the feature
+representations of cold-start items for interaction prediction. Unfortunately,
+the features of cold-start items, especially the popular ones, tend to diverge
+from those of warm-start ones due to temporal feature shifts, preventing the
+feature extractor from accurately learning feature representations of
+cold-start items.
+ To alleviate the impact of temporal feature shifts, we consider using
+Distributionally Robust Optimization (DRO) to enhance the generation ability of
+the feature extractor. Nonetheless, existing DRO methods face an inconsistency
+issue: the worse-case warm-start items emphasized during DRO training might not
+align well with the cold-start item distribution. To capture the temporal
+feature shifts and combat this inconsistency issue, we propose a novel temporal
+DRO with new optimization objectives, namely, 1) to integrate a worst-case
+factor to improve the worst-case performance, and 2) to devise a shifting
+factor to capture the shifting trend of item features and enhance the
+optimization of the potentially popular groups in cold-start items. Substantial
+experiments on three real-world datasets validate the superiority of our
+temporal DRO in enhancing the generalization ability of cold-start recommender
+models. The code is available at https://github.com/Linxyhaha/TDRO/.
+
+
+
+ comment: Accepted by AAAI'24
+
+
+
+
+
+
+ ♻ ☆ Unexplored Frontiers: A Review of Empirical Studies of Exploratory
+ Search
+
+
+ This article reviews how empirical research of exploratory search is
+conducted. We investigated aspects of interdisciplinarity, study settings and
+evaluation methodologies from a systematically selected sample of 231
+publications from 2010-2021, including a total of 172 articles with empirical
+studies. Our results show that exploratory search is highly interdisciplinary,
+with the most frequently occurring publication venues including high impact
+venues in information science, information systems and human-computer
+interaction. However, taken in aggregate, the breadth of study settings
+investigated was limited. We found that a majority of studies (77%) focused on
+evaluating novel retrieval systems as opposed to investigating users' search
+processes. Furthermore, a disproportionate number of studies were based on
+scientific literature search (20.7%), a majority of which only considered
+searching for Computer Science articles. Study participants were generally from
+convenience samples, with 75% of studies composed exclusively of students and
+other academics. The methodologies used for evaluation were mostly
+quantitative, but lacked consistency between studies and validated
+questionnaires were rarely used. In discussion, we offer a critical analysis of
+our findings and suggest potential improvements for future exploratory search
+studies.
+
+
+
+
+
+
+
+ ♻ ☆ Framework to Automatically Determine the Quality of Open Data Catalogs
+
+
+ Data catalogs play a crucial role in modern data-driven organizations by
+facilitating the discovery, understanding, and utilization of diverse data
+assets. However, ensuring their quality and reliability is complex, especially
+in open and large-scale data environments. This paper proposes a framework to
+automatically determine the quality of open data catalogs, addressing the need
+for efficient and reliable quality assessment mechanisms. Our framework can
+analyze various core quality dimensions, such as accuracy, completeness,
+consistency, scalability, and timeliness, offer several alternatives for the
+assessment of compatibility and similarity across such catalogs as well as the
+implementation of a set of non-core quality dimensions such as provenance,
+readability, and licensing. The goal is to empower data-driven organizations to
+make informed decisions based on trustworthy and well-curated data assets. The
+source code that illustrates our approach can be downloaded from
+https://www.github.com/jorge-martinez-gil/dataq/.
+
+
+
+ comment: 27 pages
+
+
+
+
+
+
+ ♻ ☆ Adapting Large Language Models by Integrating Collaborative Semantics
+ for Recommendation
+
+
+ Recently, large language models (LLMs) have shown great potential in
+recommender systems, either improving existing recommendation models or serving
+as the backbone. However, there exists a large semantic gap between LLMs and
+recommender systems, since items to be recommended are often indexed by
+discrete identifiers (item ID) out of the LLM's vocabulary. In essence, LLMs
+capture language semantics while recommender systems imply collaborative
+semantics, making it difficult to sufficiently leverage the model capacity of
+LLMs for recommendation. To address this challenge, in this paper, we propose a
+new LLM-based recommendation model called LC-Rec, which can better integrate
+language and collaborative semantics for recommender systems. Our approach can
+directly generate items from the entire item set for recommendation, without
+relying on candidate items. Specifically, we make two major contributions in
+our approach. For item indexing, we design a learning-based vector quantization
+method with uniform semantic mapping, which can assign meaningful and
+non-conflicting IDs (called item indices) for items. For alignment tuning, we
+propose a series of specially designed tuning tasks to enhance the integration
+of collaborative semantics in LLMs. Our fine-tuning tasks enforce LLMs to
+deeply integrate language and collaborative semantics (characterized by the
+learned item indices), so as to achieve an effective adaptation to recommender
+systems. Extensive experiments demonstrate the effectiveness of our method,
+showing that our approach can outperform a number of competitive baselines
+including traditional recommenders and existing LLM-based recommenders. Our
+code is available at https://github.com/RUCAIBox/LC-Rec/.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 115
+
+
+
+
+
+ ☆ A Survey of Reinforcement Learning from Human Feedback
+
+
+
+
+
+
+
+
+ Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
+
+
+ Reinforcement learning from human feedback (RLHF) is a variant of
+reinforcement learning (RL) that learns from human feedback instead of relying
+on an engineered reward function. Building on prior work on the related setting
+of preference-based reinforcement learning (PbRL), it stands at the
+intersection of artificial intelligence and human-computer interaction. This
+positioning offers a promising avenue to enhance the performance and
+adaptability of intelligent systems while also improving the alignment of their
+objectives with human values. The training of Large Language Models (LLMs) has
+impressively demonstrated this potential in recent years, where RLHF played a
+decisive role in targeting the model's capabilities toward human objectives.
+This article provides a comprehensive overview of the fundamentals of RLHF,
+exploring the intricate dynamics between machine agents and human input. While
+recent focus has been on RLHF for LLMs, our survey adopts a broader
+perspective, examining the diverse applications and wide-ranging impact of the
+technique. We delve into the core principles that underpin RLHF, shedding light
+on the symbiotic relationship between algorithms and human feedback, and
+discuss the main research trends in the field. By synthesizing the current
+landscape of RLHF research, this article aims to provide researchers as well as
+practitioners with a comprehensive understanding of this rapidly growing field
+of research.
+
+
+ The rapid growth of machine learning has spurred legislative initiatives such
+as ``the Right to be Forgotten,'' allowing users to request data removal. In
+response, ``machine unlearning'' proposes the selective removal of unwanted
+data without the need for retraining from scratch. While the
+Neural-Tangent-Kernel-based (NTK-based) unlearning method excels in
+performance, it suffers from significant computational complexity, especially
+for large-scale models and datasets. Our work introduces ``Fast-NTK,'' a novel
+NTK-based unlearning algorithm that significantly reduces the computational
+complexity by incorporating parameter-efficient fine-tuning methods, such as
+fine-tuning batch normalization layers in a CNN or visual prompts in a vision
+transformer. Our experimental results demonstrate scalability to much larger
+neural networks and datasets (e.g., 88M parameters; 5k images), surpassing the
+limitations of previous full-model NTK-based approaches designed for smaller
+cases (e.g., 8M parameters; 500 images). Notably, our approach maintains a
+performance comparable to the traditional method of retraining on the retain
+set alone. Fast-NTK can thus enable for practical and scalable NTK-based
+unlearning in deep neural networks.
+
+
+
+ comment: 6 pages, 1 figure
+
+
+
+
+
+
+ ☆ Learning from higher-order statistics, efficiently: hypothesis tests,
+ random features, and neural networks
+
+
+
+
+
+
+
+
+ Eszter Székely, Lorenzo Bardone, Federica Gerace, Sebastian Goldt
+
+
+ Neural networks excel at discovering statistical patterns in high-dimensional
+data sets. In practice, higher-order cumulants, which quantify the non-Gaussian
+correlations between three or more variables, are particularly important for
+the performance of neural networks. But how efficient are neural networks at
+extracting features from higher-order cumulants? We study this question in the
+spiked cumulant model, where the statistician needs to recover a privileged
+direction or "spike" from the order-$p\ge 4$ cumulants of~$d$-dimensional
+inputs. We first characterise the fundamental statistical and computational
+limits of recovering the spike by analysing the number of samples~$n$ required
+to strongly distinguish between inputs from the spiked cumulant model and
+isotropic Gaussian inputs. We find that statistical distinguishability requires
+$n\gtrsim d$ samples, while distinguishing the two distributions in polynomial
+time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e.
+those covered by the low-degree conjecture. These results suggest the existence
+of a wide statistical-to-computational gap in this problem. Numerical
+experiments show that neural networks learn to distinguish the two
+distributions with quadratic sample complexity, while "lazy" methods like
+random features are not better than random guessing in this regime. Our results
+show that neural networks extract information from higher-order correlations in
+the spiked cumulant model efficiently, and reveal a large gap in the amount of
+data required by neural networks and random features to learn from higher-order
+cumulants.
+
+
+
+
+
+
+
+ ☆ A Novel Sampled Clustering Algorithm for Rice Phenotypic Data
+
+
+ Phenotypic (or Physical) characteristics of plant species are commonly used
+to perform clustering. In one of our recent works (Shastri et al. (2021)), we
+used a probabilistically sampled (using pivotal sampling) and spectrally
+clustered algorithm to group soybean species. These techniques were used to
+obtain highly accurate clusterings at a reduced cost. In this work, we extend
+the earlier algorithm to cluster rice species. We improve the base algorithm in
+three ways. First, we propose a new function to build the similarity matrix in
+Spectral Clustering. Commonly, a natural exponential function is used for this
+purpose. Based upon the spectral graph theory and the involved Cheeger's
+inequality, we propose the use a base "a" exponential function instead. This
+gives a similarity matrix spectrum favorable for clustering, which we support
+via an eigenvalue analysis.
+ Second, the function used to build the similarity matrix in Spectral
+Clustering was earlier scaled with a fixed factor (called global scaling).
+Based upon the idea of Zelnik-Manor and Perona (2004), we now use a factor that
+varies with matrix elements (called local scaling) and works better. Third, to
+compute the inclusion probability of a specie in the pivotal sampling
+algorithm, we had earlier used the notion of deviation that captured how far
+specie's characteristic values were from their respective base values (computed
+over all species). A maximum function was used before to find the base values.
+We now use a median function, which is more intuitive. We support this choice
+using a statistical analysis. With experiments on 1865 rice species, we
+demonstrate that in terms of silhouette values, our new Sampled Spectral
+Clustering is 61% better than Hierarchical Clustering (currently prevalent).
+Also, our new algorithm is significantly faster than Hierarchical Clustering
+due to the involved sampling.
+
+
+
+
+
+
+
+
+ James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller
+
+
+ Combining complementary sensor modalities is crucial to providing robust
+perception for safety-critical robotics applications such as autonomous driving
+(AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on
+monocular depth estimation which is a notoriously difficult task compared to
+using depth information from the lidar directly. Here, we find that this
+approach does not leverage depth as expected and show that naively improving
+depth estimation does not lead to improvements in object detection performance
+and that, strikingly, removing depth estimation altogether does not degrade
+object detection performance. This suggests that relying on monocular depth
+could be an unnecessary architectural bottleneck during camera-lidar fusion. In
+this work, we introduce a novel fusion method that bypasses monocular depth
+estimation altogether and instead selects and fuses camera and lidar features
+in a bird's-eye-view grid using a simple attention mechanism. We show that our
+model can modulate its use of camera features based on the availability of
+lidar features and that it yields better 3D object detection on the nuScenes
+dataset than baselines relying on monocular depth estimation.
+
+
+ The heightened emphasis on the regulation of deep generative models,
+propelled by escalating concerns pertaining to privacy and compliance with
+regulatory frameworks, underscores the imperative need for precise control
+mechanisms over these models. This urgency is particularly underscored by
+instances in which generative models generate outputs that encompass
+objectionable, offensive, or potentially injurious content. In response,
+machine unlearning has emerged to selectively forget specific knowledge or
+remove the influence of undesirable data subsets from pre-trained models.
+However, modern machine unlearning approaches typically assume access to model
+parameters and architectural details during unlearning, which is not always
+feasible. In multitude of downstream tasks, these models function as black-box
+systems, with inaccessible pre-trained parameters, architectures, and training
+data. In such scenarios, the possibility of filtering undesired outputs becomes
+a practical alternative. The primary goal of this study is twofold: first, to
+elucidate the relationship between filtering and unlearning processes, and
+second, to formulate a methodology aimed at mitigating the display of
+undesirable outputs generated from models characterized as black-box systems.
+Theoretical analysis in this study demonstrates that, in the context of
+black-box models, filtering can be seen as a form of weak unlearning. Our
+proposed \textbf{\textit{Feature Aware Similarity Thresholding(FAST)}} method
+effectively suppresses undesired outputs by systematically encoding the
+representation of unwanted features in the latent space.
+
+
+
+
+
+
+
+ ☆ DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus
+ Images
+
+
+
+
+
+
+
+
+ Yevgeniy Men, Jonathan Fhima, Leo Anthony Celi, Lucas Zago Ribeiro, Luis Filipe Nakayama, Joachim A. Behar
+
+
+ Diabetic retinopathy (DR) is a prevalent complication of diabetes associated
+with a significant risk of vision loss. Timely identification is critical to
+curb vision impairment. Algorithms for DR staging from digital fundus images
+(DFIs) have been recently proposed. However, models often fail to generalize
+due to distribution shifts between the source domain on which the model was
+trained and the target domain where it is deployed. A common and particularly
+challenging shift is often encountered when the source- and target-domain
+supports do not fully overlap. In this research, we introduce DRStageNet, a
+deep learning model designed to mitigate this challenge. We used seven publicly
+available datasets, comprising a total of 93,534 DFIs that cover a variety of
+patient demographics, ethnicities, geographic origins and comorbidities. We
+fine-tune DINOv2, a pretrained model of self-supervised vision transformer, and
+implement a multi-source domain fine-tuning strategy to enhance generalization
+performance. We benchmark and demonstrate the superiority of our method to two
+state-of-the-art benchmarks, including a recently published foundation model.
+We adapted the grad-rollout method to our regression task in order to provide
+high-resolution explainability heatmaps. The error analysis showed that 59\% of
+the main errors had incorrect reference labels. DRStageNet is accessible at URL
+[upon acceptance of the manuscript].
+
+
+
+
+
+
+
+ ☆ NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language
+ Models via Complexity Classes
+
+
+ Complex reasoning ability is one of the most important features of current
+LLMs, which has also been leveraged to play an integral role in complex
+decision-making tasks. Therefore, the investigation into the reasoning
+capabilities of Large Language Models (LLMs) is critical: numerous benchmarks
+have been established to assess the reasoning abilities of LLMs. However,
+current benchmarks are inadequate in offering a rigorous evaluation of the full
+extent of reasoning abilities that LLMs are capable of achieving. They are also
+prone to the risk of overfitting, as these benchmarks, being publicly
+accessible and static, allow models to potentially tailor their responses to
+specific benchmark metrics, thereby inflating their performance. Addressing
+these limitations, our research introduces a new benchmark, named NPHardEval.
+This benchmark is designed to evaluate the reasoning abilities of LLMs across a
+broad spectrum of 900 algorithmic questions, extending up to the NP-Hard
+complexity class. These questions are meticulously chosen to represent a wide
+range of complexity class below the NP-hard complexity class, offering a
+rigorous measure of the reasoning ability of LLMs. Through this study, we shed
+light on the current state of reasoning in LLMs, providing an objective and
+rigorous perspective through the comparison of LLMs' performance across complex
+classes. Moreover, this benchmark is designed with a dynamic update mechanism,
+where the datapoints are refreshed on a monthly basis. Such regular updates
+play a crucial role in mitigating the risk of LLMs overfitting to the
+benchmark, promoting a more accurate and reliable assessment of their reasoning
+capabilities. The benchmark dataset and code of NPHardEval are available at
+https://github.com/casmlab/NPHardEval.
+
+
+
+ comment: 22 pages, 6 figures, 2 tables
+
+
+
+
+
+
+ ☆ On rate-optimal classification from non-private and from private data
+
+
+
+
+
+
+
+
+ Balázs Csanád Csáji, László Györfi, Ambrus Tamás
+
+
+ In this paper we revisit the classical problem of classification, but impose
+privacy constraints. Under such constraints, the raw data
+$(X_1,Y_1),\ldots,(X_n,Y_n)$ cannot be directly observed, and all classifiers
+are functions of the randomised outcome of a suitable local differential
+privacy mechanism. The statistician is free to choose the form of this privacy
+mechanism, and here we add Laplace distributed noise to a discretisation of the
+location of each feature vector $X_i$ and to its label $Y_i$. The
+classification rule is the privatized version of the well-studied partitioning
+classification rule. In addition to the standard Lipschitz and margin
+conditions, a novel characteristic is introduced, by which the exact rate of
+convergence of the classification error probability is calculated, both for
+non-private and private data.
+
+
+
+
+
+
+
+ ☆ Sample Path Regularity of Gaussian Processes from the Covariance Kernel
+
+
+
+
+
+
+
+
+ Nathaël Da Costa, Marvin Pförtner, Lancelot Da Costa, Philipp Hennig
+
+
+ Gaussian processes (GPs) are the most common formalism for defining
+probability distributions over spaces of functions. While applications of GPs
+are myriad, a comprehensive understanding of GP sample paths, i.e. the function
+spaces over which they define a probability measure on, is lacking. In
+practice, GPs are not constructed through a probability measure, but instead
+through a mean function and a covariance kernel. In this paper we provide
+necessary and sufficient conditions on the covariance kernel for the sample
+paths of the corresponding GP to attain a given regularity. We use the
+framework of H\"older regularity as it grants us particularly straightforward
+conditions, which simplify further in the cases of stationary and isotropic
+GPs. We then demonstrate that our results allow for novel and unusually tight
+characterisations of the sample path regularities of the GPs commonly used in
+machine learning applications, such as the Mat\'ern GPs.
+
+
+ We propose SutraNets, a novel method for neural probabilistic forecasting of
+long-sequence time series. SutraNets use an autoregressive generative model to
+factorize the likelihood of long sequences into products of conditional
+probabilities. When generating long sequences, most autoregressive approaches
+suffer from harmful error accumulation, as well as challenges in modeling
+long-distance dependencies. SutraNets treat long, univariate prediction as
+multivariate prediction over lower-frequency sub-series. Autoregression
+proceeds across time and across sub-series in order to ensure coherent
+multivariate (and, hence, high-frequency univariate) outputs. Since sub-series
+can be generated using fewer steps, SutraNets effectively reduce error
+accumulation and signal path distances. We find SutraNets to significantly
+improve forecasting accuracy over competitive alternatives on six real-world
+datasets, including when we vary the number of sub-series and scale up the
+depth and width of the underlying sequence models.
+
+
+
+
+
+
+
+ ☆ Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
+
+
+
+
+
+
+
+
+ Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, Jun Wang
+
+
+ A key method for creating Artificial Intelligence (AI) agents is
+Reinforcement Learning (RL). However, constructing a standalone RL policy that
+maps perception to action directly encounters severe problems, chief among them
+being its lack of generality across multiple tasks and the need for a large
+amount of training data. The leading cause is that it cannot effectively
+integrate prior information into the perception-action cycle when devising the
+policy. Large language models (LLMs) emerged as a fundamental way to
+incorporate cross-domain knowledge into AI agents but lack crucial learning and
+adaptation toward specific decision problems. This paper presents a general
+framework model for integrating and learning structured reasoning into AI
+agents' policies. Our methodology is motivated by the modularity found in the
+human brain. The framework utilises the construction of intrinsic and extrinsic
+functions to add previous understandings of reasoning structures. It also
+provides the adaptive ability to learn models inside every module or function,
+consistent with the modular structure of cognitive processes. We describe the
+framework in-depth and compare it with other AI pipelines and existing
+frameworks. The paper explores practical applications, covering experiments
+that show the effectiveness of our method. Our results indicate that AI agents
+perform and adapt far better when organised reasoning and prior knowledge are
+embedded. This opens the door to more resilient and general AI agent systems.
+
+
+
+ comment: paper and appendix, 27 pages
+
+
+
+
+
+
+ ☆ Spatiotemporal-Linear: Towards Universal Multivariate Time Series
+ Forecasting
+
+
+ Within the field of complicated multivariate time series forecasting (TSF),
+popular techniques frequently rely on intricate deep learning architectures,
+ranging from transformer-based designs to recurrent neural networks. However,
+recent findings suggest that simple Linear models can surpass sophisticated
+constructs on diverse datasets. These models directly map observation to
+multiple future time steps, thereby minimizing error accumulation in iterative
+multi-step prediction. Yet, these models fail to incorporate spatial and
+temporal information within the data, which is critical for capturing patterns
+and dependencies that drive insightful predictions. This oversight often leads
+to performance bottlenecks, especially under specific sequence lengths and
+dataset conditions, preventing their universal application. In response, we
+introduce the SpatioTemporal-Linear (STL) framework. STL seamlessly integrates
+time-embedded and spatially-informed bypasses to augment the Linear-based
+architecture. These extra routes offer a more robust and refined regression to
+the data, particularly when the amount of observation is limited and the
+capacity of simple linear layers to capture dependencies declines. Empirical
+evidence highlights STL's prowess, outpacing both Linear and Transformer
+benchmarks across varied observation and prediction durations and datasets.
+Such robustness accentuates its suitability across a spectrum of applications,
+including but not limited to, traffic trajectory and rare disease progression
+forecasting. Through this discourse, we not only validate the STL's distinctive
+capacities to become a more general paradigm in multivariate time-series
+prediction using deep-learning techniques but also stress the need to tackle
+data-scarce prediction scenarios for universal application. Code will be made
+available.
+
+
+
+
+
+
+
+ ☆ Large Scale Traning of Graph Neural Networks for Optimal Markov-Chain
+ Partitioning Using the Kemeny Constant
+
+
+
+
+
+
+
+
+ Sam Alexander Martino, João Morado, Chenghao Li, Zhenghao Lu, Edina Rosta
+
+
+ Traditional clustering algorithms often struggle to capture the complex
+relationships within graphs and generalise to arbitrary clustering criteria.
+The emergence of graph neural networks (GNNs) as a powerful framework for
+learning representations of graph data provides new approaches to solving the
+problem. Previous work has shown GNNs to be capable of proposing partitionings
+using a variety of criteria, however, these approaches have not yet been
+extended to work on Markov chains or kinetic networks. These arise frequently
+in the study of molecular systems and are of particular interest to the
+biochemical modelling community. In this work, we propose several GNN-based
+architectures to tackle the graph partitioning problem for Markov Chains
+described as kinetic networks. This approach aims to minimize how much a
+proposed partitioning changes the Kemeny constant. We propose using an
+encoder-decoder architecture and show how simple GraphSAGE-based GNNs with
+linear layers can outperform much larger and more expressive attention-based
+models in this context. As a proof of concept, we first demonstrate the
+method's ability to cluster randomly connected graphs. We also use a linear
+chain architecture corresponding to a 1D free energy profile as our kinetic
+network. Subsequently, we demonstrate the effectiveness of our method through
+experiments on a data set derived from molecular dynamics. We compare the
+performance of our method to other partitioning techniques such as PCCA+. We
+explore the importance of feature and hyperparameter selection and propose a
+general strategy for large-scale parallel training of GNNs for discovering
+optimal graph partitionings.
+
+
+
+
+
+
+
+ ☆ Learning Lagrangian Multipliers for the Travelling Salesman Problem
+
+
+ Lagrangian relaxation is a versatile mathematical technique employed to relax
+constraints in an optimization problem, enabling the generation of dual bounds
+to prove the optimality of feasible solutions and the design of efficient
+propagators in constraint programming (such as the weighted circuit
+constraint). However, the conventional process of deriving Lagrangian
+multipliers (e.g., using subgradient methods) is often computationally
+intensive, limiting its practicality for large-scale or time-sensitive
+problems. To address this challenge, we propose an innovative unsupervised
+learning approach that harnesses the capabilities of graph neural networks to
+exploit the problem structure, aiming to generate accurate Lagrangian
+multipliers efficiently. We apply this technique to the well-known Held-Karp
+Lagrangian relaxation for the travelling salesman problem. The core idea is to
+predict accurate Lagrangian multipliers and to employ them as a warm start for
+generating Held-Karp relaxation bounds. These bounds are subsequently utilized
+to enhance the filtering process carried out by branch-and-bound algorithms. In
+contrast to much of the existing literature, which primarily focuses on finding
+feasible solutions, our approach operates on the dual side, demonstrating that
+learning can also accelerate the proof of optimality. We conduct experiments
+across various distributions of the metric travelling salesman problem,
+considering instances with up to 200 cities. The results illustrate that our
+approach can improve the filtering level of the weighted circuit global
+constraint, reduce the optimality gap by a factor two for unsolved instances up
+to a timeout, and reduce the execution time for solved instances by 10%.
+
+
+
+
+
+
+
+ ☆ Understanding the Regularity of Self-Attention with Optimal Transport
+
+
+
+
+
+
+
+
+ Valérie Castin, Pierre Ablin, Gabriel Peyré
+
+
+ Transformers and their multi-head attention mechanism have completely changed
+the machine learning landscape in just a few years, by outperforming
+state-of-art models in a wide range of domains. Still, little is known about
+their robustness from a theoretical perspective. We tackle this problem by
+studying the local Lipschitz constant of self-attention, that provides an
+attack-agnostic way of measuring the robustness of a neural network. We adopt a
+measure-theoretic framework, by viewing inputs as probability measures equipped
+with the Wasserstein distance. This allows us to generalize attention to inputs
+of infinite length, and to derive an upper bound and a lower bound on the
+Lipschitz constant of self-attention on compact sets. The lower bound
+significantly improves prior results, and grows more than exponentially with
+the radius of the compact set, which rules out the possibility of obtaining
+robustness guarantees without any additional constraint on the input space. Our
+results also point out that measures with a high local Lipschitz constant are
+typically made of a few diracs, with a very unbalanced distribution of mass.
+Finally, we analyze the stability of self-attention under perturbations that
+change the number of tokens, which appears to be a natural question in the
+measure-theoretic framework. In particular, we show that for some inputs,
+attacks that duplicate tokens before perturbing them are more efficient than
+attacks that simply move tokens. We call this phenomenon mass splitting.
+
+
+
+
+
+
+
+ ☆ PARDINUS: Weakly supervised discarding of photo-trapping empty images
+ based on autoencoders
+
+
+
+
+
+
+
+
+ David de la Rosa, Antonio J Rivera, María J del Jesus, Francisco Charte
+
+
+ Photo-trapping cameras are widely employed for wildlife monitoring. Those
+cameras take photographs when motion is detected to capture images where
+animals appear. A significant portion of these images are empty - no wildlife
+appears in the image. Filtering out those images is not a trivial task since it
+requires hours of manual work from biologists. Therefore, there is a notable
+interest in automating this task. Automatic discarding of empty photo-trapping
+images is still an open field in the area of Machine Learning. Existing
+solutions often rely on state-of-the-art supervised convolutional neural
+networks that require the annotation of the images in the training phase.
+PARDINUS (Weakly suPervised discARDINg of photo-trapping empty images based on
+aUtoencoderS) is constructed on the foundation of weakly supervised learning
+and proves that this approach equals or even surpasses other fully supervised
+methods that require further labeling work.
+
+
+
+
+
+
+
+ ☆ The Effects of Signal-to-Noise Ratio on Generative Adversarial Networks
+ Applied to Marine Bioacoustic Data
+
+
+
+
+
+
+
+
+ Georgia Atkinson, Nick Wright, A. Stephen McGough, Per Berggren
+
+
+ In recent years generative adversarial networks (GANs) have been used to
+supplement datasets within the field of marine bioacoustics. This is driven by
+factors such as the cost to collect data, data sparsity and aid preprocessing.
+One notable challenge with marine bioacoustic data is the low signal-to-noise
+ratio (SNR) posing difficulty when applying deep learning techniques such as
+GANs. This work investigates the effect SNR has on the audio-based GAN
+performance and examines three different evaluation methodologies for GAN
+performance, yielding interesting results on the effects of SNR on GANs,
+specifically WaveGAN.
+
+
+
+ comment: 6 pages, 6 figures
+
+
+
+
+
+
+ ☆ On support vector machines under a multiple-cost scenario
+
+
+ Support Vector Machine (SVM) is a powerful tool in binary classification,
+known to attain excellent misclassification rates. On the other hand, many
+realworld classification problems, such as those found in medical diagnosis,
+churn or fraud prediction, involve misclassification costs which may be
+different in the different classes. However, it may be hard for the user to
+provide precise values for such misclassification costs, whereas it may be much
+easier to identify acceptable misclassification rates values. In this paper we
+propose a novel SVM model in which misclassification costs are considered by
+incorporating performance constraints in the problem formulation. Specifically,
+our aim is to seek the hyperplane with maximal margin yielding
+misclassification rates below given threshold values. Such maximal margin
+hyperplane is obtained by solving a quadratic convex problem with linear
+constraints and integer variables. The reported numerical experience shows that
+our model gives the user control on the misclassification rates in one class
+(possibly at the expense of an increase in misclassification rates for the
+other class) and is feasible in terms of running times.
+
+
+
+
+
+
+
+ ☆ The Rate-Distortion-Perception-Classification Tradeoff: Joint Source
+ Coding and Modulation via Inverse-Domain GANs
+
+
+
+
+
+
+
+
+ Junli Fang, João F. C. Mota, Baoshan Lu, Weicheng Zhang, Xuemin Hong
+
+
+ The joint source coding and modulation (JSCM) framework was enabled by recent
+developments in deep learning, which allows to automatically learn from data,
+and in an end-to-end fashion, the best compression codes and modulation
+schemes. In this paper, we show the existence of a strict tradeoff between
+channel rate, distortion, perception, and classification accuracy in a JSCM
+scenario. We then propose two image compression methods to navigate that
+tradeoff: an inverse-domain generative adversarial network (ID-GAN), which
+achieves extreme compression, and a simpler, heuristic method that reveals
+insights about the performance of ID-GAN. Experiment results not only
+corroborate the theoretical findings, but also demonstrate that the proposed
+ID-GAN algorithm significantly improves system performance compared to
+traditional separation-based methods and recent deep JSCM architectures.
+
+
+
+
+
+
+
+ ☆ Integration Of Evolutionary Automated Machine Learning With Structural
+ Sensitivity Analysis For Composite Pipelines
+
+
+
+
+
+
+
+
+ Nikolay O. Nikitin, Maiia Pinchuk, Valerii Pokrovskii, Peter Shevchenko, Andrey Getmanov, Yaroslav Aksenkin, Ilia Revin, Andrey Stebenkov, Ekaterina Poslavskaya, Anna V. Kalyuzhnaya
+
+
+ Automated machine learning (AutoML) systems propose an end-to-end solution to
+a given machine learning problem, creating either fixed or flexible pipelines.
+Fixed pipelines are task independent constructs: their general composition
+remains the same, regardless of the data. In contrast, the structure of
+flexible pipelines varies depending on the input, making them finely tailored
+to individual tasks. However, flexible pipelines can be structurally
+overcomplicated and have poor explainability. We propose the EVOSA approach
+that compensates for the negative points of flexible pipelines by incorporating
+a sensitivity analysis which increases the robustness and interpretability of
+the flexible solutions. EVOSA quantitatively estimates positive and negative
+impact of an edge or a node on a pipeline graph, and feeds this information to
+the evolutionary AutoML optimizer. The correctness and efficiency of EVOSA was
+validated in tabular, multimodal and computer vision tasks, suggesting
+generalizability of the proposed approach across domains.
+
+
+
+
+
+
+
+ ☆ Large Language Model (LLM) Bias Index -- LLMBI
+
+
+ The Large Language Model Bias Index (LLMBI) is a pioneering approach designed
+to quantify and address biases inherent in large language models (LLMs), such
+as GPT-4. We recognise the increasing prevalence and impact of LLMs across
+diverse sectors. This research introduces a novel metric, LLMBI, to
+systematically measure and mitigate biases potentially skewing model responses.
+We formulated LLMBI using a composite scoring system incorporating multiple
+dimensions of bias, including but not limited to age, gender, and racial
+biases.
+ To operationalise this metric, we engaged in a multi-step process involving
+collecting and annotating LLM responses, applying sophisticated Natural
+Language Processing (NLP) techniques for bias detection, and computing the
+LLMBI score through a specially crafted mathematical formula. The formula
+integrates weighted averages of various bias dimensions, a penalty for dataset
+diversity deficiencies, and a correction for sentiment biases. Our empirical
+analysis, conducted using responses from OpenAI's API, employs advanced
+sentiment analysis as a representative method for bias detection.
+ The research reveals LLMs, whilst demonstrating impressive capabilities in
+text generation, exhibit varying degrees of bias across different dimensions.
+LLMBI provides a quantifiable measure to compare biases across models and over
+time, offering a vital tool for systems engineers, researchers and regulators
+in enhancing the fairness and reliability of LLMs. It highlights the potential
+of LLMs in mimicking unbiased human-like responses. Additionally, it
+underscores the necessity of continuously monitoring and recalibrating such
+models to align with evolving societal norms and ethical standards.
+
+
+
+
+
+
+
+
+ Long Shi, Lei Cao, Jun Wang, Badong Chen
+
+
+ Latent multi-view subspace clustering has been demonstrated to have desirable
+clustering performance. However, the original latent representation method
+vertically concatenates the data matrices from multiple views into a single
+matrix along the direction of dimensionality to recover the latent
+representation matrix, which may result in an incomplete information recovery.
+To fully recover the latent space representation, we in this paper propose an
+Enhanced Latent Multi-view Subspace Clustering (ELMSC) method. The ELMSC method
+involves constructing an augmented data matrix that enhances the representation
+of multi-view data. Specifically, we stack the data matrices from various views
+into the block-diagonal locations of the augmented matrix to exploit the
+complementary information. Meanwhile, the non-block-diagonal entries are
+composed based on the similarity between different views to capture the
+consistent information. In addition, we enforce a sparse regularization for the
+non-diagonal blocks of the augmented self-representation matrix to avoid
+redundant calculations of consistency information. Finally, a novel iterative
+algorithm based on the framework of Alternating Direction Method of Multipliers
+(ADMM) is developed to solve the optimization problem for ELMSC. Extensive
+experiments on real-world datasets demonstrate that our proposed ELMSC is able
+to achieve higher clustering performance than some state-of-art multi-view
+clustering methods.
+
+
+
+
+
+
+
+ ☆ Diffusion Maps for Signal Filtering in Graph Learning
+
+
+ This paper explores the application diffusion maps as graph shift operators
+in understanding the underlying geometry of graph signals. The study evaluates
+the improvements in graph learning when using diffusion map generated filters
+to the Markov Variation minimization problem. The paper showcases the
+effectiveness of this approach through examples involving synthetically
+generated and real-world temperature sensor data. These examples also compare
+the diffusion map graph signal model with other commonly used graph signal
+operators. The results provide new approaches for the analysis and
+understanding of complex, non-Euclidean data structures.
+
+
+
+
+
+
+
+ ☆ Hazards from Increasingly Accessible Fine-Tuning of Downloadable
+ Foundation Models NeurIPS 2023
+
+
+
+
+
+
+
+
+ Alan Chan, Ben Bucknall, Herbie Bradley, David Krueger
+
+
+ Public release of the weights of pretrained foundation models, otherwise
+known as downloadable access \citep{solaiman_gradient_2023}, enables
+fine-tuning without the prohibitive expense of pretraining. Our work argues
+that increasingly accessible fine-tuning of downloadable models may increase
+hazards. First, we highlight research to improve the accessibility of
+fine-tuning. We split our discussion into research that A) reduces the
+computational cost of fine-tuning and B) improves the ability to share that
+cost across more actors. Second, we argue that increasingly accessible
+fine-tuning methods may increase hazard through facilitating malicious use and
+making oversight of models with potentially dangerous capabilities more
+difficult. Third, we discuss potential mitigatory measures, as well as benefits
+of more accessible fine-tuning. Given substantial remaining uncertainty about
+hazards, we conclude by emphasizing the urgent need for the development of
+mitigations.
+
+
+
+ comment: Accepted as a spotlight workshop paper at the Socially Responsible
+ Language Modelling Research (SoLaR) workshop, held at NeurIPS 2023
+
+
+
+
+
+
+ ☆ Progressing from Anomaly Detection to Automated Log Labeling and
+ Pioneering Root Cause Analysis ICDM 2023
+
+
+
+
+
+
+
+
+ Thorsten Wittkopp, Alexander Acker, Odej Kao
+
+
+ The realm of AIOps is transforming IT landscapes with the power of AI and ML.
+Despite the challenge of limited labeled data, supervised models show promise,
+emphasizing the importance of leveraging labels for training, especially in
+deep learning contexts. This study enhances the field by introducing a taxonomy
+for log anomalies and exploring automated data labeling to mitigate labeling
+challenges. It goes further by investigating the potential of diverse anomaly
+detection techniques and their alignment with specific anomaly types. However,
+the exploration doesn't stop at anomaly detection. The study envisions a future
+where root cause analysis follows anomaly detection, unraveling the underlying
+triggers of anomalies. This uncharted territory holds immense potential for
+revolutionizing IT systems management. In essence, this paper enriches our
+understanding of anomaly detection, and automated labeling, and sets the stage
+for transformative root cause analysis. Together, these advances promise more
+resilient IT systems, elevating operational efficiency and user satisfaction in
+an ever-evolving technological landscape.
+
+
+
+ comment: accepted at AIOPS workshop @ICDM 2023
+
+
+
+
+
+
+ ☆ Can Machines Learn Robustly, Privately, and Efficiently?
+
+
+
+
+
+
+
+
+ Youssef Allouah, Rachid Guerraoui, John Stephan
+
+
+ The success of machine learning (ML) applications relies on vast datasets and
+distributed architectures, which, as they grow, present challenges for ML. In
+real-world scenarios, where data often contains sensitive information, issues
+like data poisoning and hardware failures are common. Ensuring privacy and
+robustness is vital for the broad adoption of ML in public life. This paper
+examines the costs associated with achieving these objectives in distributed
+architectures. We overview the meanings of privacy and robustness in
+distributed ML, and clarify how they can be achieved efficiently in isolation.
+However, we contend that the integration of these objectives entails a notable
+compromise in computational efficiency. We delve into this intricate balance,
+exploring the challenges and solutions for privacy, robustness, and
+computational efficiency in ML applications.
+
+
+ Pulmonary embolism (PE) is a prevalent lung disease that can lead to right
+ventricular hypertrophy and failure in severe cases, ranking second in severity
+only to myocardial infarction and sudden death. Pulmonary artery CT angiography
+(CTPA) is a widely used diagnostic method for PE. However, PE detection
+presents challenges in clinical practice due to limitations in imaging
+technology. CTPA can produce noises similar to PE, making confirmation of its
+presence time-consuming and prone to overdiagnosis. Nevertheless, the
+traditional segmentation method of PE can not fully consider the hierarchical
+structure of features, local and global spatial features of PE CT images. In
+this paper, we propose an automatic PE segmentation method called SCUNet++
+(Swin Conv UNet++). This method incorporates multiple fusion dense skip
+connections between the encoder and decoder, utilizing the Swin Transformer as
+the encoder. And fuses features of different scales in the decoder subnetwork
+to compensate for spatial information loss caused by the inevitable
+downsampling in Swin-UNet or other state-of-the-art methods, effectively
+solving the above problem. We provide a theoretical analysis of this method in
+detail and validate it on publicly available PE CT image datasets FUMPE and
+CAD-PE. The experimental results indicate that our proposed method achieved a
+Dice similarity coefficient (DSC) of 83.47% and a Hausdorff distance 95th
+percentile (HD95) of 3.83 on the FUMPE dataset, as well as a DSC of 83.42% and
+an HD95 of 5.10 on the CAD-PE dataset. These findings demonstrate that our
+method exhibits strong performance in PE segmentation tasks, potentially
+enhancing the accuracy of automatic segmentation of PE and providing a powerful
+diagnostic tool for clinical physicians. Our source code and new FUMPE dataset
+are available at https://github.com/JustlfC03/SCUNet-plusplus.
+
+
+
+ comment: 10 pages, 7 figures, accept wacv2024
+
+
+
+
+
+
+ ☆ Time-changed normalizing flows for accurate SDE modeling
+
+
+
+
+
+
+
+
+ Naoufal El Bekri, Lucas Drumetz, Franck Vermet
+
+
+ The generative paradigm has become increasingly important in machine learning
+and deep learning models. Among popular generative models are normalizing
+flows, which enable exact likelihood estimation by transforming a base
+distribution through diffeomorphic transformations. Extending the normalizing
+flow framework to handle time-indexed flows gave dynamic normalizing flows, a
+powerful tool to model time series, stochastic processes, and neural stochastic
+differential equations (SDEs). In this work, we propose a novel variant of
+dynamic normalizing flows, a Time Changed Normalizing Flow (TCNF), based on
+time deformation of a Brownian motion which constitutes a versatile and
+extensive family of Gaussian processes. This approach enables us to effectively
+model some SDEs, that cannot be modeled otherwise, including standard ones such
+as the well-known Ornstein-Uhlenbeck process, and generalizes prior
+methodologies, leading to improved results and better inference and prediction
+capability.
+
+
+
+
+
+
+
+ ☆ A Mathematical Guide to Operator Learning
+
+
+ Operator learning aims to discover properties of an underlying dynamical
+system or partial differential equation (PDE) from data. Here, we present a
+step-by-step guide to operator learning. We explain the types of problems and
+PDEs amenable to operator learning, discuss various neural network
+architectures, and explain how to employ numerical PDE solvers effectively. We
+also give advice on how to create and manage training data and conduct
+optimization. We offer intuition behind the various neural network
+architectures employed in operator learning by motivating them from the
+point-of-view of numerical linear algebra.
+
+
+
+
+
+
+
+
+ Raffaele Marino, Lorenzo Buffoni, Lorenzo Chicchi, Lorenzo Giambagli, Duccio Fanelli
+
+
+ EODECA (Engineered Ordinary Differential Equations as Classification
+Algorithm) is a novel approach at the intersection of machine learning and
+dynamical systems theory, presenting a unique framework for classification
+tasks [1]. This method stands out with its dynamical system structure,
+utilizing ordinary differential equations (ODEs) to efficiently handle complex
+classification challenges. The paper delves into EODECA's dynamical properties,
+emphasizing its resilience against random perturbations and robust performance
+across various classification scenarios. Notably, EODECA's design incorporates
+the ability to embed stable attractors in the phase space, enhancing
+reliability and allowing for reversible dynamics. In this paper, we carry out a
+comprehensive analysis by expanding on the work [1], and employing a Euler
+discretization scheme. In particular, we evaluate EODECA's performance across
+five distinct classification problems, examining its adaptability and
+efficiency. Significantly, we demonstrate EODECA's effectiveness on the MNIST
+and Fashion MNIST datasets, achieving impressive accuracies of $98.06\%$ and
+$88.21\%$, respectively. These results are comparable to those of a multi-layer
+perceptron (MLP), underscoring EODECA's potential in complex data processing
+tasks. We further explore the model's learning journey, assessing its evolution
+in both pre and post training environments and highlighting its ability to
+navigate towards stable attractors. The study also investigates the
+invertibility of EODECA, shedding light on its decision-making processes and
+internal workings. This paper presents a significant step towards a more
+transparent and robust machine learning paradigm, bridging the gap between
+machine learning algorithms and dynamical systems methodologies.
+
+
+ Multimodal intent recognition aims to leverage diverse modalities such as
+expressions, body movements and tone of speech to comprehend user's intent,
+constituting a critical task for understanding human language and behavior in
+real-world multimodal scenarios. Nevertheless, the majority of existing methods
+ignore potential correlations among different modalities and own limitations in
+effectively learning semantic features from nonverbal modalities. In this
+paper, we introduce a token-level contrastive learning method with
+modality-aware prompting (TCL-MAP) to address the above challenges. To
+establish an optimal multimodal semantic environment for text modality, we
+develop a modality-aware prompting module (MAP), which effectively aligns and
+fuses features from text, video and audio modalities with similarity-based
+modality alignment and cross-modality attention mechanism. Based on the
+modality-aware prompt and ground truth labels, the proposed token-level
+contrastive learning framework (TCL) constructs augmented samples and employs
+NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal
+textual semantic insights derived from intent labels to guide the learning
+processes of other modalities in return. Extensive experiments show that our
+method achieves remarkable improvements compared to state-of-the-art methods.
+Additionally, ablation analyses demonstrate the superiority of the
+modality-aware prompt over the handcrafted prompt, which holds substantial
+significance for multimodal prompt learning. The codes are released at
+https://github.com/thuiar/TCL-MAP.
+
+
+
+ comment: Accepted by AAAI 2024 (Main Track, Long Paper)
+
+
+
+
+
+
+ ☆ Deep Non-Parametric Time Series Forecaster
+
+
+
+
+
+
+
+
+ Syama Sundar Rangapuram, Jan Gasthaus, Lorenzo Stella, Valentin Flunkert, David Salinas, Yuyang Wang, Tim Januschowski
+
+
+ This paper presents non-parametric baseline models for time series
+forecasting. Unlike classical forecasting models, the proposed approach does
+not assume any parametric form for the predictive distribution and instead
+generates predictions by sampling from the empirical distribution according to
+a tunable strategy. By virtue of this, the model is always able to produce
+reasonable forecasts (i.e., predictions within the observed data range) without
+fail unlike classical models that suffer from numerical stability on some data
+distributions. Moreover, we develop a global version of the proposed method
+that automatically learns the sampling strategy by exploiting the information
+across multiple related time series. The empirical evaluation shows that the
+proposed methods have reasonable and consistent performance across all
+datasets, proving them to be strong baselines to be considered in one's
+forecasting toolbox.
+
+
+
+
+
+
+
+ ☆ SAVAE: Leveraging the variational Bayes autoencoder for survival
+ analysis
+
+
+
+
+
+
+
+
+ Patricia A. Apellániz, Juan Parras, Santiago Zazo
+
+
+ As in many fields of medical research, survival analysis has witnessed a
+growing interest in the application of deep learning techniques to model
+complex, high-dimensional, heterogeneous, incomplete, and censored medical
+data. Current methods often make assumptions about the relations between data
+that may not be valid in practice. In response, we introduce SAVAE (Survival
+Analysis Variational Autoencoder), a novel approach based on Variational
+Autoencoders. SAVAE contributes significantly to the field by introducing a
+tailored ELBO formulation for survival analysis, supporting various parametric
+distributions for covariates and survival time (as long as the log-likelihood
+is differentiable). It offers a general method that consistently performs well
+on various metrics, demonstrating robustness and stability through different
+experiments. Our proposal effectively estimates time-to-event, accounting for
+censoring, covariate interactions, and time-varying risk associations. We
+validate our model in diverse datasets, including genomic, clinical, and
+demographic data, with varying levels of censoring. This approach demonstrates
+competitive performance compared to state-of-the-art techniques, as assessed by
+the Concordance Index and the Integrated Brier Score. SAVAE also offers an
+interpretable model that parametrically models covariates and time. Moreover,
+its generative architecture facilitates further applications such as
+clustering, data imputation, and the generation of synthetic patient data
+through latent space inference from survival data.
+
+
+ In today's digital world, Generative Artificial Intelligence (GenAI) such as
+Large Language Models (LLMs) is becoming increasingly prevalent, extending its
+reach across diverse applications. This surge in adoption has sparked a
+significant increase in demand for data-centric GenAI models, highlighting the
+necessity for robust data communication infrastructures. Central to this need
+are message brokers, which serve as essential channels for data transfer within
+various system components. This survey aims to delve into a comprehensive
+analysis of traditional and modern message brokers, offering a comparative
+study of prevalent platforms. Our study considers numerous criteria including,
+but not limited to, open-source availability, integrated monitoring tools,
+message prioritization mechanisms, capabilities for parallel processing,
+reliability, distribution and clustering functionalities, authentication
+processes, data persistence strategies, fault tolerance, and scalability.
+Furthermore, we explore the intrinsic constraints that the design and operation
+of each message broker might impose, recognizing that these limitations are
+crucial in understanding their real-world applicability. We then leverage these
+insights to propose a sophisticated message broker framework -- one designed
+with the adaptability and robustness necessary to meet the evolving requisites
+of GenAI applications. Finally, this study examines the enhancement of message
+broker mechanisms specifically for GenAI contexts, emphasizing the criticality
+of developing a versatile message broker framework. Such a framework would be
+poised for quick adaptation, catering to the dynamic and growing demands of
+GenAI in the foreseeable future. Through this dual-pronged approach, we intend
+to contribute a foundational compendium that can guide future innovations and
+infrastructural advancements in the realm of GenAI data communication.
+
+
+ Electronic health records (EHRs) have become the foundation of machine
+learning applications in healthcare, while the utility of real patient records
+is often limited by privacy and security concerns. Synthetic EHR generation
+provides an additional perspective to compensate for this limitation. Most
+existing methods synthesize new records based on real EHR data, without
+consideration of different types of events in EHR data, which cannot control
+the event combinations in line with medical common sense. In this paper, we
+propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR
+synthesis to address these limitations. First, we formulate the synthetic EHR
+generation process as a probabilistic graphical model and tightly connect
+different types of events by modeling the latent health states. Then, we derive
+a health state inference method tailored for the multi-visit scenario to
+effectively utilize previous records to synthesize current and future records.
+Furthermore, we propose to generate medical reports to add textual descriptions
+for each medical event, providing broader applications for synthesized EHR
+data. For generating different paragraphs in each visit, we incorporate a
+multi-generator deliberation framework to collaborate the message passing of
+multiple generators and employ a two-phase decoding strategy to generate
+high-quality reports. Our extensive experiments on the widely used benchmarks,
+MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results
+on the quality of synthetic data while maintaining low privacy risks.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ☆ Balancing Energy Efficiency and Distributional Robustness in
+ Over-the-Air Federated Learning
+
+
+
+
+
+
+
+
+ Mohamed Badi, Chaouki Ben Issaid, Anis Elgabli, Mehdi Bennis
+
+
+ The growing number of wireless edge devices has magnified challenges
+concerning energy, bandwidth, latency, and data heterogeneity. These challenges
+have become bottlenecks for distributed learning. To address these issues, this
+paper presents a novel approach that ensures energy efficiency for
+distributionally robust federated learning (FL) with over air computation
+(AirComp). In this context, to effectively balance robustness with energy
+efficiency, we introduce a novel client selection method that integrates two
+complementary insights: a deterministic one that is designed for energy
+efficiency, and a probabilistic one designed for distributional robustness.
+Simulation results underscore the efficacy of the proposed algorithm, revealing
+its superior performance compared to baselines from both robustness and energy
+efficiency perspectives, achieving more than 3-fold energy savings compared to
+the considered baselines.
+
+
+ We introduce Neural Flow Maps, a novel simulation method bridging the
+emerging paradigm of implicit neural representations with fluid simulation
+based on the theory of flow maps, to achieve state-of-the-art simulation of
+inviscid fluid phenomena. We devise a novel hybrid neural field representation,
+Spatially Sparse Neural Fields (SSNF), which fuses small neural networks with a
+pyramid of overlapping, multi-resolution, and spatially sparse grids, to
+compactly represent long-term spatiotemporal velocity fields at high accuracy.
+With this neural velocity buffer in hand, we compute long-term, bidirectional
+flow maps and their Jacobians in a mechanistically symmetric manner, to
+facilitate drastic accuracy improvement over existing solutions. These
+long-range, bidirectional flow maps enable high advection accuracy with low
+dissipation, which in turn facilitates high-fidelity incompressible flow
+simulations that manifest intricate vortical structures. We demonstrate the
+efficacy of our neural fluid simulation in a variety of challenging simulation
+scenarios, including leapfrogging vortices, colliding vortices, vortex
+reconnections, as well as vortex generation from moving obstacles and density
+differences. Our examples show increased performance over existing methods in
+terms of energy conservation, visual complexity, adherence to experimental
+observations, and preservation of detailed vortical structures.
+
+
+
+
+
+
+
+ ☆ Towards more sustainable enterprise data and application management with
+ cross silo Federated Learning and Analytics
+
+
+ To comply with new legal requirements and policies committed to privacy
+protection, more and more companies start to deploy cross-silo Federated
+Learning at global scale, where several clients/silos collaboratively train a
+global model under the coordination of a central server. Instead of data
+sharing and transmission, clients train models using their private local data
+and exchange model updates. However, there is little understanding of the
+carbon emission impact of cross silo Federated Learning due to the lack of
+related works. In this study, we first analyze the sustainability aspect of
+cross-silo Federated Learning, across the AI product life cycle instead of
+focusing only on the model training, with the comparison to the centralized
+method. A more holistic quantitative cost and CO2 emission estimation method
+for real world cross-silo Federated Learning setting is proposed. Secondly, we
+propose a novel data and application management system using cross silo
+Federated Learning and analytics to make IT companies more sustainable and cost
+effective.
+
+
+ The increasing reliance of drivers on navigation applications has made
+transportation networks more susceptible to data-manipulation attacks by
+malicious actors. Adversaries may exploit vulnerabilities in the data
+collection or processing of navigation services to inject false information,
+and to thus interfere with the drivers' route selection. Such attacks can
+significantly increase traffic congestions, resulting in substantial waste of
+time and resources, and may even disrupt essential services that rely on road
+networks. To assess the threat posed by such attacks, we introduce a
+computational framework to find worst-case data-injection attacks against
+transportation networks. First, we devise an adversarial model with a threat
+actor who can manipulate drivers by increasing the travel times that they
+perceive on certain roads. Then, we employ hierarchical multi-agent
+reinforcement learning to find an approximate optimal adversarial strategy for
+data manipulation. We demonstrate the applicability of our approach through
+simulating attacks on the Sioux Falls, ND network topology.
+
+
+
+
+
+
+
+ ☆ Explainable Multi-Camera 3D Object Detection with Transformer-Based
+ Saliency Maps
+
+
+ Vision Transformers (ViTs) have achieved state-of-the-art results on various
+computer vision tasks, including 3D object detection. However, their end-to-end
+implementation also makes ViTs less explainable, which can be a challenge for
+deploying them in safety-critical applications, such as autonomous driving,
+where it is important for authorities, developers, and users to understand the
+model's reasoning behind its predictions. In this paper, we propose a novel
+method for generating saliency maps for a DetR-like ViT with multiple camera
+inputs used for 3D object detection. Our method is based on the raw attention
+and is more efficient than gradient-based methods. We evaluate the proposed
+method on the nuScenes dataset using extensive perturbation tests and show that
+it outperforms other explainability methods in terms of visual quality and
+quantitative metrics. We also demonstrate the importance of aggregating
+attention across different layers of the transformer. Our work contributes to
+the development of explainable AI for ViTs, which can help increase trust in AI
+applications by establishing more transparency regarding the inner workings of
+AI models.
+
+
+
+
+
+
+
+ ☆ SIG: Speaker Identification in Literature via Prompt-Based Generation AAAI 2024
+
+
+
+
+
+
+
+
+ Zhenlin Su, Liyan Xu, Jin Xu, Jiangnan Li, Mingdu Huangfu
+
+
+ Identifying speakers of quotations in narratives is an important task in
+literary analysis, with challenging scenarios including the out-of-domain
+inference for unseen speakers, and non-explicit cases where there are no
+speaker mentions in surrounding context. In this work, we propose a simple and
+effective approach SIG, a generation-based method that verbalizes the task and
+quotation input based on designed prompt templates, which also enables easy
+integration of other auxiliary tasks that further bolster the speaker
+identification performance. The prediction can either come from direct
+generation by the model, or be determined by the highest generation probability
+of each speaker candidate. Based on our approach design, SIG supports
+out-of-domain evaluation, and achieves open-world classification paradigm that
+is able to accept any forms of candidate input. We perform both cross-domain
+evaluation and in-domain evaluation on PDNC, the largest dataset of this task,
+where empirical results suggest that SIG outperforms previous baselines of
+complicated designs, as well as the zero-shot ChatGPT, especially excelling at
+those hard non-explicit scenarios by up to 17% improvement. Additional
+experiments on another dataset WP further corroborate the efficacy of SIG.
+
+
+ The scope of this paper is generative modeling through diffusion processes.
+An approach falling within this paradigm is the work of Song et al. (2021),
+which relies on a time-reversal argument to construct a diffusion process
+targeting the desired data distribution. We show that the time-reversal
+argument, common to all denoising diffusion probabilistic modeling proposals,
+is not necessary. We obtain diffusion processes targeting the desired data
+distribution by taking appropriate mixtures of diffusion bridges. The resulting
+transport is exact by construction, allows for greater flexibility in choosing
+the dynamics of the underlying diffusion, and can be approximated by means of a
+neural network via novel training objectives. We develop a unifying view of the
+drift adjustments corresponding to our and to time-reversal approaches and make
+use of this representation to inspect the inner workings of diffusion-based
+generative models. Finally, we leverage on scalable simulation and inference
+techniques common in spatial statistics to move beyond fully factorial
+distributions in the underlying diffusion dynamics. The methodological advances
+contained in this work contribute toward establishing a general framework for
+generative modeling based on diffusion processes.
+
+
+
+ comment: original date: 18 Nov 2021; archival of ICLR submission
+ (https://openreview.net/forum?id=oVfIKuhqfC); no differences
+
+
+
+
+
+
+ ☆ MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning
+
+
+ Prompt learning has demonstrated impressive efficacy in the fine-tuning of
+multimodal large models to a wide range of downstream tasks. Nonetheless,
+applying existing prompt learning methods for the diagnosis of neurological
+disorder still suffers from two issues: (i) existing methods typically treat
+all patches equally, despite the fact that only a small number of patches in
+neuroimaging are relevant to the disease, and (ii) they ignore the structural
+information inherent in the brain connection network which is crucial for
+understanding and diagnosing neurological disorders. To tackle these issues, we
+introduce a novel prompt learning model by learning graph prompts during the
+fine-tuning process of multimodal large models for diagnosing neurological
+disorders. Specifically, we first leverage GPT-4 to obtain relevant disease
+concepts and compute semantic similarity between these concepts and all
+patches. Secondly, we reduce the weight of irrelevant patches according to the
+semantic similarity between each patch and disease-related concepts. Moreover,
+we construct a graph among tokens based on these concepts and employ a graph
+convolutional network layer to extract the structural information of the graph,
+which is used to prompt the pre-trained multimodal large models for diagnosing
+neurological disorders. Extensive experiments demonstrate that our method
+achieves superior performance for neurological disorder diagnosis compared with
+state-of-the-art methods and validated by clinicians.
+
+
+
+
+
+
+
+ ☆ Data is Moody: Discovering Data Modification Rules from Process Event
+ Logs
+
+
+
+
+
+
+
+
+ Marco Bjarne Schuster, Boris Wiegand, Jilles Vreeken
+
+
+ Although event logs are a powerful source to gain insight about the behavior
+of the underlying business process, existing work primarily focuses on finding
+patterns in the activity sequences of an event log, while ignoring event
+attribute data. Event attribute data has mostly been used to predict event
+occurrences and process outcome, but the state of the art neglects to mine
+succinct and interpretable rules how event attribute data changes during
+process execution. Subgroup discovery and rule-based classification approaches
+lack the ability to capture the sequential dependencies present in event logs,
+and thus lead to unsatisfactory results with limited insight into the process
+behavior.
+ Given an event log, we are interested in finding accurate yet succinct and
+interpretable if-then rules how the process modifies data. We formalize the
+problem in terms of the Minimum Description Length (MDL) principle, by which we
+choose the model with the best lossless description of the data. Additionally,
+we propose the greedy Moody algorithm to efficiently search for rules. By
+extensive experiments on both synthetic and real-world data, we show Moody
+indeed finds compact and interpretable rules, needs little data for accurate
+discovery, and is robust to noise.
+
+
+
+
+
+
+
+ ☆ Accelerated Convergence of Stochastic Heavy Ball Method under
+ Anisotropic Gradient Noise
+
+
+ Heavy-ball momentum with decaying learning rates is widely used with SGD for
+optimizing deep learning models. In contrast to its empirical popularity, the
+understanding of its theoretical property is still quite limited, especially
+under the standard anisotropic gradient noise condition for quadratic
+regression problems. Although it is widely conjectured that heavy-ball momentum
+method can provide accelerated convergence and should work well in large batch
+settings, there is no rigorous theoretical analysis. In this paper, we fill
+this theoretical gap by establishing a non-asymptotic convergence bound for
+stochastic heavy-ball methods with step decay scheduler on quadratic
+objectives, under the anisotropic gradient noise condition. As a direct
+implication, we show that heavy-ball momentum can provide
+$\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term
+of SGD while still achieving near-optimal convergence rate with respect to the
+stochastic variance term. The combined effect implies an overall convergence
+rate within log factors from the statistical minimax rate. This means SGD with
+heavy-ball momentum is useful in the large-batch settings such as distributed
+machine learning or federated learning, where a smaller number of iterations
+can significantly reduce the number of communication rounds, leading to
+acceleration in practice.
+
+
+ Designing online algorithms with machine learning predictions is a recent
+technique beyond the worst-case paradigm for various practically relevant
+online problems (scheduling, caching, clustering, ski rental, etc.). While most
+previous learning-augmented algorithm approaches focus on integrating the
+predictions of a single oracle, we study the design of online algorithms with
+\emph{multiple} experts. To go beyond the popular benchmark of a static best
+expert in hindsight, we propose a new \emph{dynamic} benchmark (linear
+combinations of predictions that change over time). We present a competitive
+algorithm in the new dynamic benchmark with a performance guarantee of $O(\log
+K)$, where $K$ is the number of experts, for $0-1$ online optimization
+problems. Furthermore, our multiple-expert approach provides a new perspective
+on how to combine in an online manner several online algorithms - a
+long-standing central subject in the online algorithm research community.
+
+
+
+
+
+
+
+ ☆ Machine learning for structure-guided materials and process design
+
+
+
+
+
+
+
+
+ Lukas Morand, Tarek Iraki, Johannes Dornheim, Stefan Sandfeld, Norbert Link, Dirk Helm
+
+
+ In recent years, there has been a growing interest in accelerated materials
+innovation in both, research and industry. However, to truly add value to the
+development of new advanced materials, it is inevitable to take into account
+manufacturing processes and thereby tailor materials design approaches to
+support downstream process design approaches. As a major step into this
+direction, we present a holistic optimization approach that covers the entire
+materials process-structure-property chain. Our approach specifically employs
+machine learning techniques to address two critical identification problems.
+The first is to solve a materials design problem, which involves identifying
+near-optimal material structures that exhibit desired macroscopic properties.
+The second is to solve a process design problem that is to find an optimal
+processing path to manufacture these material structures. Both identification
+problems are typically ill-posed, which presents a significant challenge for
+solution approaches. However, the non-unique nature of these problems also
+offers an important advantage for processing: By having several target
+structures that perform similarly well, the corresponding processes can be
+efficiently guided towards manufacturing the best reachable structure. In
+particular, we apply deep reinforcement learning for process design in
+combination with a multi-task learning-based optimization approach for
+materials design. The functionality of the approach will be demonstrated by
+using it to manufacture crystallographic textures with desired properties in a
+metal forming process.
+
+
+ Graph anomaly detection is crucial for identifying nodes that deviate from
+regular behavior within graphs, benefiting various domains such as fraud
+detection and social network. Although existing reconstruction-based methods
+have achieved considerable success, they may face the \textit{Anomaly
+Overfitting} and \textit{Homophily Trap} problems caused by the abnormal
+patterns in the graph, breaking the assumption that normal nodes are often
+better reconstructed than abnormal ones. Our observations indicate that models
+trained on graphs with fewer anomalies exhibit higher detection performance.
+Based on this insight, we introduce a novel two-stage framework called
+Anomaly-Denoised Autoencoders for Graph Anomaly Detection (ADA-GAD). In the
+first stage, we design a learning-free anomaly-denoised augmentation method to
+generate graphs with reduced anomaly levels. We pretrain graph autoencoders on
+these augmented graphs at multiple levels, which enables the graph autoencoders
+to capture normal patterns. In the next stage, the decoders are retrained for
+detection on the original graph, benefiting from the multi-level
+representations learned in the previous stage. Meanwhile, we propose the node
+anomaly distribution regularization to further alleviate \textit{Anomaly
+Overfitting}. We validate the effectiveness of our approach through extensive
+experiments on both synthetic and real-world datasets.
+
+
+
+ comment: Accepted to AAAI-2024
+
+
+
+
+
+
+ ☆ Multi-view user representation learning for user matching without
+ personal information
+
+
+
+
+
+
+
+
+ Hongliu Cao, Ilias El Baamrani, Eoin Thomas
+
+
+ As the digitization of travel industry accelerates, analyzing and
+understanding travelers' behaviors becomes increasingly important. However,
+traveler data frequently exhibit high data sparsity due to the relatively low
+frequency of user interactions with travel providers. Compounding this effect
+the multiplication of devices, accounts and platforms while browsing travel
+products online also leads to data dispersion. To deal with these challenges,
+probabilistic traveler matching can be used. Most existing solutions for user
+matching are not suitable for traveler matching as a traveler's browsing
+history is typically short and URLs in the travel industry are very
+heterogeneous with many tokens. To deal with these challenges, we propose the
+similarity based multi-view information fusion to learn a better user
+representation from URLs by treating the URLs as multi-view data. The
+experimental results show that the proposed multi-view user representation
+learning can take advantage of the complementary information from different
+views, highlight the key information in URLs and perform significantly better
+than other representation learning solutions for the user matching task.
+
+
+
+
+
+
+
+ ☆ DuaLight: Enhancing Traffic Signal Control by Leveraging
+ Scenario-Specific and Scenario-Shared Knowledge AAMAS2024
+
+
+ Reinforcement learning has been revolutionizing the traditional traffic
+signal control task, showing promising power to relieve congestion and improve
+efficiency. However, the existing methods lack effective learning mechanisms
+capable of absorbing dynamic information inherent to a specific scenario and
+universally applicable dynamic information across various scenarios. Moreover,
+within each specific scenario, they fail to fully capture the essential
+empirical experiences about how to coordinate between neighboring and target
+intersections, leading to sub-optimal system-wide outcomes.
+ Viewing these issues, we propose DuaLight, which aims to leverage both the
+experiential information within a single scenario and the generalizable
+information across various scenarios for enhanced decision-making.
+Specifically, DuaLight introduces a scenario-specific experiential weight
+module with two learnable parts: Intersection-wise and Feature-wise, guiding
+how to adaptively utilize neighbors and input features for each scenario, thus
+providing a more fine-grained understanding of different intersections.
+Furthermore, we implement a scenario-shared Co-Train module to facilitate the
+learning of generalizable dynamics information across different scenarios.
+Empirical results on both real-world and synthetic scenarios show DuaLight
+achieves competitive performance across various metrics, offering a promising
+solution to alleviate traffic congestion, with 3-7\% improvements. The code is
+available under: https://github.com/lujiaming-12138/DuaLight.
+
+
+
+ comment: Accepted by AAMAS2024
+
+
+
+
+
+
+ ☆ An effective and efficient green federated learning method for one-layer
+ neural networks
+
+
+
+
+
+
+
+
+ Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Elena Hernández-Pereira, Beatriz Pérez-Sánchez
+
+
+ Nowadays, machine learning algorithms continue to grow in complexity and
+require a substantial amount of computational resources and energy. For these
+reasons, there is a growing awareness of the development of new green
+algorithms and distributed AI can contribute to this. Federated learning (FL)
+is one of the most active research lines in machine learning, as it allows the
+training of collaborative models in a distributed way, an interesting option in
+many real-world environments, such as the Internet of Things, allowing the use
+of these models in edge computing devices. In this work, we present a FL
+method, based on a neural network without hidden layers, capable of generating
+a global collaborative model in a single training round, unlike traditional FL
+methods that require multiple rounds for convergence. This allows obtaining an
+effective and efficient model that simplifies the management of the training
+process. Moreover, this method preserve data privacy by design, a crucial
+aspect in current data protection regulations. We conducted experiments with
+large datasets and a large number of federated clients. Despite being based on
+a network model without hidden layers, it maintains in all cases competitive
+accuracy results compared to more complex state-of-the-art machine learning
+models. Furthermore, we show that the method performs equally well in both
+identically and non-identically distributed scenarios. Finally, it is an
+environmentally friendly algorithm as it allows significant energy savings
+during the training process compared to its centralized counterpart.
+
+
+
+
+
+
+
+ ☆ Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and
+ Spectral Optimal Transport
+
+
+ In neural audio signal processing, pitch conditioning has been used to
+enhance the performance of synthesizers. However, jointly training pitch
+estimators and synthesizers is a challenge when using standard audio-to-audio
+reconstruction loss, leading to reliance on external pitch trackers. To address
+this issue, we propose using a spectral loss function inspired by optimal
+transportation theory that minimizes the displacement of spectral energy. We
+validate this approach through an unsupervised autoencoding task that fits a
+harmonic template to harmonic signals. We jointly estimate the fundamental
+frequency and amplitudes of harmonics using a lightweight encoder and
+reconstruct the signals using a differentiable harmonic synthesizer. The
+proposed approach offers a promising direction for improving unsupervised
+parameter estimation in neural audio applications.
+
+
+
+
+
+
+
+ ☆ Theory of Hallucinations based on Equivariance
+
+
+ Equivariance is an important feature in machine learning, including language
+models. It ensures that any sequences of phrases with the same meanings are
+interpreted consistently. For example, the sentence 'There is a cat on the
+table' should be interpreted by language models as it is, regardless of
+variations in its token-level expression. Building on this insight, I propose a
+new theory suggesting that insufficient equivariance in language models can
+lead to hallucinations. According to this theory, which is both intuitive and
+novel, language models trained on relatively small datasets tend to
+misinterpret input texts and/or generate incorrect texts (i.e.,
+hallucinations). To test this theory, I developed a toy model known as 'dancing
+men', which is a character-level substitution cipher. Additionally, I propose a
+novel technique based on the T5 (Text To Text Transfer Transformer) model to
+efficiently decipher these codes without relying on frequency analysis. I have
+found that this T5 model can almost completely solve the cipher, demonstrating
+its ability to acquire equivariance in this frame. This method could be scaled
+up to word-level and sentence-level substitution ciphers, analogous to large
+language models without tokenizers or dictionaries. This scalability makes it
+suitable for investigating the proposed link between inadequate equivariance
+acquisition and the emergence of hallucinations.
+
+
+
+
+
+
+
+ ☆ Hutchinson Trace Estimation for High-Dimensional and High-Order
+ Physics-Informed Neural Networks
+
+
+
+
+
+
+
+
+ Zheyuan Hu, Zekun Shi, George Em Karniadakis, Kenji Kawaguchi
+
+
+ Physics-Informed Neural Networks (PINNs) have proven effective in solving
+partial differential equations (PDEs), especially when some data are available
+by blending seamlessly data and physics. However, extending PINNs to
+high-dimensional and even high-order PDEs encounters significant challenges due
+to the computational cost associated with automatic differentiation in the
+residual loss. Herein, we address the limitations of PINNs in handling
+high-dimensional and high-order PDEs by introducing Hutchinson Trace Estimation
+(HTE). Starting with the second-order high-dimensional PDEs ubiquitous in
+scientific computing, HTE transforms the calculation of the entire Hessian
+matrix into a Hessian vector product (HVP). This approach alleviates the
+computational bottleneck via Taylor-mode automatic differentiation and
+significantly reduces memory consumption from the Hessian matrix to HVP. We
+further showcase HTE's convergence to the original PINN loss and its unbiased
+behavior under specific conditions. Comparisons with Stochastic Dimension
+Gradient Descent (SDGD) highlight the distinct advantages of HTE, particularly
+in scenarios with significant variance among dimensions. We further extend HTE
+to higher-order and higher-dimensional PDEs, specifically addressing the
+biharmonic equation. By employing tensor-vector products (TVP), HTE efficiently
+computes the colossal tensor associated with the fourth-order high-dimensional
+biharmonic equation, saving memory and enabling rapid computation. The
+effectiveness of HTE is illustrated through experimental setups, demonstrating
+comparable convergence rates with SDGD under memory and speed constraints.
+Additionally, HTE proves valuable in accelerating the Gradient-Enhanced PINN
+(gPINN) version as well as the Biharmonic equation. Overall, HTE opens up a new
+capability in scientific machine learning for tackling high-order and
+high-dimensional PDEs.
+
+
+
+
+
+
+
+
+ Xuan Gong, Shanglin Li, Yuxiang Bao, Barry Yao, Yawen Huang, Ziyan Wu, Baochang Zhang, Yefeng Zheng, David Doermann
+
+
+ Federated learning (FL) is a machine learning paradigm in which distributed
+local nodes collaboratively train a central model without sharing individually
+held private data. Existing FL methods either iteratively share local model
+parameters or deploy co-distillation. However, the former is highly susceptible
+to private data leakage, and the latter design relies on the prerequisites of
+task-relevant real data. Instead, we propose a data-free FL framework based on
+local-to-central collaborative distillation with direct input and output space
+exploitation. Our design eliminates any requirement of recursive local
+parameter exchange or auxiliary task-relevant data to transfer knowledge,
+thereby giving direct privacy control to local users. In particular, to cope
+with the inherent data heterogeneity across locals, our technique learns to
+distill input on which each local model produces consensual yet unique results
+to represent each expertise. Our proposed FL framework achieves notable
+privacy-utility trade-offs with extensive experiments on image classification
+and segmentation tasks under various real-world heterogeneous federated
+learning settings on both natural and medical images.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ☆ Safe Reinforcement Learning with Instantaneous Constraints: The Role of
+ Aggressive Exploration
+
+
+ This paper studies safe Reinforcement Learning (safe RL) with linear function
+approximation and under hard instantaneous constraints where unsafe actions
+must be avoided at each step. Existing studies have considered safe RL with
+hard instantaneous constraints, but their approaches rely on several key
+assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state
+or knows a {\it safe graph} in which all the state-action-state triples are
+safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper,
+we consider safe RL with instantaneous hard constraints without assumption
+$(i)$ and generalize $(ii)$ to Reproducing Kernel Hilbert Space (RKHS). Our
+proposed algorithm, LSVI-AE, achieves $\tilde{\cO}(\sqrt{d^3H^4K})$ regret and
+$\tilde{\cO}(H \sqrt{dK})$ hard constraint violation when the cost function is
+linear and $\cO(H\gamma_K \sqrt{K})$ hard constraint violation when the cost
+function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length
+of each episode, and $\gamma_K$ is the information gain w.r.t the kernel used
+to approximate cost functions. Our results achieve the optimal dependency on
+the learning horizon $K$, matching the lower bound we provide in this paper and
+demonstrating the efficiency of LSVI-AE. Notably, the design of our approach
+encourages aggressive policy exploration, providing a unique perspective on
+safe RL with general cost functions and no prior knowledge of safe actions,
+which may be of independent interest.
+
+
+
+
+
+
+
+ ☆ Attacking Byzantine Robust Aggregation in High Dimensions
+
+
+ Training modern neural networks or models typically requires averaging over a
+sample of high-dimensional vectors. Poisoning attacks can skew or bias the
+average vectors used to train the model, forcing the model to learn specific
+patterns or avoid learning anything useful. Byzantine robust aggregation is a
+principled algorithmic defense against such biasing. Robust aggregators can
+bound the maximum bias in computing centrality statistics, such as mean, even
+when some fraction of inputs are arbitrarily corrupted. Designing such
+aggregators is challenging when dealing with high dimensions. However, the
+first polynomial-time algorithms with strong theoretical bounds on the bias
+have recently been proposed. Their bounds are independent of the number of
+dimensions, promising a conceptual limit on the power of poisoning attacks in
+their ongoing arms race against defenses.
+ In this paper, we show a new attack called HIDRA on practical realization of
+strong defenses which subverts their claim of dimension-independent bias. HIDRA
+highlights a novel computational bottleneck that has not been a concern of
+prior information-theoretic analysis. Our experimental evaluation shows that
+our attacks almost completely destroy the model performance, whereas existing
+attacks with the same goal fail to have much effect. Our findings leave the
+arms race between poisoning attacks and provable defenses wide open.
+
+
+
+
+
+
+
+ ☆ Multiagent Copilot Approach for Shared Autonomy between Human EEG and
+ TD3 Deep Reinforcement Learning
+
+
+ Deep reinforcement learning (RL) algorithms enable the development of fully
+autonomous agents that can interact with the environment. Brain-computer
+interface (BCI) systems decipher human implicit brain signals regardless of the
+explicit environment. In this study, we integrated deep RL and BCI to improve
+beneficial human interventions in autonomous systems and the performance in
+decoding brain activities by considering environmental factors. Shared autonomy
+was allowed between the action command decoded from the electroencephalography
+(EEG) of the human agent and the action generated from the twin delayed DDPG
+(TD3) agent for a given environment. Our proposed copilot control scheme with a
+full blocker (Co-FB) significantly outperformed the individual EEG (EEG-NB) or
+TD3 control. The Co-FB model achieved a higher target approaching score, lower
+failure rate, and lower human workload than the EEG-NB model. The Co-FB control
+scheme had a higher invisible target score and level of allowed human
+intervention than the TD3 model. We also proposed a disparity d-index to
+evaluate the effect of contradicting agent decisions on the control accuracy
+and authority of the copilot model. We found a significant correlation between
+the control authority of the TD3 agent and the performance improvement of human
+EEG classification with respect to the d-index. We also observed that shifting
+control authority to the TD3 agent improved performance when BCI decoding was
+not optimal. These findings indicate that the copilot system can effectively
+handle complex environments and that BCI performance can be improved by
+considering environmental factors. Future work should employ continuous action
+space and different multi-agent approaches to evaluate copilot performance.
+
+
+
+ comment: 14 pages, 6 figures
+
+
+
+
+
+
+ ☆ How to Overcome Curse-of-Dimensionality for Out-of-Distribution
+ Detection? AAAI 2024
+
+
+ Machine learning models deployed in the wild can be challenged by
+out-of-distribution (OOD) data from unknown classes. Recent advances in OOD
+detection rely on distance measures to distinguish samples that are relatively
+far away from the in-distribution (ID) data. Despite the promise,
+distance-based methods can suffer from the curse-of-dimensionality problem,
+which limits the efficacy in high-dimensional feature space. To combat this
+problem, we propose a novel framework, Subspace Nearest Neighbor (SNN), for OOD
+detection. In training, our method regularizes the model and its feature
+representation by leveraging the most relevant subset of dimensions (i.e.
+subspace). Subspace learning yields highly distinguishable distance measures
+between ID and OOD data. We provide comprehensive experiments and ablations to
+validate the efficacy of SNN. Compared to the current best distance-based
+method, SNN reduces the average FPR95 by 15.96% on the CIFAR-100 benchmark.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ DMC4ML: Data Movement Complexity for Machine Learning
+
+
+
+
+
+
+
+
+ Chen Ding, Christopher Kanan, Dylan McKellips, Toranosuke Ozawa, Arian Shahmirza, Wesley Smith
+
+
+ The greatest demand for today's computing is machine learning. This paper
+analyzes three machine learning algorithms: transformers, spatial convolution,
+and FFT. The analysis is novel in three aspects. First, it measures the cost of
+memory access on an abstract memory hierarchy, instead of traditional time or
+space complexity. Second, the analysis is asymptotic and identifies the primary
+sources of the memory cost. Finally, the result is symbolic, which can be used
+to select algorithmic parameters such as the group size in grouped query
+attention for any dimension size and number of heads and the batch size for
+batched convolution for any image size and kernel size.
+
+
+
+
+
+
+
+ ☆ Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
+
+
+ The widespread use of Text-to-Image (T2I) models in content generation
+requires careful examination of their safety, including their robustness to
+adversarial attacks. Despite extensive research into this, the reasons for
+their effectiveness are underexplored. This paper presents an empirical study
+on adversarial attacks against T2I models, focusing on analyzing factors
+associated with attack success rates (ASRs). We introduce a new attack
+objective - entity swapping using adversarial suffixes and two gradient-based
+attack algorithms. Human and automatic evaluations reveal the asymmetric nature
+of ASRs on entity swap: for example, it is easier to replace "human" with
+"robot" in the prompt "a human dancing in the rain." with an adversarial suffix
+but is significantly harder in reverse. We further propose probing metrics to
+establish indicative signals from the model's beliefs to the adversarial ASR.
+We identify conditions resulting in a 60% success probability for adversarial
+attacks and others where this likelihood drops below 5%.
+
+
+ When handling streaming graphs, existing graph representation learning models
+encounter a catastrophic forgetting problem, where previously learned knowledge
+of these models is easily overwritten when learning with newly incoming graphs.
+In response, Continual Graph Learning emerges as a novel paradigm enabling
+graph representation learning from static to streaming graphs. Our prior work,
+CaT is a replay-based framework with a balanced continual learning procedure,
+which designs a small yet effective memory bank for replaying data by
+condensing incoming graphs. Although the CaT alleviates the catastrophic
+forgetting problem, there exist three issues: (1) The graph condensation
+algorithm derived in CaT only focuses on labelled nodes while neglecting
+abundant information carried by unlabelled nodes; (2) The continual training
+scheme of the CaT overemphasises on the previously learned knowledge, limiting
+the model capacity to learn from newly added memories; (3) Both the
+condensation process and replaying process of the CaT are time-consuming. In
+this paper, we propose a psudo-label guided memory bank (PUMA) CGL framework,
+extending from the CaT to enhance its efficiency and effectiveness by
+overcoming the above-mentioned weaknesses and limits. To fully exploit the
+information in a graph, PUMA expands the coverage of nodes during graph
+condensation with both labelled and unlabelled nodes. Furthermore, a
+training-from-scratch strategy is proposed to upgrade the previous continual
+learning scheme for a balanced training between the historical and the new
+graphs. Besides, PUMA uses a one-time prorogation and wide graph encoders to
+accelerate the graph condensation and the graph encoding process in the
+training stage to improve the efficiency of the whole framework. Extensive
+experiments on four datasets demonstrate the state-of-the-art performance and
+efficiency over existing methods.
+
+
+
+ comment: The code has been released in https://github.com/superallen13/PUMA.
+ arXiv admin note: substantial text overlap with arXiv:2309.09455
+
+
+
+
+
+
+ ☆ PC-Conv: Unifying Homophily and Heterophily with Two-fold Filtering AAAI2024
+
+
+ Recently, many carefully crafted graph representation learning methods have
+achieved impressive performance on either strong heterophilic or homophilic
+graphs, but not both. Therefore, they are incapable of generalizing well across
+real-world graphs with different levels of homophily. This is attributed to
+their neglect of homophily in heterophilic graphs, and vice versa. In this
+paper, we propose a two-fold filtering mechanism to extract homophily in
+heterophilic graphs and vice versa. In particular, we extend the graph heat
+equation to perform heterophilic aggregation of global information from a long
+distance. The resultant filter can be exactly approximated by the
+Possion-Charlier (PC) polynomials. To further exploit information at multiple
+orders, we introduce a powerful graph convolution PC-Conv and its instantiation
+PCNet for the node classification task. Compared with state-of-the-art GNNs,
+PCNet shows competitive performance on well-known homophilic and heterophilic
+graphs. Our implementation is available at https://github.com/uestclbh/PC-Conv.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ REBEL: A Regularization-Based Solution for Reward Overoptimization in
+ Reinforcement Learning from Human Feedback
+
+
+ In this work, we propose REBEL, an algorithm for sample efficient reward
+regularization based robotic reinforcement learning from human feedback
+(RRLHF). Reinforcement learning (RL) performance for continuous control
+robotics tasks is sensitive to the underlying reward function. In practice, the
+reward function often ends up misaligned with human intent, values, social
+norms, etc., leading to catastrophic failures in the real world. We leverage
+human preferences to learn regularized reward functions and eventually align
+the agents with the true intended behavior. We introduce a novel notion of
+reward regularization to the existing RRLHF framework, which is termed as agent
+preferences. So, we not only consider human feedback in terms of preferences,
+we also propose to take into account the preference of the underlying RL agent
+while learning the reward function. We show that this helps to improve the
+over-optimization associated with the design of reward functions in RL. We
+experimentally show that REBEL exhibits up to 70% improvement in sample
+efficiency to achieve a similar level of episodic reward returns as compared to
+the state-of-the-art methods such as PEBBLE and PEBBLE+SURF.
+
+
+
+
+
+
+
+ ☆ Scalable 3D Reconstruction From Single Particle X-Ray Diffraction Images
+ Based on Online Machine Learning
+
+
+
+
+
+
+
+
+ Jay Shenoy, Axel Levy, Frédéric Poitevin, Gordon Wetzstein
+
+
+ X-ray free-electron lasers (XFELs) offer unique capabilities for measuring
+the structure and dynamics of biomolecules, helping us understand the basic
+building blocks of life. Notably, high-repetition-rate XFELs enable single
+particle imaging (X-ray SPI) where individual, weakly scattering biomolecules
+are imaged under near-physiological conditions with the opportunity to access
+fleeting states that cannot be captured in cryogenic or crystallized
+conditions. Existing X-ray SPI reconstruction algorithms, which estimate the
+unknown orientation of a particle in each captured image as well as its shared
+3D structure, are inadequate in handling the massive datasets generated by
+these emerging XFELs. Here, we introduce X-RAI, an online reconstruction
+framework that estimates the structure of a 3D macromolecule from large X-ray
+SPI datasets. X-RAI consists of a convolutional encoder, which amortizes pose
+estimation over large datasets, as well as a physics-based decoder, which
+employs an implicit neural representation to enable high-quality 3D
+reconstruction in an end-to-end, self-supervised manner. We demonstrate that
+X-RAI achieves state-of-the-art performance for small-scale datasets in
+simulation and challenging experimental settings and demonstrate its
+unprecedented ability to process large datasets containing millions of
+diffraction images in an online fashion. These abilities signify a paradigm
+shift in X-ray SPI towards real-time capture and reconstruction.
+
+
+ The recent emergence of large language models (LLMs) shows the potential for
+artificial general intelligence, revealing new opportunities in industry 4.0
+and smart manufacturing. However, a notable gap exists in applying these LLMs
+in industry, primarily due to their training on general knowledge rather than
+domain-specific knowledge. Such specialized domain knowledge is vital for
+effectively addressing the complex needs of industrial applications. To bridge
+this gap, this paper proposes an Industrial Large Knowledge Model (ILKM)
+framework emphasizing their potential to revolutionize the industry in smart
+manufacturing. In addition, ILKMs and LLMs are compared from eight
+perspectives. Finally, "6S Principle" is proposed as the guideline for the
+development of ILKMs in smart manufacturing.
+
+
+
+ comment: The paper has been submitted to Manufacturing Letters (Under Review)
+
+
+
+
+
+
+ ☆ Room Occupancy Prediction: Exploring the Power of Machine Learning and
+ Temporal Insights
+
+
+
+
+
+
+
+
+ Siqi Mao, Yaping Yuan, Yinpu Li, Ziren Wang, Yuanxin Yao, Yixin Kang
+
+
+ Energy conservation in buildings is a paramount concern to combat greenhouse
+gas emissions and combat climate change. The efficient management of room
+occupancy, involving actions like lighting control and climate adjustment, is a
+pivotal strategy to curtail energy consumption. In contexts where surveillance
+technology isn't viable, non-intrusive sensors are employed to estimate room
+occupancy. In this study, we present a predictive framework for room occupancy
+that leverages a diverse set of machine learning models, with Random Forest
+consistently achieving the highest predictive accuracy. Notably, this dataset
+encompasses both temporal and spatial dimensions, revealing a wealth of
+information. Intriguingly, our framework demonstrates robust performance even
+in the absence of explicit temporal modeling. These findings underscore the
+remarkable predictive power of traditional machine learning models. The success
+can be attributed to the presence of feature redundancy, the simplicity of
+linear spatial and temporal patterns, and the advantages of high-frequency data
+sampling. While these results are compelling, it's essential to remain open to
+the possibility that explicitly modeling the temporal dimension could unlock
+deeper insights or further enhance predictive capabilities in specific
+scenarios. In summary, our research not only validates the effectiveness of our
+prediction framework for continuous and classification tasks but also
+underscores the potential for improvements through the inclusion of temporal
+aspects. The study highlights the promise of machine learning in shaping
+energy-efficient practices and room occupancy management.
+
+
+
+
+
+
+
+ ☆ Sharp error estimates for target measure diffusion maps with
+ applications to the committor problem
+
+
+
+
+
+
+
+
+ Shashank Sule, Luke Evans, Maria Cameron
+
+
+ We obtain asymptotically sharp error estimates for the consistency error of
+the Target Measure Diffusion map (TMDmap) (Banisch et al. 2020), a variant of
+diffusion maps featuring importance sampling and hence allowing input data
+drawn from an arbitrary density. The derived error estimates include the bias
+error and the variance error. The resulting convergence rates are consistent
+with the approximation theory of graph Laplacians. The key novelty of our
+results lies in the explicit quantification of all the prefactors on
+leading-order terms. We also prove an error estimate for solutions of Dirichlet
+BVPs obtained using TMDmap, showing that the solution error is controlled by
+consistency error. We use these results to study an important application of
+TMDmap in the analysis of rare events in systems governed by overdamped
+Langevin dynamics using the framework of transition path theory (TPT). The
+cornerstone ingredient of TPT is the solution of the committor problem, a
+boundary value problem for the backward Kolmogorov PDE. Remarkably, we find
+that the TMDmap algorithm is particularly suited as a meshless solver to the
+committor problem due to the cancellation of several error terms in the
+prefactor formula. Furthermore, significant improvements in bias and variance
+errors occur when using a quasi-uniform sampling density. Our numerical
+experiments show that these improvements in accuracy are realizable in practice
+when using $\delta$-nets as spatially uniform inputs to the TMDmap algorithm.
+
+
+
+
+
+
+
+ ☆ Generative Pretraining at Scale: Transformer-Based Encoding of
+ Transactional Behavior for Fraud Detection
+
+
+
+
+
+
+
+
+ Ze Yu Zhao, Zheng Zhu, Guilin Li, Wenhan Wang, Bo Wang
+
+
+ In this work, we introduce an innovative autoregressive model leveraging
+Generative Pretrained Transformer (GPT) architectures, tailored for fraud
+detection in payment systems. Our approach innovatively confronts token
+explosion and reconstructs behavioral sequences, providing a nuanced
+understanding of transactional behavior through temporal and contextual
+analysis. Utilizing unsupervised pretraining, our model excels in feature
+representation without the need for labeled data. Additionally, we integrate a
+differential convolutional approach to enhance anomaly detection, bolstering
+the security and efficacy of one of the largest online payment merchants in
+China. The scalability and adaptability of our model promise broad
+applicability in various transactional contexts.
+
+
+
+
+
+
+
+ ☆ Graph Attention-Based Symmetry Constraint Extraction for Analog Circuits
+
+
+
+
+
+
+
+
+ Qi Xu, Lijie Wang, Jing Wang, Song Chen, Lin Cheng, Yi Kang
+
+
+ In recent years, analog circuits have received extensive attention and are
+widely used in many emerging applications. The high demand for analog circuits
+necessitates shorter circuit design cycles. To achieve the desired performance
+and specifications, various geometrical symmetry constraints must be carefully
+considered during the analog layout process. However, the manual labeling of
+these constraints by experienced analog engineers is a laborious and
+time-consuming process. To handle the costly runtime issue, we propose a
+graph-based learning framework to automatically extract symmetric constraints
+in analog circuit layout. The proposed framework leverages the connection
+characteristics of circuits and the devices'information to learn the general
+rules of symmetric constraints, which effectively facilitates the extraction of
+device-level constraints on circuit netlists. The experimental results
+demonstrate that compared to state-of-the-art symmetric constraint detection
+approaches, our framework achieves higher accuracy and lower false positive
+rate.
+
+
+
+
+
+
+
+ ☆ Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
+
+
+
+
+
+
+
+
+ Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
+
+
+ As the development of large-scale Generative AI models evolve beyond text
+(1D) generation to include image (2D) and video (3D) generation, processing
+spatial and temporal information presents unique challenges to quality,
+performance, and efficiency. We present the first work towards understanding
+this new system design space for multi-modal text-to-image (TTI) and
+text-to-video (TTV) generation models. Current model architecture designs are
+bifurcated into 2 categories: Diffusion- and Transformer-based models. Our
+systematic performance characterization on a suite of eight representative
+TTI/TTV models shows that after state-of-the-art optimization techniques such
+as Flash Attention are applied, Convolution accounts for up to 44% of execution
+time for Diffusion-based TTI models, while Linear layers consume up to 49% of
+execution time for Transformer-based models. We additionally observe that
+Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
+benefit from 1.1-2.5x greater speedup from Flash Attention than
+Transformer-based TTI models that resemble the Decode phase. Since
+optimizations designed for LLMs do not map directly onto TTI/TTV models, we
+must conduct a thorough characterization of these workloads to gain insights
+for new optimization opportunities. In doing so, we define sequence length in
+the context of TTI/TTV models and observe sequence length can vary up to 4x in
+Diffusion model inference. We additionally observe temporal aspects of TTV
+workloads pose unique system bottlenecks, with Temporal Attention accounting
+for over 60% of total Attention time. Overall, our in-depth system performance
+characterization is a critical first step towards designing efficient and
+deployable systems for emerging TTI/TTV workloads.
+
+
+ Federated learning enables joint training of machine learning models from
+distributed clients without sharing their local data. One key challenge in
+federated learning is to handle non-identically distributed data across the
+clients, which leads to deteriorated model training performances. Prior works
+in this line of research mainly focus on utilizing last-step global model
+parameters/gradients or the linear combinations of the past model
+parameters/gradients, which do not fully exploit the potential of global
+information from the model training trajectory. In this paper, we propose a
+novel federated learning framework with projected trajectory regularization
+(FedPTR) for tackling the data heterogeneity issue, which proposes a unique way
+to better extract the essential global information from the model training
+trajectory. Specifically, FedPTR allows local clients or the server to optimize
+an auxiliary (synthetic) dataset that mimics the learning dynamics of the
+recent model update and utilizes it to project the next-step model trajectory
+for local training regularization. We conduct rigorous theoretical analysis for
+our proposed framework under nonconvex stochastic settings to verify its fast
+convergence under heterogeneous data distributions. Experiments on various
+benchmark datasets and non-i.i.d. settings validate the effectiveness of our
+proposed framework.
+
+
+
+
+
+
+
+
+ Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu
+
+
+ Training large foundation models using self-supervised objectives on
+unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a
+standard procedure. Unfortunately, the efficacy of this approach is often
+constrained by both limited fine-tuning compute and scarcity in labeled
+downstream data. We introduce Multimodal Attention Merging (MAM), an attempt
+that facilitates direct knowledge transfer from attention matrices of models
+rooted in high resource modalities, text and images, to those in
+resource-constrained domains, speech and audio, employing a zero-shot paradigm.
+MAM reduces the relative Word Error Rate (WER) of an Automatic Speech
+Recognition (ASR) model by up to 6.70%, and relative classification error of an
+Audio Event Classification (AEC) model by 10.63%. In cases where some
+data/compute is available, we present Learnable-MAM, a data-driven approach to
+merging attention matrices, resulting in a further 2.90% relative reduction in
+WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning.
+
+
+
+ comment: 5 pages, 1 figure
+
+
+
+
+
+
+ ☆ Generative Models for Simulation of KamLAND-Zen
+
+
+
+
+
+
+
+
+ Z. Fu, C. Grant, D. M. Krawiec, A. Li, L. Winslow
+
+
+ The next generation of searches for neutrinoless double beta decay
+(0{\nu}\b{eta}\b{eta}) are poised to answer deep questions on the nature of
+neutrinos and the source of the Universe's matter-antimatter asymmetry. They
+will be looking for event rates of less than one event per ton of instrumented
+isotope per year. To claim discovery, accurate and efficient simulations of
+detector events that mimic 0{\nu}\b{eta}\b{eta} is critical. Traditional Monte
+Carlo (MC) simulations can be supplemented by machine-learning-based generative
+models. In this work, we describe the performance of generative models designed
+for monolithic liquid scintillator detectors like KamLAND to produce highly
+accurate simulation data without a predefined physics model. We demonstrate its
+ability to recover low-level features and perform interpolation. In the future,
+the results of these generative models can be used to improve event
+classification and background rejection by providing high-quality abundant
+generated data.
+
+
+
+ comment: Submitted to EPJC
+
+
+
+
+
+
+ ☆ Quality-Diversity Generative Sampling for Learning with Synthetic Data AAAI 2024
+
+
+
+
+
+
+
+
+ Allen Chang, Matthew C. Fontaine, Serena Booth, Maja J. Matarić, Stefanos Nikolaidis
+
+
+ Generative models can serve as surrogates for some real data sources by
+creating synthetic training datasets, but in doing so they may transfer biases
+to downstream tasks. We focus on protecting quality and diversity when
+generating synthetic training datasets. We propose quality-diversity generative
+sampling (QDGS), a framework for sampling data uniformly across a user-defined
+measure space, despite the data coming from a biased generator. QDGS is a
+model-agnostic framework that uses prompt guidance to optimize a quality
+objective across measures of diversity for synthetically generated data,
+without fine-tuning the generative model. Using balanced synthetic datasets
+generated by QDGS, we first debias classifiers trained on color-biased shape
+datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we
+prompt for desired semantic concepts, such as skin tone and age, to create an
+intersectional dataset with a combined blend of visual features. Leveraging
+this balanced data for training classifiers improves fairness while maintaining
+accuracy on facial recognition benchmarks. Code available at:
+https://github.com/Cylumn/qd-generative-sampling
+
+
+ Today's most powerful machine learning approaches are typically designed to
+train stateless architectures with predefined layers and differentiable
+activation functions. While these approaches have led to unprecedented
+successes in areas such as natural language processing and image recognition,
+the trained models are also susceptible to making mistakes that a human would
+not. In this paper, we take the view that true intelligence may require the
+ability of a machine learning model to manage internal state, but that we have
+not yet discovered the most effective algorithms for training such models. We
+further postulate that such algorithms might not necessarily be based on
+gradient descent over a deep architecture, but rather, might work best with an
+architecture that has discrete activations and few initial topological
+constraints (such as multiple predefined layers). We present one attempt in our
+ongoing efforts to design such a training algorithm, applied to an architecture
+with binary activations and only a single matrix of weights, and show that it
+is able to form useful representations of natural language text, but is also
+limited in its ability to leverage large quantities of training data. We then
+provide ideas for improving the algorithm and for designing other training
+algorithms for similar architectures. Finally, we discuss potential benefits
+that could be gained if an effective training algorithm is found, and suggest
+experiments for evaluating whether these benefits exist in practice.
+
+
+
+ comment: 5 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ Beyond Human Data: Scaling Self-Training for Problem-Solving with
+ Language Models
+
+
+
+
+
+
+
+
+ Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel
+
+
+ Fine-tuning language models~(LMs) on human-generated data remains a prevalent
+practice. However, the performance of such models is often limited by the
+quantity and diversity of high-quality human data. In this paper, we explore
+whether we can go beyond human data on tasks where we have access to scalar
+feedback, for example, on math problems where one can verify correctness. To do
+so, we investigate a simple self-training method based on
+expectation-maximization, which we call ReST$^{EM}$, where we (1) generate
+samples from the model and filter them using binary feedback, (2) fine-tune the
+model on these samples, and (3) repeat this process a few times. Testing on
+advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find
+that ReST$^{EM}$ scales favorably with model size and significantly surpasses
+fine-tuning only on human data. Overall, our findings suggest self-training
+with feedback can substantially reduce dependence on human-generated data.
+
+
+
+ comment: First three authors contributed equally
+
+
+
+
+
+
+ ♻ ☆ UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
+
+
+ Large Language Models (LLMs) have made the ambitious quest for generalist
+agents significantly far from being a fantasy. A key hurdle for building such
+general models is the diversity and heterogeneity of tasks and modalities. A
+promising solution is unification, allowing the support of a myriad of tasks
+and modalities within one unified framework. While few large models (e.g.,
+Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more
+than two modalities, current small to mid-scale unified models are still
+limited to 2 modalities, usually image-text or video-text. The question that we
+ask is: is it possible to build efficiently a unified model that can support
+all modalities? To answer this, we propose UnIVAL, a step further towards this
+ambitious goal. Without relying on fancy datasets sizes or models with billions
+of parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities
+and unifies text, images, video, and audio into a single model. Our model is
+efficiently pretrained on many tasks, based on task balancing and multimodal
+curriculum learning. UnIVAL shows competitive performance to existing
+state-of-the-art approaches, across image and video-text tasks. The feature
+representations learned from image and video-text modalities, allows the model
+to achieve competitive performance when finetuned on audio-text tasks, despite
+not being pretrained on audio. Thanks to the unified model, we propose a novel
+study on multimodal model merging via weight interpolation of models trained on
+different multimodal tasks, showing their benefits in particular for
+out-of-distribution generalization. Finally, we motivate unification by showing
+the synergy between tasks. The model weights and code are released here:
+https://github.com/mshukor/UnIVAL.
+
+
+
+
+
+
+
+ ♻ ☆ The Framework Tax: Disparities Between Inference Efficiency in NLP
+ Research and Deployment EMNLP 2023
+
+
+
+
+
+
+
+
+ Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell
+
+
+ Increased focus on the computational efficiency of NLP systems has motivated
+the design of efficient model architectures and improvements to underlying
+hardware accelerators. However, the resulting increases in computational
+throughput and reductions in floating point operations have not directly
+translated to improvements in wall-clock inference latency. We demonstrate that
+these discrepancies can be largely attributed to bottlenecks introduced by deep
+learning frameworks. We denote this phenomenon as the \textit{framework tax},
+and observe that the disparity is growing as hardware speed increases over
+time. In this work, we examine this phenomenon through a series of case studies
+analyzing the effects of model design decisions, framework paradigms, and
+hardware platforms on total model latency. Code is available at
+https://github.com/JaredFern/Framework-Tax.
+
+
+
+ comment: EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Next Steps for Human-Centered Generative AI: A Technical Perspective
+
+
+
+
+
+
+
+
+ Xiang 'Anthony' Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, Bolei Zhou
+
+
+ Through iterative, cross-disciplinary discussions, we define and propose
+next-steps for Human-centered Generative AI (HGAI). We contribute a
+comprehensive research agenda that lays out future directions of Generative AI
+spanning three levels: aligning with human values; assimilating human intents;
+and augmenting human abilities. By identifying these next-steps, we intend to
+draw interdisciplinary research teams to pursue a coherent set of emergent
+ideas in HGAI, focusing on their interested topics while maintaining a coherent
+big picture of the future work landscape.
+
+
+
+
+
+
+
+ ♻ ☆ Attesting Distributional Properties of Training Data for Machine
+ Learning
+
+
+
+
+
+
+
+
+ Vasisht Duddu, Anudeep Das, Nora Khayata, Hossein Yalame, Thomas Schneider, N. Asokan
+
+
+ The success of machine learning (ML) has been accompanied by increased
+concerns about its trustworthiness. Several jurisdictions are preparing ML
+regulatory frameworks. One such concern is ensuring that model training data
+has desirable distributional properties for certain sensitive attributes. For
+example, draft regulations indicate that model trainers are required to show
+that training datasets have specific distributional properties, such as
+reflecting diversity of the population.
+ We propose the notion of property attestation allowing a prover (e.g., model
+trainer) to demonstrate relevant distributional properties of training data to
+a verifier (e.g., a customer) without revealing the data. We present an
+effective hybrid property attestation combining property inference with
+cryptographic mechanisms.
+
+
+
+
+
+
+
+ ♻ ☆ Toward Generalizable Machine Learning Models in Speech, Language, and
+ Hearing Sciences: Estimating Sample Size and Reducing Overfitting
+
+
+
+
+
+
+
+
+ Hamzeh Ghasemzadeh, Robert E. Hillman, Daryush D. Mehta
+
+
+ This study's first purpose is to provide quantitative evidence that would
+incentivize researchers to instead use the more robust method of nested
+cross-validation. The second purpose is to present methods and MATLAB codes for
+doing power analysis for ML-based analysis during the design of a study. Monte
+Carlo simulations were used to quantify the interactions between the employed
+cross-validation method, the discriminative power of features, the
+dimensionality of the feature space, and the dimensionality of the model. Four
+different cross-validations (single holdout, 10-fold, train-validation-test,
+and nested 10-fold) were compared based on the statistical power and
+statistical confidence of the ML models. Distributions of the null and
+alternative hypotheses were used to determine the minimum required sample size
+for obtaining a statistically significant outcome ({\alpha}=0.05,
+1-\b{eta}=0.8). Statistical confidence of the model was defined as the
+probability of correct features being selected and hence being included in the
+final model. Our analysis showed that the model generated based on the single
+holdout method had very low statistical power and statistical confidence and
+that it significantly overestimated the accuracy. Conversely, the nested
+10-fold cross-validation resulted in the highest statistical confidence and the
+highest statistical power, while providing an unbiased estimate of the
+accuracy. The required sample size with a single holdout could be 50% higher
+than what would be needed if nested cross-validation were used. Confidence in
+the model based on nested cross-validation was as much as four times higher
+than the confidence in the single holdout-based model. A computational model,
+MATLAB codes, and lookup tables are provided to assist researchers with
+estimating the sample size during the design of their future studies.
+
+
+
+ comment: Accepted at JSLHR
+
+
+
+
+
+
+ ♻ ☆ Building Flexible, Scalable, and Machine Learning-ready Multimodal
+ Oncology Datasets
+
+
+ The advancements in data acquisition, storage, and processing techniques have
+resulted in the rapid growth of heterogeneous medical data. Integrating
+radiological scans, histopathology images, and molecular information with
+clinical data is essential for developing a holistic understanding of the
+disease and optimizing treatment. The need for integrating data from multiple
+sources is further pronounced in complex diseases such as cancer for enabling
+precision medicine and personalized treatments. This work proposes Multimodal
+Integration of Oncology Data System (MINDS) - a flexible, scalable, and
+cost-effective metadata framework for efficiently fusing disparate data from
+public sources such as the Cancer Research Data Commons (CRDC) into an
+interconnected, patient-centric framework. MINDS offers an interface for
+exploring relationships across data types and building cohorts for developing
+large-scale multimodal machine learning models. By harmonizing multimodal data,
+MINDS aims to potentially empower researchers with greater analytical ability
+to uncover diagnostic and prognostic insights and enable evidence-based
+personalized care. MINDS tracks granular end-to-end data provenance, ensuring
+reproducibility and transparency. The cloud-native architecture of MINDS can
+handle exponential data growth in a secure, cost-optimized manner while
+ensuring substantial storage optimization, replication avoidance, and dynamic
+access capabilities. Auto-scaling, access controls, and other mechanisms
+guarantee pipelines' scalability and security. MINDS overcomes the limitations
+of existing biomedical data silos via an interoperable metadata-driven approach
+that represents a pivotal step toward the future of oncology data integration.
+
+
+
+
+
+
+
+ ♻ ☆ On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and
+ Efficient Gradient Methods AAAI 2024
+
+
+
+
+
+
+
+
+ Anh Duc Nguyen, Tuan Dung Nguyen, Quang Minh Nguyen, Hoang H. Nguyen, Lam M. Nguyen, Kim-Chuan Toh
+
+
+ This paper studies the Partial Optimal Transport (POT) problem between two
+unbalanced measures with at most $n$ supports and its applications in various
+AI tasks such as color transfer or domain adaptation. There is hence the need
+for fast approximations of POT with increasingly large problem sizes in arising
+applications. We first theoretically and experimentally investigate the
+infeasibility of the state-of-the-art Sinkhorn algorithm for POT due to its
+incompatible rounding procedure, which consequently degrades its qualitative
+performance in real world applications like point-cloud registration. To this
+end, we propose a novel rounding algorithm for POT, and then provide a feasible
+Sinkhorn procedure with a revised computation complexity of
+$\mathcal{\widetilde O}(n^2/\varepsilon^4)$. Our rounding algorithm also
+permits the development of two first-order methods to approximate the POT
+problem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent
+(APDAGD), finds an $\varepsilon$-approximate solution to the POT problem in
+$\mathcal{\widetilde O}(n^{2.5}/\varepsilon)$, which is better in $\varepsilon$
+than revised Sinkhorn. The second method, Dual Extrapolation, achieves the
+computation complexity of $\mathcal{\widetilde O}(n^2/\varepsilon)$, thereby
+being the best in the literature. We further demonstrate the flexibility of POT
+compared to standard OT as well as the practicality of our algorithms on real
+applications where two marginal distributions are unbalanced.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Effects of cavity nonlinearities and linear losses on silicon
+ microring-based reservoir computing
+
+
+
+
+
+
+
+
+ Bernard J. Giron Castro, Christophe Peucheret, Darko Zibar, Francesco Da Ros
+
+
+ Microring resonators (MRRs) are promising devices for time-delay photonic
+reservoir computing, but the impact of the different physical effects taking
+place in the MRRs on the reservoir computing performance is yet to be fully
+understood. We numerically analyze the impact of linear losses as well as
+thermo-optic and free-carrier effects relaxation times on the prediction error
+of the time-series task NARMA-10. We demonstrate the existence of three
+regions, defined by the input power and the frequency detuning between the
+optical source and the microring resonance, that reveal the cavity transition
+from linear to nonlinear regimes. One of these regions offers very low error in
+time-series prediction under relatively low input power and number of nodes
+while the other regions either lack nonlinearity or become unstable. This study
+provides insight into the design of the MRR and the optimization of its
+physical properties for improving the prediction performance of time-delay
+reservoir computing.
+
+
+ Implicit representations such as Neural Radiance Fields (NeRF) have been
+shown to be very effective at novel view synthesis. However, these models
+typically require manual and careful human data collection for training. In
+this paper, we present AutoNeRF, a method to collect data required to train
+NeRFs using autonomous embodied agents. Our method allows an agent to explore
+an unseen environment efficiently and use the experience to build an implicit
+map representation autonomously. We compare the impact of different exploration
+strategies including handcrafted frontier-based exploration, end-to-end and
+modular approaches composed of trained high-level planners and classical
+low-level path followers. We train these models with different reward functions
+tailored to this problem and evaluate the quality of the learned
+representations on four different downstream tasks: classical viewpoint
+rendering, map reconstruction, planning, and pose refinement. Empirical results
+show that NeRFs can be trained on actively collected data using just a single
+episode of experience in an unseen environment, and can be used for several
+downstream robotic tasks, and that modular trained exploration models
+outperform other classical and end-to-end baselines. Finally, we show that
+AutoNeRF can reconstruct large-scale scenes, and is thus a useful tool to
+perform scene-specific adaptation as the produced 3D environment models can be
+loaded into a simulator to fine-tune a policy of interest.
+
+
+
+
+
+
+
+ ♻ ☆ RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
+
+
+
+
+
+
+
+
+ Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X. Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, Antoine Laurens, Claudio Fantacci, Valentin Dalibard, Martina Zambelli, Murilo Martins, Rugile Pevceviciute, Michiel Blokzijl, Misha Denil, Nathan Batchelor, Thomas Lampe, Emilio Parisotto, Konrad Żołna, Scott Reed, Sergio Gómez Colmenarejo, Jon Scholz, Abbas Abdolmaleki, Oliver Groth, Jean-Baptiste Regli, Oleg Sushkov, Tom Rothörl, José Enrique Chen, Yusuf Aytar, Dave Barker, Joy Ortiz, Martin Riedmiller, Jost Tobias Springenberg, Raia Hadsell, Francesco Nori, Nicolas Heess
+
+
+ The ability to leverage heterogeneous robotic experience from different
+robots and tasks to quickly master novel skills and embodiments has the
+potential to transform robot learning. Inspired by recent advances in
+foundation models for vision and language, we propose a multi-embodiment,
+multi-task generalist agent for robotic manipulation. This agent, named
+RoboCat, is a visual goal-conditioned decision transformer capable of consuming
+action-labelled visual experience. This data spans a large repertoire of motor
+control skills from simulated and real robotic arms with varying sets of
+observations and actions. With RoboCat, we demonstrate the ability to
+generalise to new tasks and robots, both zero-shot as well as through
+adaptation using only 100-1000 examples for the target task. We also show how a
+trained model itself can be used to generate data for subsequent training
+iterations, thus providing a basic building block for an autonomous improvement
+loop. We investigate the agent's capabilities, with large-scale evaluations
+both in simulation and on three different real robot embodiments. We find that
+as we grow and diversify its training data, RoboCat not only shows signs of
+cross-task transfer, but also becomes more efficient at adapting to new tasks.
+
+
+
+ comment: Transactions on Machine Learning Research (12/2023)
+
+
+
+
+
+
+
+ Hugo Henri Joseph Senetaire, Damien Garreau, Jes Frellsen, Pierre-Alexandre Mattei
+
+
+ A wide variety of model explanation approaches have been proposed in recent
+years, all guided by very different rationales and heuristics. In this paper,
+we take a new route and cast interpretability as a statistical inference
+problem. We propose a general deep probabilistic model designed to produce
+interpretable predictions. The model parameters can be learned via maximum
+likelihood, and the method can be adapted to any predictor network architecture
+and any type of prediction problem. Our method is a case of amortized
+interpretability models, where a neural network is used as a selector to allow
+for fast interpretation at inference time. Several popular interpretability
+methods are shown to be particular cases of regularised maximum likelihood for
+our general model. We propose new datasets with ground truth selection which
+allow for the evaluation of the features importance map. Using these datasets,
+we show experimentally that using multiple imputation provides more reasonable
+interpretations.
+
+
+ Since the rise of fair machine learning as a critical field of inquiry, many
+different notions on how to quantify and measure discrimination have been
+proposed in the literature. Some of these notions, however, were shown to be
+mutually incompatible. Such findings make it appear that numerous different
+kinds of fairness exist, thereby making a consensus on the appropriate measure
+of fairness harder to reach, hindering the applications of these tools in
+practice. In this paper, we investigate one of these key impossibility results
+that relates the notions of statistical and predictive parity. Specifically, we
+derive a new causal decomposition formula for the fairness measures associated
+with predictive parity, and obtain a novel insight into how this criterion is
+related to statistical parity through the legal doctrines of disparate
+treatment, disparate impact, and the notion of business necessity. Our results
+show that through a more careful causal analysis, the notions of statistical
+and predictive parity are not really mutually exclusive, but complementary and
+spanning a spectrum of fairness notions through the concept of business
+necessity. Finally, we demonstrate the importance of our findings on a
+real-world example.
+
+
+
+
+
+
+
+ ♻ ☆ DG-TTA: Out-of-domain medical image segmentation through Domain
+ Generalization and Test-Time Adaptation
+
+
+
+
+
+
+
+
+ Christian Weihsbach, Christian N. Kruse, Alexander Bigalke, Mattias P. Heinrich
+
+
+ Applying pre-trained medical segmentation models on out-of-domain images
+often yields predictions of insufficient quality. Several strategies have been
+proposed to maintain model performance, such as finetuning or unsupervised- and
+source-free domain adaptation. These strategies set restrictive requirements
+for data availability. In this study, we propose to combine domain
+generalization and test-time adaptation to create a highly effective approach
+for reusing pre-trained models in unseen target domains. Domain-generalized
+pre-training on source data is used to obtain the best initial performance in
+the target domain. We introduce the MIND descriptor previously used in image
+registration tasks as a further technique to achieve generalization and present
+superior performance for small-scale datasets compared to existing approaches.
+At test-time, high-quality segmentation for every single unseen scan is ensured
+by optimizing the model weights for consistency given different image
+augmentations. That way, our method enables separate use of source and target
+data and thus removes current data availability barriers. Moreover, the
+presented method is highly modular as it does not require specific model
+architectures or prior knowledge of involved domains and labels. We demonstrate
+this by integrating it into the nnUNet, which is currently the most popular and
+accurate framework for medical image segmentation. We employ multiple datasets
+covering abdominal, cardiac, and lumbar spine scans and compose several
+out-of-domain scenarios in this study. We demonstrate that our method, combined
+with pre-trained whole-body CT models, can effectively segment MR images with
+high accuracy in all of the aforementioned scenarios. Open-source code can be
+found here: https://github.com/multimodallearning/DG-TTA
+
+
+
+ comment: This work has been submitted to the IEEE for possible publication.
+ Copyright may be transferred without notice, after which this version may no
+ longer be accessible
+
+
+
+
+
+
+ ♻ ☆ A mathematical perspective on Transformers
+
+
+ Transformers play a central role in the inner workings of large language
+models. We develop a mathematical framework for analyzing Transformers based on
+their interpretation as interacting particle systems, which reveals that
+clusters emerge in long time. Our study explores the underlying theory and
+offers new perspectives for mathematicians as well as computer scientists.
+
+
+
+
+
+
+
+ ♻ ☆ Investigating the Corruption Robustness of Image Classifiers with Random
+ Lp-norm Corruptions
+
+
+ Robustness is a fundamental property of machine learning classifiers required
+to achieve safety and reliability. In the field of adversarial robustness of
+image classifiers, robustness is commonly defined as the stability of a model
+to all input changes within a p-norm distance. However, in the field of random
+corruption robustness, variations observed in the real world are used, while
+p-norm corruptions are rarely considered. This study investigates the use of
+random p-norm corruptions to augment the training and test data of image
+classifiers. We evaluate the model robustness against imperceptible random
+p-norm corruptions and propose a novel robustness metric. We empirically
+investigate whether robustness transfers across different p-norms and derive
+conclusions on which p-norm corruptions a model should be trained and
+evaluated. We find that training data augmentation with a combination of p-norm
+corruptions significantly improves corruption robustness, even on top of
+state-of-the-art data augmentation schemes.
+
+
+
+ comment: Camera-ready version submitted to VISAPP 2024
+
+
+
+
+
+
+ ♻ ☆ PriPrune: Quantifying and Preserving Privacy in Pruned Federated
+ Learning
+
+
+ Federated learning (FL) is a paradigm that allows several client devices and
+a server to collaboratively train a global model, by exchanging only model
+updates, without the devices sharing their local training data. These devices
+are often constrained in terms of communication and computation resources, and
+can further benefit from model pruning -- a paradigm that is widely used to
+reduce the size and complexity of models. Intuitively, by making local models
+coarser, pruning is expected to also provide some protection against privacy
+attacks in the context of FL. However this protection has not been previously
+characterized, formally or experimentally, and it is unclear if it is
+sufficient against state-of-the-art attacks.
+ In this paper, we perform the first investigation of privacy guarantees for
+model pruning in FL. We derive information-theoretic upper bounds on the amount
+of information leaked by pruned FL models. We complement and validate these
+theoretical findings, with comprehensive experiments that involve
+state-of-the-art privacy attacks, on several state-of-the-art FL pruning
+schemes, using benchmark datasets. This evaluation provides valuable insights
+into the choices and parameters that can affect the privacy protection provided
+by pruning. Based on these insights, we introduce PriPrune -- a privacy-aware
+algorithm for local model pruning, which uses a personalized per-client defense
+mask and adapts the defense pruning rate so as to jointly optimize privacy and
+model performance. PriPrune is universal in that can be applied after any
+pruned FL scheme on the client, without modification, and protects against any
+inversion attack by the server. Our empirical evaluation demonstrates that
+PriPrune significantly improves the privacy-accuracy tradeoff compared to
+state-of-the-art pruned FL schemes that do not take privacy into account.
+
+
+ Sharpness-aware minimization (SAM) has well documented merits in enhancing
+generalization of deep neural networks, even without sizable data augmentation.
+Embracing the geometry of the loss function, where neighborhoods of 'flat
+minima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing
+the maximum loss caused by an adversary perturbing parameters within the
+neighborhood. Although critical to account for sharpness of the loss function,
+such an 'over-friendly adversary' can curtail the outmost level of
+generalization. The novel approach of this contribution fosters stabilization
+of adversaries through variance suppression (VaSSO) to avoid such friendliness.
+VaSSO's provable stability safeguards its numerical improvement over SAM in
+model-agnostic tasks, including image classification and machine translation.
+In addition, experiments confirm that VaSSO endows SAM with robustness against
+high levels of label noise.
+
+
+ The dynamic Schr\"odinger bridge problem seeks a stochastic process that
+defines a transport between two target probability measures, while optimally
+satisfying the criteria of being closest, in terms of Kullback-Leibler
+divergence, to a reference process. We propose a novel sampling-based iterative
+algorithm, the iterated diffusion bridge mixture (IDBM) procedure, aimed at
+solving the dynamic Schr\"odinger bridge problem. The IDBM procedure exhibits
+the attractive property of realizing a valid transport between the target
+probability measures at each iteration. We perform an initial theoretical
+investigation of the IDBM procedure, establishing its convergence properties.
+The theoretical findings are complemented by numerical experiments illustrating
+the competitive performance of the IDBM procedure. Recent advancements in
+generative modeling employ the time-reversal of a diffusion process to define a
+generative process that approximately transports a simple distribution to the
+data distribution. As an alternative, we propose utilizing the first iteration
+of the IDBM procedure as an approximation-free method for realizing this
+transport. This approach offers greater flexibility in selecting the generative
+process dynamics and exhibits accelerated training and superior sample quality
+over larger discretization intervals. In terms of implementation, the necessary
+modifications are minimally intrusive, being limited to the training loss
+definition.
+
+
+ Meta-Bayesian optimisation (meta-BO) aims to improve the sample efficiency of
+Bayesian optimisation by leveraging data from related tasks. While previous
+methods successfully meta-learn either a surrogate model or an acquisition
+function independently, joint training of both components remains an open
+challenge. This paper proposes the first end-to-end differentiable meta-BO
+framework that generalises neural processes to learn acquisition functions via
+transformer architectures. We enable this end-to-end framework with
+reinforcement learning (RL) to tackle the lack of labelled acquisition data.
+Early on, we notice that training transformer-based neural processes from
+scratch with RL is challenging due to insufficient supervision, especially when
+rewards are sparse. We formalise this claim with a combinatorial analysis
+showing that the widely used notion of regret as a reward signal exhibits a
+logarithmic sparsity pattern in trajectory lengths. To tackle this problem, we
+augment the RL objective with an auxiliary task that guides part of the
+architecture to learn a valid probabilistic model as an inductive bias. We
+demonstrate that our method achieves state-of-the-art regret results against
+various baselines in experiments on standard hyperparameter optimisation tasks
+and also outperforms others in the real-world problems of mixed-integer
+programming tuning, antibody design, and logic synthesis for electronic design
+automation.
+
+
+
+
+
+
+
+ ♻ ☆ Model-based Clustering with Missing Not At Random Data
+
+
+ Model-based unsupervised learning, as any learning task, stalls as soon as
+missing data occurs. This is even more true when the missing data are
+informative, or said missing not at random (MNAR). In this paper, we propose
+model-based clustering algorithms designed to handle very general types of
+missing data, including MNAR data. To do so, we introduce a mixture model for
+different types of data (continuous, count, categorical and mixed) to jointly
+model the data distribution and the MNAR mechanism, remaining vigilant to the
+relative degrees of freedom of each. Several MNAR models are discussed, for
+which the cause of the missingness can depend on both the values of the missing
+variable themselves and on the class membership. However, we focus on a
+specific MNAR model, called MNARz, for which the missingness only depends on
+the class membership. We first underline its ease of estimation, by showing
+that the statistical inference can be carried out on the data matrix
+concatenated with the missing mask considering finally a standard MAR
+mechanism. Consequently, we propose to perform clustering using the Expectation
+Maximization algorithm, specially developed for this simplified
+reinterpretation. Finally, we assess the numerical performances of the proposed
+methods on synthetic data and on the real medical registry TraumaBase as well.
+
+
+ Reinforcement learning (RL) provides a powerful framework for
+decision-making, but its application in practice often requires a carefully
+designed reward function. Adversarial Imitation Learning (AIL) sheds light on
+automatic policy acquisition without access to the reward signal from the
+environment. In this work, we propose Auto-Encoding Adversarial Imitation
+Learning (AEAIL), a robust and scalable AIL framework. To induce expert
+policies from demonstrations, AEAIL utilizes the reconstruction error of an
+auto-encoder as a reward signal, which provides more information for optimizing
+policies than the prior discriminator-based ones. Subsequently, we use the
+derived objective functions to train the auto-encoder and the agent policy.
+Experiments show that our AEAIL performs superior compared to state-of-the-art
+methods on both state and image based environments. More importantly, AEAIL
+shows much better robustness when the expert demonstrations are noisy.
+
+
+
+ comment: 13 pages
+
+
+
+
+
+
+ ♻ ☆ PrNet: A Neural Network for Correcting Pseudoranges to Improve
+ Positioning with Android Raw GNSS Measurements
+
+
+ We present a neural network for mitigating biased errors in pseudoranges to
+improve localization performance with data collected from mobile phones. A
+satellite-wise Multilayer Perceptron (MLP) is designed to regress the
+pseudorange bias correction from six satellite, receiver, context-related
+features derived from Android raw Global Navigation Satellite System (GNSS)
+measurements. To train the MLP, we carefully calculate the target values of
+pseudorange bias using location ground truth and smoothing techniques and
+optimize a loss function involving the estimation residuals of smartphone clock
+bias. The corrected pseudoranges are then used by a model-based localization
+engine to compute locations. The Google Smartphone Decimeter Challenge (GSDC)
+dataset, which contains Android smartphone data collected from both rural and
+urban areas, is utilized for evaluation. Both fingerprinting and cross-trace
+localization results demonstrate that our proposed method outperforms
+model-based and state-of-the-art data-driven approaches.
+
+
+
+
+
+
+
+ ♻ ☆ Review of AlexNet for Medical Image Classification
+
+
+ In recent years, the rapid development of deep learning has led to a wide
+range of applications in the field of medical image classification. The
+variants of neural network models with ever-increasing performance share some
+commonalities: to try to mitigate overfitting, improve generalization, avoid
+gradient vanishing and exploding, etc. AlexNet first utilizes the dropout
+technique to mitigate overfitting and the ReLU activation function to avoid
+gradient vanishing. Therefore, we focus our discussion on AlexNet, which has
+contributed greatly to the development of CNNs in 2012. After reviewing over 40
+papers, including journal papers and conference papers, we give a narrative on
+the technical details, advantages, and application areas of AlexNet.
+
+
+
+
+
+
+
+ ♻ ☆ Meta Objective Guided Disambiguation for Partial Label Learning
+
+
+ Partial label learning (PLL) is a typical weakly supervised learning
+framework, where each training instance is associated with a candidate label
+set, among which only one label is valid. To solve PLL problems, typically
+methods try to perform disambiguation for candidate sets by either using prior
+knowledge, such as structure information of training data, or refining model
+outputs in a self-training manner. Unfortunately, these methods often fail to
+obtain a favorable performance due to the lack of prior information or
+unreliable predictions in the early stage of model training. In this paper, we
+propose a novel framework for partial label learning with meta objective guided
+disambiguation (MoGD), which aims to recover the ground-truth label from
+candidate labels set by solving a meta objective on a small validation set.
+Specifically, to alleviate the negative impact of false positive labels, we
+re-weight each candidate label based on the meta loss on the validation set.
+Then, the classifier is trained by minimizing the weighted cross entropy loss.
+The proposed method can be easily implemented by using various deep networks
+with the ordinary SGD optimizer. Theoretically, we prove the convergence
+property of meta objective and derive the estimation error bounds of the
+proposed method. Extensive experiments on various benchmark datasets and
+real-world PLL datasets demonstrate that the proposed method can achieve
+competent performance when compared with the state-of-the-art methods.
+
+
+
+
+
+
+
+
+ Dongyue Guo, Zheng Zhang, Zhen Yan, Jianwei Zhang, Yi Lin
+
+
+ Flight Trajectory Prediction (FTP) is an essential task in Air Traffic
+Control (ATC), which can assist air traffic controllers in managing airspace
+more safely and efficiently. Existing approaches generally perform
+multi-horizon FTP tasks in an autoregressive manner, thereby suffering from
+error accumulation and low-efficiency problems. In this paper, a novel
+framework, called FlightBERT++, is proposed to i) forecast multi-horizon flight
+trajectories directly in a non-autoregressive way, and ii) improve the
+limitation of the binary encoding (BE) representation in the FlightBERT.
+Specifically, the FlightBERT++ is implemented by a generalized encoder-decoder
+architecture, in which the encoder learns the temporal-spatial patterns from
+historical observations and the decoder predicts the flight status for the
+future horizons. Compared with conventional architecture, an innovative
+horizon-aware contexts generator is dedicatedly designed to consider the prior
+horizon information, which further enables non-autoregressive multi-horizon
+prediction. Moreover, a differential prompted decoder is proposed to enhance
+the capability of the differential predictions by leveraging the stationarity
+of the differential sequence. The experimental results on a real-world dataset
+demonstrated that the FlightBERT++ outperformed the competitive baselines in
+both FTP performance and computational efficiency.
+
+
+ Forward invariance is a long-studied property in control theory that is used
+to certify that a dynamical system stays within some pre-specified set of
+states for all time, and also admits robustness guarantees (e.g., the
+certificate holds under perturbations). We propose a general framework for
+training and provably certifying robust forward invariance in Neural ODEs. We
+apply this framework to provide certified safety in robust continuous control.
+To our knowledge, this is the first instance of training Neural ODE policies
+with such non-vacuous certified guarantees. In addition, we explore the
+generality of our framework by using it to certify adversarial robustness for
+image classification.
+
+
+
+
+
+
+
+ ♻ ☆ Backdoor Attack with Sparse and Invisible Trigger
+
+
+ Deep neural networks (DNNs) are vulnerable to backdoor attacks, where the
+adversary manipulates a small portion of training data such that the victim
+model predicts normally on the benign samples but classifies the triggered
+samples as the target class. The backdoor attack is an emerging yet threatening
+training-phase threat, leading to serious risks in DNN-based applications. In
+this paper, we revisit the trigger patterns of existing backdoor attacks. We
+reveal that they are either visible or not sparse and therefore are not
+stealthy enough. More importantly, it is not feasible to simply combine
+existing methods to design an effective sparse and invisible backdoor attack.
+To address this problem, we formulate the trigger generation as a bi-level
+optimization problem with sparsity and invisibility constraints and propose an
+effective method to solve it. The proposed method is dubbed sparse and
+invisible backdoor attack (SIBA). We conduct extensive experiments on benchmark
+datasets under different settings, which verify the effectiveness of our attack
+and its resistance to existing backdoor defenses. The codes for reproducing
+main experiments are available at \url{https://github.com/YinghuaGao/SIBA}.
+
+
+
+ comment: The first two authors contributed equally to this work. 13 pages
+
+ In recent years, trust region on-policy reinforcement learning has achieved
+impressive results in addressing complex control tasks and gaming scenarios.
+However, contemporary state-of-the-art algorithms within this category
+primarily emphasize improvement in expected performance, lacking the ability to
+control over the worst-case performance outcomes. To address this limitation,
+we introduce a novel objective function; by optimizing which, it will lead to
+guaranteed monotonic improvement in the lower bound of near-total performance
+samples (absolute performance). Considering this groundbreaking theoretical
+advancement, we then refine this theoretically grounded algorithm through a
+series of approximations, resulting in a practical solution called Absolute
+Policy Optimization (APO). Our experiments demonstrate the effectiveness of our
+approach across challenging continuous control benchmark tasks and extend its
+applicability to mastering Atari games. Our findings reveal that APO
+significantly outperforms state-of-the-art policy gradient algorithms,
+resulting in substantial improvements in both expected performance and
+worst-case performance.
+
+
+
+ comment: submission to Journal of Machine Learning Research
+
+
+
+
+
+
+ ♻ ☆ Constructing Custom Thermodynamics Using Deep Learning
+
+
+
+
+
+
+
+
+ Xiaoli Chen, Beatrice W. Soh, Zi-En Ooi, Eleonore Vissol-Gaudin, Haijun Yu, Kostya S. Novoselov, Kedar Hippalgaonkar, Qianxiao Li
+
+
+ One of the most exciting applications of artificial intelligence (AI) is
+automated scientific discovery based on previously amassed data, coupled with
+restrictions provided by known physical principles, including symmetries and
+conservation laws. Such automated hypothesis creation and verification can
+assist scientists in studying complex phenomena, where traditional physical
+intuition may fail. Here we develop a platform based on a generalized Onsager
+principle to learn macroscopic dynamical descriptions of arbitrary stochastic
+dissipative systems directly from observations of their microscopic
+trajectories. Our method simultaneously constructs reduced thermodynamic
+coordinates and interprets the dynamics on these coordinates. We demonstrate
+its effectiveness by studying theoretically and validating experimentally the
+stretching of long polymer chains in an externally applied field. Specifically,
+we learn three interpretable thermodynamic coordinates and build a dynamical
+landscape of polymer stretching, including the identification of stable and
+transition states and the control of the stretching rate. Our general
+methodology can be used to address a wide range of scientific and technological
+applications.
+
+
+
+ comment: Fix figure visibility issue
+
+
+
+
+
+
+ ♻ ☆ Prompt-Based Editing for Text Style Transfer EMNLP
+
+
+ Prompting approaches have been recently explored in text style transfer,
+where a textual prompt is used to query a pretrained language model to generate
+style-transferred texts word by word in an autoregressive manner. However, such
+a generation process is less controllable and early prediction errors may
+affect future word predictions. In this paper, we present a prompt-based
+editing approach for text style transfer. Specifically, we prompt a pretrained
+language model for style classification and use the classification probability
+to compute a style score. Then, we perform discrete search with word-level
+editing to maximize a comprehensive scoring function for the style-transfer
+task. In this way, we transform a prompt-based generation problem into a
+classification one, which is a training-free process and more controllable than
+the autoregressive generation of sentences. In our experiments, we performed
+both automatic and human evaluation on three style-transfer benchmark datasets,
+and show that our approach largely outperforms the state-of-the-art systems
+that have 20 times more parameters. Additional empirical analyses further
+demonstrate the effectiveness of our approach.
+
+
+ Quantitative markets are characterized by swift dynamics and abundant
+uncertainties, making the pursuit of profit-driven stock trading actions
+inherently challenging. Within this context, reinforcement learning (RL), which
+operates on a reward-centric mechanism for optimal control, has surfaced as a
+potentially effective solution to the intricate financial decision-making
+conundrums presented. This paper delves into the fusion of two established
+financial trading strategies, namely the constant proportion portfolio
+insurance (CPPI) and the time-invariant portfolio protection (TIPP), with the
+multi-agent deep deterministic policy gradient (MADDPG) framework. As a result,
+we introduce two novel multi-agent RL (MARL) methods, CPPI-MADDPG and
+TIPP-MADDPG, tailored for probing strategic trading within quantitative
+markets. To validate these innovations, we implemented them on a diverse
+selection of 100 real-market shares. Our empirical findings reveal that the
+CPPI-MADDPG and TIPP-MADDPG strategies consistently outpace their traditional
+counterparts, affirming their efficacy in the realm of quantitative trading.
+
+
+
+
+
+
+
+ ♻ ☆ Guiding Language Model Reasoning with Planning Tokens
+
+
+ Large language models (LLMs) have recently attracted considerable interest
+for their ability to perform complex reasoning tasks, such as chain-of-thought
+reasoning. However, most of the existing approaches to enhance this ability
+rely heavily on data-driven methods, while neglecting the structural aspects of
+the model's reasoning capacity. We find that while LLMs can manage individual
+reasoning steps well, they struggle with maintaining consistency across an
+entire reasoning chain. To solve this, we introduce 'planning tokens' at the
+start of each reasoning step, serving as a guide for the model. These token
+embeddings are then fine-tuned along with the rest of the model parameters. Our
+approach requires a negligible increase in trainable parameters (just 0.001%)
+and can be applied through either full fine-tuning or a more
+parameter-efficient scheme. We demonstrate our method's effectiveness by
+applying it to three different LLMs, showing notable accuracy improvements
+across three math word problem datasets w.r.t. plain chain-of-thought
+fine-tuning baselines.
+
+
+
+ comment: 10 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Towards Federated Foundation Models: Scalable Dataset Pipelines for
+ Group-Structured Learning
+
+
+ We introduce Dataset Grouper, a library to create large-scale
+group-structured (e.g., federated) datasets, enabling federated learning
+simulation at the scale of foundation models. This library facilitates the
+creation of group-structured versions of existing datasets based on
+user-specified partitions and directly leads to a variety of useful
+heterogeneous datasets that can be plugged into existing software frameworks.
+Dataset Grouper offers three key advantages. First, it scales to settings where
+even a single group's dataset is too large to fit in memory. Second, it
+provides flexibility, both in choosing the base (non-partitioned) dataset and
+in defining partitions. Finally, it is framework-agnostic. We empirically
+demonstrate that Dataset Grouper enables large-scale federated language
+modeling simulations on datasets that are orders of magnitude larger than in
+previous work, allowing for federated training of language models with hundreds
+of millions, and even billions, of parameters. Our experimental results show
+that algorithms like FedAvg operate more as meta-learning methods than as
+empirical risk minimization methods at this scale, suggesting their utility in
+downstream personalization and task-specific adaptation. Dataset Grouper is
+available at https://github.com/google-research/dataset_grouper.
+
+
+
+ comment: Dataset Grouper is available at
+ https://github.com/google-research/dataset_grouper
+
+
+
+
+
+
+ ♻ ☆ MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using
+ Differentiable Shading
+
+
+
+
+
+
+
+
+ Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amin Fadaeinejad, Rafael M. O. Cruz, Marc-Andre Carbonneau
+
+
+ Reconstructing an avatar from a portrait image has many applications in
+multimedia, but remains a challenging research problem. Extracting reflectance
+maps and geometry from one image is ill-posed: recovering geometry is a
+one-to-many mapping problem and reflectance and light are difficult to
+disentangle. Accurate geometry and reflectance can be captured under the
+controlled conditions of a light stage, but it is costly to acquire large
+datasets in this fashion. Moreover, training solely with this type of data
+leads to poor generalization with in-the-wild images. This motivates the
+introduction of MoSAR, a method for 3D avatar generation from monocular images.
+We propose a semi-supervised training scheme that improves generalization by
+learning from both light stage and in-the-wild datasets. This is achieved using
+a novel differentiable shading formulation. We show that our approach
+effectively disentangles the intrinsic face parameters, producing relightable
+avatars. As a result, MoSAR estimates a richer set of skin reflectance maps,
+and generates more realistic avatars than existing state-of-the-art methods. We
+also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public
+dataset providing intrinsic face attributes at scale (diffuse, specular,
+ambient occlusion and translucency maps) for a total of 10k subjects. The
+project website and the dataset are available on the following link:
+https://ubisoft-laforge.github.io/character/mosar/
+
+
+ Restless multi-armed bandits (RMAB) have been widely used to model sequential
+decision making problems with constraints. The decision maker (DM) aims to
+maximize the expected total reward over an infinite horizon under an
+"instantaneous activation constraint" that at most B arms can be activated at
+any decision epoch, where the state of each arm evolves stochastically
+according to a Markov decision process (MDP). However, this basic model fails
+to provide any fairness guarantee among arms. In this paper, we introduce
+RMAB-F, a new RMAB model with "long-term fairness constraints", where the
+objective now is to maximize the long term reward while a minimum long-term
+activation fraction for each arm must be satisfied. For the online RMAB-F
+setting (i.e., the underlying MDPs associated with each arm are unknown to the
+DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL.
+We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the
+reward regret and the fairness violation regret. Compared with off-the-shelf RL
+methods, our Fair-UCRL is much more computationally efficient since it contains
+a novel exploitation that leverages a low-complexity index policy for making
+decisions. Experimental results further demonstrate the effectiveness of our
+Fair-UCRL.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Two Bicomplex and One Multicomplex Least Mean Square algorithms
+
+
+
+
+
+
+
+
+ Daniel Alpay, Kamal Diki, Mihaela Vajiac
+
+
+ We study and introduce new gradient operators in the complex and bicomplex
+settings, inspired from the well-known Least Mean Square (LMS) algorithm
+invented in 1960 by Widrow and Hoff for Adaptive Linear Neuron (ADALINE).
+ These gradient operators will be used to formulate new learning rules for the
+Bicomplex Least Mean Square (BLMS) algorithms and we will also formulate these
+learning rules will for the case of multicomplex LMS algorithms (MLMS). This
+approach extends both the classical real and complex LMS algorithms.
+
+
+ Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic
+to the articulatory space. Signal-processing features like the MFCCs, have been
+widely used for the AAI task. For subjects with dysarthric speech, AAI is
+challenging because of an imprecise and indistinct pronunciation. In this work,
+we perform AAI for dysarthric speech using representations from pre-trained
+self-supervised learning (SSL) models. We demonstrate the impact of different
+pre-trained features on this challenging AAI task, at low-resource conditions.
+In addition, we also condition x-vectors to the extracted SSL features to train
+a BLSTM network. In the seen case, we experiment with three AAI training
+schemes (subject-specific, pooled, and fine-tuned). The results, consistent
+across training schemes, reveal that DeCoAR, in the fine-tuned scheme, achieves
+a relative improvement of the Pearson Correlation Coefficient (CC) by ~1.81%
+and ~4.56% for healthy controls and patients, respectively, over MFCCs. We
+observe similar average trends for different SSL features in the unseen case.
+Overall, SSL networks like wav2vec, APC, and DeCoAR, trained with feature
+reconstruction or future timestep prediction tasks, perform well in predicting
+dysarthric articulatory trajectories.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 7
+
+
+
+
+
+ ☆ VIEScore: Towards Explainable Metrics for Conditional Image Synthesis
+ Evaluation
+
+
+
+
+
+
+
+
+ Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
+
+
+ In the rapidly advancing field of conditional image generation research,
+challenges such as limited explainability lie in effectively evaluating the
+performance and capabilities of various models. This paper introduces VIESCORE,
+a Visual Instruction-guided Explainable metric for evaluating any conditional
+image generation tasks. VIESCORE leverages general knowledge from Multimodal
+Large Language Models (MLLMs) as the backbone and does not require training or
+fine-tuning. We evaluate VIESCORE on seven prominent tasks in conditional image
+tasks and found: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of
+0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
+VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
+evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
+human ratings in the generation tasks but struggles in editing tasks. With
+these results, we believe VIESCORE shows its great potential to replace human
+judges in evaluating image synthesis tasks.
+
+
+ Multimodal intent recognition aims to leverage diverse modalities such as
+expressions, body movements and tone of speech to comprehend user's intent,
+constituting a critical task for understanding human language and behavior in
+real-world multimodal scenarios. Nevertheless, the majority of existing methods
+ignore potential correlations among different modalities and own limitations in
+effectively learning semantic features from nonverbal modalities. In this
+paper, we introduce a token-level contrastive learning method with
+modality-aware prompting (TCL-MAP) to address the above challenges. To
+establish an optimal multimodal semantic environment for text modality, we
+develop a modality-aware prompting module (MAP), which effectively aligns and
+fuses features from text, video and audio modalities with similarity-based
+modality alignment and cross-modality attention mechanism. Based on the
+modality-aware prompt and ground truth labels, the proposed token-level
+contrastive learning framework (TCL) constructs augmented samples and employs
+NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal
+textual semantic insights derived from intent labels to guide the learning
+processes of other modalities in return. Extensive experiments show that our
+method achieves remarkable improvements compared to state-of-the-art methods.
+Additionally, ablation analyses demonstrate the superiority of the
+modality-aware prompt over the handcrafted prompt, which holds substantial
+significance for multimodal prompt learning. The codes are released at
+https://github.com/thuiar/TCL-MAP.
+
+
+
+ comment: Accepted by AAAI 2024 (Main Track, Long Paper)
+
+
+
+
+
+
+
+ Zhenyang Li, Fan Liu, Yinwei Wei, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli
+
+
+ Recommendation algorithms forecast user preferences by correlating user and
+item representations derived from historical interaction patterns. In pursuit
+of enhanced performance, many methods focus on learning robust and independent
+representations by disentangling the intricate factors within interaction data
+across various modalities in an unsupervised manner. However, such an approach
+obfuscates the discernment of how specific factors (e.g., category or brand)
+influence the outcomes, making it challenging to regulate their effects. In
+response to this challenge, we introduce a novel method called Attribute-Driven
+Disentangled Representation Learning (short for AD-DRL), which explicitly
+incorporates attributes from different modalities into the disentangled
+representation learning process. By assigning a specific attribute to each
+factor in multimodal features, AD-DRL can disentangle the factors at both
+attribute and attribute-value levels. To obtain robust and independent
+representations for each factor associated with a specific attribute, we first
+disentangle the representations of features both within and across different
+modalities. Moreover, we further enhance the robustness of the representations
+by fusing the multimodal features of the same factor. Empirical evaluations
+conducted on three public real-world datasets substantiate the effectiveness of
+AD-DRL, as well as its interpretability and controllability.
+
+
+
+
+
+
+
+ ☆ Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
+
+
+
+
+
+
+
+
+ Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
+
+
+ As the development of large-scale Generative AI models evolve beyond text
+(1D) generation to include image (2D) and video (3D) generation, processing
+spatial and temporal information presents unique challenges to quality,
+performance, and efficiency. We present the first work towards understanding
+this new system design space for multi-modal text-to-image (TTI) and
+text-to-video (TTV) generation models. Current model architecture designs are
+bifurcated into 2 categories: Diffusion- and Transformer-based models. Our
+systematic performance characterization on a suite of eight representative
+TTI/TTV models shows that after state-of-the-art optimization techniques such
+as Flash Attention are applied, Convolution accounts for up to 44% of execution
+time for Diffusion-based TTI models, while Linear layers consume up to 49% of
+execution time for Transformer-based models. We additionally observe that
+Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
+benefit from 1.1-2.5x greater speedup from Flash Attention than
+Transformer-based TTI models that resemble the Decode phase. Since
+optimizations designed for LLMs do not map directly onto TTI/TTV models, we
+must conduct a thorough characterization of these workloads to gain insights
+for new optimization opportunities. In doing so, we define sequence length in
+the context of TTI/TTV models and observe sequence length can vary up to 4x in
+Diffusion model inference. We additionally observe temporal aspects of TTV
+workloads pose unique system bottlenecks, with Temporal Attention accounting
+for over 60% of total Attention time. Overall, our in-depth system performance
+characterization is a critical first step towards designing efficient and
+deployable systems for emerging TTI/TTV workloads.
+
+
+
+
+
+
+
+
+ Yicheng Leng, Chaowei Fang, Gen Li, Yixiang Fang, Guanbin Li
+
+
+ Visible watermarks, while instrumental in protecting image copyrights,
+frequently distort the underlying content, complicating tasks like scene
+interpretation and image editing. Visible watermark removal aims to eliminate
+the interference of watermarks and restore the background content. However,
+existing methods often implement watermark component removal and background
+restoration tasks within a singular branch, leading to residual watermarks in
+the predictions and ignoring cases where watermarks heavily obscure the
+background. To address these limitations, this study introduces the Removing
+Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI
+embodies a two-stage approach: the initial phase centers on discerning and
+segregating the watermark component, while the subsequent phase focuses on
+background content restoration. To achieve meticulous background restoration,
+our proposed model employs a dual-path network capable of fully exploring the
+intrinsic background information beneath semi-transparent watermarks and
+peripheral contextual information from unaffected regions. Moreover, a Global
+and Local Context Interaction module is built upon multi-layer perceptrons and
+bidirectional feature transformation for comprehensive representation modeling
+in the background restoration phase. The efficacy of our approach is
+empirically validated across two large-scale datasets, and our findings reveal
+a marked enhancement over existing watermark removal techniques.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
+
+
+ Large Language Models (LLMs) have made the ambitious quest for generalist
+agents significantly far from being a fantasy. A key hurdle for building such
+general models is the diversity and heterogeneity of tasks and modalities. A
+promising solution is unification, allowing the support of a myriad of tasks
+and modalities within one unified framework. While few large models (e.g.,
+Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more
+than two modalities, current small to mid-scale unified models are still
+limited to 2 modalities, usually image-text or video-text. The question that we
+ask is: is it possible to build efficiently a unified model that can support
+all modalities? To answer this, we propose UnIVAL, a step further towards this
+ambitious goal. Without relying on fancy datasets sizes or models with billions
+of parameters, the ~ 0.25B parameter UnIVAL model goes beyond two modalities
+and unifies text, images, video, and audio into a single model. Our model is
+efficiently pretrained on many tasks, based on task balancing and multimodal
+curriculum learning. UnIVAL shows competitive performance to existing
+state-of-the-art approaches, across image and video-text tasks. The feature
+representations learned from image and video-text modalities, allows the model
+to achieve competitive performance when finetuned on audio-text tasks, despite
+not being pretrained on audio. Thanks to the unified model, we propose a novel
+study on multimodal model merging via weight interpolation of models trained on
+different multimodal tasks, showing their benefits in particular for
+out-of-distribution generalization. Finally, we motivate unification by showing
+the synergy between tasks. The model weights and code are released here:
+https://github.com/mshukor/UnIVAL.
+
+
+
+
+
+
+
+ ♻ ☆ Differentiable JPEG: The Devil is in the Details WACV 2024
+
+
+
+
+
+
+
+
+ Christoph Reich, Biplob Debnath, Deep Patel, Srimat Chakradhar
+
+
+ JPEG remains one of the most widespread lossy image coding methods. However,
+the non-differentiable nature of JPEG restricts the application in deep
+learning pipelines. Several differentiable approximations of JPEG have recently
+been proposed to address this issue. This paper conducts a comprehensive review
+of existing diff. JPEG approaches and identifies critical details that have
+been missed by previous methods. To this end, we propose a novel diff. JPEG
+approach, overcoming previous limitations. Our approach is differentiable
+w.r.t. the input image, the JPEG quality, the quantization tables, and the
+color conversion parameters. We evaluate the forward and backward performance
+of our diff. JPEG approach against existing methods. Additionally, extensive
+ablations are performed to evaluate crucial design choices. Our proposed diff.
+JPEG resembles the (non-diff.) reference implementation best, significantly
+surpassing the recent-best diff. approach by $3.47$dB (PSNR) on average. For
+strong compression rates, we can even improve PSNR by $9.51$dB. Strong
+adversarial attack results are yielded by our diff. JPEG, demonstrating the
+effective gradient approximation. Our code is available at
+https://github.com/necla-ml/Diff-JPEG.
+
+
+ We introduce EmphAssess, a prosodic benchmark designed to evaluate the
+capability of speech-to-speech models to encode and reproduce prosodic
+emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech
+translation. In both cases, the benchmark evaluates the ability of the model to
+encode emphasis in the speech input and accurately reproduce it in the output,
+potentially across a change of speaker and language. As part of the evaluation
+pipeline, we introduce EmphaClass, a new model that classifies emphasis at the
+frame or word level.
+
+
+
+
+
+
+
+ ☆ T-Eval: Evaluating the Tool Utilization Capability Step by Step
+
+
+ Large language models (LLM) have achieved remarkable performance on various
+NLP tasks and are augmented by tools for broader applications. Yet, how to
+evaluate and analyze the tool-utilization capability of LLMs is still
+under-explored. In contrast to previous works that evaluate models
+holistically, we comprehensively decompose the tool utilization into multiple
+sub-processes, including instruction following, planning, reasoning, retrieval,
+understanding, and review. Based on that, we further introduce \shortname~to
+evaluate the tool utilization capability step by step. \shortname~disentangles
+the tool utilization evaluation into several sub-domains along model
+capabilities, facilitating the inner understanding of both holistic and
+isolated competency of LLMs. We conduct extensive experiments on \shortname~and
+in-depth analysis of various LLMs. \shortname~ not only exhibits consistency
+with the outcome-oriented evaluation but also provides a more fine-grained
+analysis of the capabilities of LLMs, providing a new perspective in LLM
+evaluation on tool-utilization ability. The benchmark will be available at
+\href{https://github.com/open-compass/T-Eval}{https://github.com/open-compass/T-Eval}.
+
+
+
+
+
+
+
+ ☆ ChatGPT as a commenter to the news: can LLMs generate human-like
+ opinions?
+
+
+
+
+
+
+
+
+ Rayden Tseng, Suzan Verberne, Peter van der Putten
+
+
+ ChatGPT, GPT-3.5, and other large language models (LLMs) have drawn
+significant attention since their release, and the abilities of these models
+have been investigated for a wide variety of tasks. In this research we
+investigate to what extent GPT-3.5 can generate human-like comments on Dutch
+news articles. We define human likeness as `not distinguishable from human
+comments', approximated by the difficulty of automatic classification between
+human and GPT comments. We analyze human likeness across multiple prompting
+techniques. In particular, we utilize zero-shot, few-shot and context prompts,
+for two generated personas. We found that our fine-tuned BERT models can easily
+distinguish human-written comments from GPT-3.5 generated comments, with none
+of the used prompting methods performing noticeably better. We further analyzed
+that human comments consistently showed higher lexical diversity than
+GPT-generated comments. This indicates that although generative LLMs can
+generate fluent text, their capability to create human-like opinionated
+comments is still limited.
+
+
+
+ comment: Published as Tseng, R., Verberne, S., van der Putten, P. (2023).
+ ChatGPT as a Commenter to the News: Can LLMs Generate Human-Like Opinions?.
+ In: Ceolin, D., Caselli, T., Tulin, M. (eds) Disinformation in Open Online
+ Media. MISDOOM 2023. Lecture Notes in Computer Science, vol 14397. Springer,
+ Cham
+
+
+
+
+
+
+ ☆ Typhoon: Thai Large Language Models
+
+
+ Typhoon is a series of Thai large language models (LLMs) developed
+specifically for the Thai language. This technical report presents challenges
+and insights in developing Thai LLMs, including data preparation, pretraining,
+instruction-tuning, and evaluation. As one of the challenges of low-resource
+languages is the amount of pretraining data, we apply continual training to
+transfer existing world knowledge from a strong LLM. To evaluate the Thai
+knowledge encapsulated in each model from the pretraining stage, we develop
+ThaiExam, a benchmark based on examinations for high-school students and
+investment professionals in Thailand. In addition, we fine-tune Typhoon to
+follow Thai instructions, and we evaluate instruction-tuned models on Thai
+instruction datasets as well as translation, summarization, and
+question-answering tasks. Experimental results on a suite of Thai benchmarks
+show that Typhoon outperforms all open-source Thai language models, and its
+performance is on par with GPT-3.5 in Thai while having only 7 billion
+parameters and being 2.62 times more efficient in tokenizing Thai text.
+
+
+ This paper presents a new supervised representation learning framework,
+namely Structured Probabilistic Coding (SPC), to learn compact and informative
+representations from input related to the target task. SPC is an encoder-only
+probabilistic coding technology with a structured regularization from the
+target label space. By extracting compact and informative representations from
+input related to the target task, SPC can enhance the generalization ability of
+pre-trained language models for better language understanding. Specifically,
+the hidden representation is encoded into a Gaussian distribution space, while
+maximizing the prior entropy of latent representations concerning label space.
+This technique can simultaneously perform information encoding and task
+prediction in one module to more fully utilize the effective information from
+input data, and use variational inference in the output space to reduce
+randomness and uncertainty. To better control the probability distribution in
+the latent space, a structured regularization is proposed to promote
+class-level uniformity in the latent space. With the regularization term, SPC
+can preserve the Gaussian distribution structure of latent code as well as
+better cover the hidden space with class uniformly. We conduct evaluations on
+12 natural language understanding tasks. The results show that our SPC can
+effectively improve the performance of pre-trained language models for various
+classification and regression tasks. Experiments demonstrate that SPC can
+enhance the generalization capability, robustness to label noise, and
+clustering quality of output representations.
+
+
+
+ comment: 11 pages, accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Domain-Specific Fine-Tuning of Large Language Models for Interactive
+ Robot Programming
+
+
+
+
+
+
+
+
+ Benjamin Alt, Urs Keßner, Aleksandar Taranovic, Darko Katic, Andreas Hermann, Rainer Jäkel, Gerhard Neumann
+
+
+ Industrial robots are applied in a widening range of industries, but robot
+programming mostly remains a task limited to programming experts. We propose a
+natural language-based assistant for programming of advanced, industrial
+robotic applications and investigate strategies for domain-specific fine-tuning
+of foundation models with limited data and compute.
+
+
+
+ comment: 5 pages, 1 figure, accepted to the 2024 European Robotics Forum
+
+
+
+
+
+
+ ☆ Diversifying Knowledge Enhancement of Biomedical Language Models using
+ Adapter Modules and Knowledge Graphs
+
+
+
+
+
+
+
+
+ Juraj Vladika, Alexander Fichtl, Florian Matthes
+
+
+ Recent advances in natural language processing (NLP) owe their success to
+pre-training language models on large amounts of unstructured data. Still,
+there is an increasing effort to combine the unstructured nature of LMs with
+structured knowledge and reasoning. Particularly in the rapidly evolving field
+of biomedical NLP, knowledge-enhanced language models (KELMs) have emerged as
+promising tools to bridge the gap between large language models and
+domain-specific knowledge, considering the available biomedical knowledge
+graphs (KGs) curated by experts over the decades. In this paper, we develop an
+approach that uses lightweight adapter modules to inject structured biomedical
+knowledge into pre-trained language models (PLMs). We use two large KGs, the
+biomedical knowledge system UMLS and the novel biochemical ontology OntoChem,
+with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. The approach
+includes partitioning knowledge graphs into smaller subgraphs, fine-tuning
+adapter modules for each subgraph, and combining the knowledge in a fusion
+layer. We test the performance on three downstream tasks: document
+classification,question answering, and natural language inference. We show that
+our methodology leads to performance improvements in several instances while
+keeping requirements in computing power low. Finally, we provide a detailed
+interpretation of the results and report valuable insights for future work.
+
+
+
+ comment: Accepted as Full Paper to ICAART 2024
+
+
+
+
+
+
+ ☆ Capture the Flag: Uncovering Data Insights with Large Language Models NeurIPS 2023
+
+
+
+
+
+
+
+
+ Issam Laradji, Perouz Taslakian, Sai Rajeswar, Valentina Zantedeschi, Alexandre Lacoste, Nicolas Chapados, David Vazquez, Christopher Pal, Alexandre Drouin
+
+
+ The extraction of a small number of relevant insights from vast amounts of
+data is a crucial component of data-driven decision-making. However,
+accomplishing this task requires considerable technical skills, domain
+expertise, and human labor. This study explores the potential of using Large
+Language Models (LLMs) to automate the discovery of insights in data,
+leveraging recent advances in reasoning and code generation techniques. We
+propose a new evaluation methodology based on a "capture the flag" principle,
+measuring the ability of such models to recognize meaningful and pertinent
+information (flags) in a dataset. We further propose two proof-of-concept
+agents, with different inner workings, and compare their ability to capture
+such flags in a real-world sales dataset. While the work reported here is
+preliminary, our results are sufficiently interesting to mandate future
+exploration by the community.
+
+
+
+ comment: 14 pages, 1 figure, Foundation Models for Decision Making Workshop at
+ NeurIPS 2023
+
+
+
+
+
+
+ ☆ Evaluating Task-oriented Dialogue Systems: A Systematic Review of
+ Measures, Constructs and their Operationalisations
+
+
+ This review gives an extensive overview of evaluation methods for
+task-oriented dialogue systems, paying special attention to practical
+applications of dialogue systems, for example for customer service. The review
+(1) provides an overview of the used constructs and metrics in previous work,
+(2) discusses challenges in the context of dialogue system evaluation and (3)
+develops a research agenda for the future of dialogue system evaluation. We
+conducted a systematic review of four databases (ACL, ACM, IEEE and Web of
+Science), which after screening resulted in 122 studies. Those studies were
+carefully analysed for the constructs and methods they proposed for evaluation.
+We found a wide variety in both constructs and methods. Especially the
+operationalisation is not always clearly reported. We hope that future work
+will take a more critical approach to the operationalisation and specification
+of the used constructs. To work towards this aim, this review ends with
+recommendations for evaluation and suggestions for outstanding questions.
+
+
+ Understanding user intentions is crucial for enhancing product
+recommendations, navigation suggestions, and query reformulations. However,
+user intentions can be complex, involving multiple sessions and attribute
+requirements connected by logical operators such as And, Or, and Not. For
+example, a user may search for Nike or Adidas running shoes across various
+sessions, with a preference for the color purple. In another case, a user may
+have purchased a mattress in a previous session and is now seeking a
+corresponding bed frame without intending to buy another mattress. Prior
+research on session understanding has not sufficiently addressed how to make
+product or attribute recommendations for such complex intentions. In this
+paper, we introduce the task of logical session complex query answering, where
+sessions are treated as hyperedges of items, and we formulate the problem of
+complex intention understanding as a task of logical session complex queries
+answering (LS-CQA) on an aggregated hypergraph of sessions, items, and
+attributes. The proposed task is a special type of complex query answering task
+with sessions as ordered hyperedges. We also propose a new model, the Logical
+Session Graph Transformer (LSGT), which captures interactions among items
+across different sessions and their logical connections using a transformer
+structure. We analyze the expressiveness of LSGT and prove the permutation
+invariance of the inputs for the logical operators. We evaluate LSGT on three
+datasets and demonstrate that it achieves state-of-the-art results.
+
+
+
+
+
+
+
+ ☆ Team Flow at DRC2023: Building Common Ground and Text-based Turn-taking
+ in a Travel Agent Spoken Dialogue System
+
+
+ At the Dialogue Robot Competition 2023 (DRC2023), which was held to improve
+the capability of dialogue robots, our team developed a system that could build
+common ground and take more natural turns based on user utterance texts. Our
+system generated queries for sightseeing spot searches using the common ground
+and engaged in dialogue while waiting for user comprehension.
+
+
+
+ comment: This paper is part of the proceedings of the Dialogue Robot
+ Competition 2023
+
+
+
+
+
+
+ ☆ On Task Performance and Model Calibration with Supervised and
+ Self-Ensembled In-Context Learning
+
+
+
+
+
+
+
+
+ Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, Ivan Vulić
+
+
+ Following the standard supervised fine-tuning (SFT) paradigm, in-context
+learning (ICL) has become an efficient approach propelled by the recent
+advancements in large language models (LLMs), yielding promising performance
+across various tasks in few-shot data setups. However, both paradigms are prone
+to suffer from the critical problem of overconfidence (i.e., miscalibration),
+especially in such limited data setups. In this work, we deliver an in-depth
+analysis of the behavior across different choices of learning methods from the
+perspective of both performance and calibration, as well as their interplay.
+Through extensive controlled experiments, we find that simultaneous gains for
+both task performance and calibration are difficult to achieve, and the problem
+of miscalibration exists across all learning methods in low-resource
+scenarios.To address this challenging trade-off between performance and
+calibration, we then investigate the potential of self-ensembling techniques
+applied at different modeling stages (e.g., variations of in-context examples
+or variations in prompts or different ensembling strategies). We justify the
+feasibility of self-ensembling on SFT in addition to ICL, to make the
+predictions more calibrated and have comparable or even better performance. Our
+work sheds light on which learning paradigm to choose and how to enhance both
+task performance and calibration of LLMs.
+
+
+
+ comment: 9 pages, 4 figures, 5 tables (20 pages, 5 figures, 13 tables
+ including references and appendices)
+
+ Existing PTLM-based models for TSC can be categorized into two groups: 1)
+fine-tuning-based models that adopt PTLM as the context encoder; 2)
+prompting-based models that transfer the classification task to the text/word
+generation task. In this paper, we present a new perspective of leveraging PTLM
+for TSC: simultaneously leveraging the merits of both language modeling and
+explicit target-context interactions via contextual target attributes.
+Specifically, we design the domain- and target-constrained cloze test, which
+can leverage the PTLMs' strong language modeling ability to generate the given
+target's attributes pertaining to the review context. The attributes contain
+the background and property information of the target, which can help to enrich
+the semantics of the review context and the target. To exploit the attributes
+for tackling TSC, we first construct a heterogeneous information graph by
+treating the attributes as nodes and combining them with (1) the syntax graph
+automatically produced by the off-the-shelf dependency parser and (2) the
+semantics graph of the review context, which is derived from the self-attention
+mechanism. Then we propose a heterogeneous information gated graph
+convolutional network to model the interactions among the attribute
+information, the syntactic information, and the contextual information. The
+experimental results on three benchmark datasets demonstrate the superiority of
+our model, which achieves new state-of-the-art performance.
+
+
+
+ comment: Accepted by Journal of Artificial Intelligence Research (JAIR)
+
+
+
+
+
+
+ ☆ A Semantic Space is Worth 256 Language Descriptions: Make Stronger
+ Segmentation Models with Descriptive Properties
+
+
+ This paper introduces ProLab, a novel approach using property-level label
+space for creating strong interpretable segmentation models. Instead of relying
+solely on category-specific annotations, ProLab uses descriptive properties
+grounded in common sense knowledge for supervising segmentation models. It is
+based on two core designs. First, we employ Large Language Models (LLMs) and
+carefully crafted prompts to generate descriptions of all involved categories
+that carry meaningful common sense knowledge and follow a structured format.
+Second, we introduce a description embedding model preserving semantic
+correlation across descriptions and then cluster them into a set of descriptive
+properties (e.g., 256) using K-Means. These properties are based on
+interpretable common sense knowledge consistent with theories of human
+recognition. We empirically show that our approach makes segmentation models
+perform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal
+Context, Cityscapes, and BDD). Our method also shows better scalability with
+extended training steps than category-level supervision. Our interpretable
+segmentation framework also emerges with the generalization ability to segment
+out-of-domain or unknown categories using only in-domain descriptive
+properties. Code is available at https://github.com/lambert-x/ProLab.
+
+
+
+ comment: Preprint. Code is available at https://github.com/lambert-x/ProLab
+
+
+
+
+
+
+ ☆ Data Transformation to Construct a Dataset for Generating
+ Entity-Relationship Model from Natural Language
+
+
+ In order to reduce the manual cost of designing ER models, recent approaches
+have been proposed to address the task of NL2ERM, i.e., automatically
+generating entity-relationship (ER) models from natural language (NL)
+utterances such as software requirements. These approaches are typically
+rule-based ones, which rely on rigid heuristic rules; these approaches cannot
+generalize well to various linguistic ways of describing the same requirement.
+Despite having better generalization capability than rule-based approaches,
+deep-learning-based models are lacking for NL2ERM due to lacking a large-scale
+dataset. To address this issue, in this paper, we report our insight that there
+exists a high similarity between the task of NL2ERM and the increasingly
+popular task of text-to-SQL, and propose a data transformation algorithm that
+transforms the existing data of text-to-SQL into the data of NL2ERM. We apply
+our data transformation algorithm on Spider, one of the most popular
+text-to-SQL datasets, and we also collect some data entries with different NL
+types, to obtain a large-scale NL2ERM dataset. Because NL2ERM can be seen as a
+special information extraction (IE) task, we train two state-of-the-art IE
+models on our dataset. The experimental results show that both the two models
+achieve high performance and outperform existing baselines.
+
+
+
+
+
+
+
+ ☆ Text2Analysis: A Benchmark of Table Question Answering with Advanced
+ Data Analysis and Unclear Queries AAAI'2024
+
+
+
+
+
+
+
+
+ Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen, Shi Han, Zejian Yuan, Dongmei Zhang
+
+
+ Tabular data analysis is crucial in various fields, and large language models
+show promise in this area. However, current research mostly focuses on
+rudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like
+forecasting and chart generation. To address this gap, we developed the
+Text2Analysis benchmark, incorporating advanced analysis tasks that go beyond
+the SQL-compatible operations and require more in-depth analysis. We also
+develop five innovative and effective annotation methods, harnessing the
+capabilities of large language models to enhance data quality and quantity.
+Additionally, we include unclear queries that resemble real-world user
+questions to test how well models can understand and tackle such challenges.
+Finally, we collect 2249 query-result pairs with 347 tables. We evaluate five
+state-of-the-art models using three different metrics and the results show that
+our benchmark presents introduces considerable challenge in the field of
+tabular data analysis, paving the way for more advanced research opportunities.
+
+
+
+ comment: Accepted by AAAI'2024
+
+
+
+
+
+
+ ☆ Compositional Zero-Shot Learning for Attribute-Based Object Reference in
+ Human-Robot Interaction
+
+
+
+
+
+
+
+
+ Peng Gao, Ahmed Jaafar, Brian Reily, Christopher Reardon, Hao Zhang
+
+
+ Language-enabled robots have been widely studied over the past years to
+enable natural human-robot interaction and teaming in various real-world
+applications. Language-enabled robots must be able to comprehend referring
+expressions to identify a particular object from visual perception using a set
+of referring attributes extracted from natural language. However, visual
+observations of an object may not be available when it is referred to, and the
+number of objects and attributes may also be unbounded in open worlds. To
+address the challenges, we implement an attribute-based compositional zero-shot
+learning method that uses a list of attributes to perform referring expression
+comprehension in open worlds. We evaluate the approach on two datasets
+including the MIT-States and the Clothing 16K. The preliminary experimental
+results show that our implemented approach allows a robot to correctly identify
+the objects referred to by human commands.
+
+
+
+ comment: Equal contribution from the first two authors
+
+
+
+
+
+
+ ☆ Structure-Aware Path Inference for Neural Finite State Transducers NeurIPS 2023
+
+
+ Neural finite-state transducers (NFSTs) form an expressive family of
+neurosymbolic sequence transduction models. An NFST models each string pair as
+having been generated by a latent path in a finite-state transducer. As they
+are deep generative models, both training and inference of NFSTs require
+inference networks that approximate posterior distributions over such latent
+variables. In this paper, we focus on the resulting challenge of imputing the
+latent alignment path that explains a given pair of input and output strings
+(e.g., during training). We train three autoregressive approximate models for
+amortized inference of the path, which can then be used as proposal
+distributions for importance sampling. All three models perform lookahead. Our
+most sophisticated (and novel) model leverages the FST structure to consider
+the graph of future paths; unfortunately, we find that it loses out to the
+simpler approaches -- except on an artificial task that we concocted to confuse
+the simpler approaches.
+
+
+
+ comment: In Proceedings of ICBINB Workshop at NeurIPS 2023
+
+
+
+
+
+
+ ☆ Argue with Me Tersely: Towards Sentence-Level Counter-Argument
+ Generation EMNLP2023
+
+
+ Counter-argument generation -- a captivating area in computational
+linguistics -- seeks to craft statements that offer opposing views. While most
+research has ventured into paragraph-level generation, sentence-level
+counter-argument generation beckons with its unique constraints and
+brevity-focused challenges. Furthermore, the diverse nature of
+counter-arguments poses challenges for evaluating model performance solely
+based on n-gram-based metrics. In this paper, we present the ArgTersely
+benchmark for sentence-level counter-argument generation, drawing from a
+manually annotated dataset from the ChangeMyView debate forum. We also propose
+Arg-LlaMA for generating high-quality counter-argument. For better evaluation,
+we trained a BERT-based evaluator Arg-Judge with human preference data. We
+conducted comparative experiments involving various baselines such as LlaMA,
+Alpaca, GPT-3, and others. The results show the competitiveness of our proposed
+framework and evaluator in counter-argument generation tasks. Code and data are
+available at https://github.com/amazingljy1206/ArgTersely.
+
+
+
+ comment: EMNLP2023, main conference
+
+
+
+
+
+
+ ☆ Towards More Faithful Natural Language Explanation Using Multi-Level
+ Contrastive Learning in VQA AAAI 2024
+
+
+ Natural language explanation in visual question answer (VQA-NLE) aims to
+explain the decision-making process of models by generating natural language
+sentences to increase users' trust in the black-box systems. Existing post-hoc
+methods have achieved significant progress in obtaining a plausible
+explanation. However, such post-hoc explanations are not always aligned with
+human logical inference, suffering from the issues on: 1) Deductive
+unsatisfiability, the generated explanations do not logically lead to the
+answer; 2) Factual inconsistency, the model falsifies its counterfactual
+explanation for answers without considering the facts in images; and 3)
+Semantic perturbation insensitivity, the model can not recognize the semantic
+changes caused by small perturbations. These problems reduce the faithfulness
+of explanations generated by models. To address the above issues, we propose a
+novel self-supervised \textbf{M}ulti-level \textbf{C}ontrastive
+\textbf{L}earning based natural language \textbf{E}xplanation model (MCLE) for
+VQA with semantic-level, image-level, and instance-level factual and
+counterfactual samples. MCLE extracts discriminative features and aligns the
+feature spaces from explanations with visual question and answer to generate
+more consistent explanations. We conduct extensive experiments, ablation
+analysis, and case study to demonstrate the effectiveness of our method on two
+VQA-NLE benchmarks.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ Speech Translation with Large Language Models: An Industrial Practice
+
+
+
+
+
+
+
+
+ Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li
+
+
+ Given the great success of large language models (LLMs) across various tasks,
+in this paper, we introduce LLM-ST, a novel and effective speech translation
+model constructed upon a pre-trained LLM. By integrating the large language
+model (LLM) with a speech encoder and employing multi-task instruction tuning,
+LLM-ST can produce accurate timestamped transcriptions and translations, even
+from long audio inputs. Furthermore, our findings indicate that the
+implementation of Chain-of-Thought (CoT) prompting can yield advantages in the
+context of LLM-ST. Through rigorous experimentation on English and Chinese
+datasets, we showcase the exceptional performance of LLM-ST, establishing a new
+benchmark in the field of speech translation. Demo:
+https://speechtranslation.github.io/llm-st/.
+
+
+
+
+
+
+
+ ☆ The Truth is in There: Improving Reasoning in Language Models with
+ Layer-Selective Rank Reduction
+
+
+
+
+
+
+
+
+ Pratyusha Sharma, Jordan T. Ash, Dipendra Misra
+
+
+ Transformer-based Large Language Models (LLMs) have become a fixture in
+modern machine learning. Correspondingly, significant resources are allocated
+towards research that aims to further advance this technology, typically
+resulting in models of increasing size that are trained on increasing amounts
+of data. This work, however, demonstrates the surprising result that it is
+often possible to significantly improve the performance of LLMs by selectively
+removing higher-order components of their weight matrices. This simple
+intervention, which we call LAyer-SElective Rank reduction (LASER), can be done
+on a model after training has completed, and requires no additional parameters
+or data. We show extensive experiments demonstrating the generality of this
+finding across language models and datasets, and provide in-depth analyses
+offering insights into both when LASER is effective and the mechanism by which
+it operates.
+
+
+
+
+
+
+
+ ☆ How to Prune Your Language Model: Recovering Accuracy on the "Sparsity
+ May Cry'' Benchmark
+
+
+
+
+
+
+
+
+ Eldar Kurtic, Torsten Hoefler, Dan Alistarh
+
+
+ Pruning large language models (LLMs) from the BERT family has emerged as a
+standard compression benchmark, and several pruning methods have been proposed
+for this task. The recent ``Sparsity May Cry'' (SMC) benchmark put into
+question the validity of all existing methods, exhibiting a more complex setup
+where many known pruning methods appear to fail. We revisit the question of
+accurate BERT-pruning during fine-tuning on downstream datasets, and propose a
+set of general guidelines for successful pruning, even on the challenging SMC
+benchmark. First, we perform a cost-vs-benefits analysis of pruning model
+components, such as the embeddings and the classification head; second, we
+provide a simple-yet-general way of scaling training, sparsification and
+learning rate schedules relative to the desired target sparsity; finally, we
+investigate the importance of proper parametrization for Knowledge Distillation
+in the context of LLMs. Our simple insights lead to state-of-the-art results,
+both on classic BERT-pruning benchmarks, as well as on the SMC benchmark,
+showing that even classic gradual magnitude pruning (GMP) can yield competitive
+results, with the right approach.
+
+
+
+ comment: Accepted as oral to CPAL 2024
+
+
+
+
+
+
+ ☆ Developing Interactive Tourism Planning: A Dialogue Robot System Powered
+ by a Large Language Mode
+
+
+ In recent years, large language models (LLMs) have rapidly proliferated and
+have been utilized in various tasks, including research in dialogue systems. We
+aimed to construct a system that not only leverages the flexible conversational
+abilities of LLMs but also their advanced planning capabilities to reduce the
+speaking load on human interlocutors and efficiently plan trips. Furthermore,
+we propose a method that divides the complex task of a travel agency into
+multiple subtasks, managing each as a separate phase to effectively accomplish
+the task. Our proposed system confirmed a certain level of success by achieving
+fourth place in the Dialogue Robot Competition 2023 preliminaries rounds. We
+report on the challenges identified through the competition.
+
+
+
+ comment: This paper is part of the proceedings of the Dialogue Robot
+ Competition 2023
+
+ Computerised clinical coding approaches aim to automate the process of
+assigning a set of codes to medical records. While there is active research
+pushing the state of the art on clinical coding for hospitalized patients, the
+outpatient setting -- where doctors tend to non-hospitalised patients -- is
+overlooked. Although both settings can be formalised as a multi-label
+classification task, they present unique and distinct challenges, which raises
+the question of whether the success of inpatient clinical coding approaches
+translates to the outpatient setting. This paper is the first to investigate
+how well state-of-the-art deep learning-based clinical coding approaches work
+in the outpatient setting at hospital scale. To this end, we collect a large
+outpatient dataset comprising over 7 million notes documenting over half a
+million patients. We adapt four state-of-the-art clinical coding approaches to
+this setting and evaluate their potential to assist coders. We find evidence
+that clinical coding in outpatient settings can benefit from more innovations
+in popular inpatient coding benchmarks. A deeper analysis of the factors
+contributing to the success -- amount and form of data and choice of document
+representation -- reveals the presence of easy-to-solve examples, the coding of
+which can be completely automated with a low error rate.
+
+
+
+ comment: 9 pages, preprint under review
+
+
+
+
+
+
+ ☆ Decoupling Representation and Knowledge for Few-Shot Intent
+ Classification and Slot Filling
+
+
+
+
+
+
+
+
+ Jie Han, Yixiong Zou, Haozhao Wang, Jun Wang, Wei Liu, Yao Wu, Tao Zhang, Ruixuan Li
+
+
+ Few-shot intent classification and slot filling are important but challenging
+tasks due to the scarcity of finely labeled data. Therefore, current works
+first train a model on source domains with sufficiently labeled data, and then
+transfer the model to target domains where only rarely labeled data is
+available. However, experience transferring as a whole usually suffers from
+gaps that exist among source domains and target domains. For instance,
+transferring domain-specific-knowledge-related experience is difficult. To
+tackle this problem, we propose a new method that explicitly decouples the
+transferring of general-semantic-representation-related experience and the
+domain-specific-knowledge-related experience. Specifically, for
+domain-specific-knowledge-related experience, we design two modules to capture
+intent-slot relation and slot-slot relation respectively. Extensive experiments
+on Snips and FewJoint datasets show that our method achieves state-of-the-art
+performance. The method improves the joint accuracy metric from 27.72% to
+42.20% in the 1-shot setting, and from 46.54% to 60.79% in the 5-shot setting.
+
+
+ Query-focused summarization (QFS) aims to provide a summary of a single
+document/multi documents that can satisfy the information needs of a given
+query. It is useful for various real-world applications, such as abstractive
+snippet generation or more recent retrieval augmented generation (RAG). A
+prototypical QFS pipeline consists of a retriever (sparse or dense retrieval)
+and a generator (usually a large language model). However, applying large
+language models (LLM) potentially leads to hallucinations, especially when the
+evidence contradicts the prior belief of LLMs. There has been growing interest
+in developing new decoding methods to improve generation quality and reduce
+hallucination. In this work, we conduct a large-scale reproducibility on one
+recently proposed decoding method -- Context-aware Decoding (CAD). In addition
+to replicating CAD's experiments on news summarization datasets, we include
+experiments on QFS datasets, and conduct more rigorous analysis on
+computational complexity and hyperparameter sensitivity. Experiments with eight
+different language models show that performance-wise, CAD improves QFS quality
+by (1) reducing factuality errors/hallucinations while (2) mostly retaining the
+match of lexical patterns, measured by ROUGE scores, while also at a cost of
+increased inference-time FLOPs and reduced decoding speed. The code
+implementation based on Huggingface Library is made available
+https://github.com/zhichaoxu-shufe/context-aware-decoding-qfs
+
+
+
+ comment: technical report
+
+
+
+
+
+
+ ☆ Parameter Efficient Tuning Allows Scalable Personalization of LLMs for
+ Text Entry: A Case Study on Abbreviation Expansion
+
+
+ Abbreviation expansion is a strategy used to speed up communication by
+limiting the amount of typing and using a language model to suggest expansions.
+Here we look at personalizing a Large Language Model's (LLM) suggestions based
+on prior conversations to enhance the relevance of predictions, particularly
+when the user data is small (~1000 samples). Specifically, we compare
+fine-tuning, prompt-tuning, and retrieval augmented generation of expanded text
+suggestions for abbreviated inputs. Our case study with a deployed 8B parameter
+LLM on a real user living with ALS, and experiments on movie character
+personalization indicates that (1) customization may be necessary in some
+scenarios and prompt-tuning generalizes well to those, (2) fine-tuning on
+in-domain data (with as few as 600 samples) still shows some gains, however (3)
+retrieval augmented few-shot selection also outperforms fine-tuning. (4)
+Parameter efficient tuning allows for efficient and scalable personalization.
+For prompt-tuning, we also find that initializing the learned "soft-prompts" to
+user relevant concept tokens leads to higher accuracy than random
+initialization.
+
+
+
+
+
+
+
+ ☆ Exploiting Novel GPT-4 APIs
+
+
+
+
+
+
+
+
+ Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
+
+
+ Language model attacks typically assume one of two extreme threat models:
+full white-box access to model weights, or black-box access limited to a text
+generation API. However, real-world APIs are often more flexible than just text
+generation: these APIs expose ``gray-box'' access leading to new threat
+vectors. To explore this, we red-team three new functionalities exposed in the
+GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that
+fine-tuning a model on as few as 15 harmful examples or 100 benign examples can
+remove core safeguards from GPT-4, enabling a range of harmful outputs.
+Furthermore, we find that GPT-4 Assistants readily divulge the function call
+schema and can be made to execute arbitrary function calls. Finally, we find
+that knowledge retrieval can be hijacked by injecting instructions into
+retrieval documents. These vulnerabilities highlight that any additions to the
+functionality exposed by an API can create new vulnerabilities.
+
+
+
+ comment: 10 pages, 1 figure, 4 tables
+
+
+
+
+
+
+ ☆ Characterizing and Classifying Developer Forum Posts with their
+ Intentions
+
+
+
+
+
+
+
+
+ Xingfang Wu, Eric Laufer, Heng Li, Foutse Khomh, Santhosh Srinivasan, Jayden Luo
+
+
+ With the rapid growth of the developer community, the amount of posts on
+online technical forums has been growing rapidly, which poses difficulties for
+users to filter useful posts and find important information. Tags provide a
+concise feature dimension for users to locate their interested posts and for
+search engines to index the most relevant posts according to the queries.
+However, most tags are only focused on the technical perspective (e.g., program
+language, platform, tool). In most cases, forum posts in online developer
+communities reveal the author's intentions to solve a problem, ask for advice,
+share information, etc. The modeling of the intentions of posts can provide an
+extra dimension to the current tag taxonomy. By referencing previous studies
+and learning from industrial perspectives, we create a refined taxonomy for the
+intentions of technical forum posts. Through manual labeling and analysis on a
+sampled post dataset extracted from online forums, we understand the relevance
+between the constitution of posts (code, error messages) and their intentions.
+Furthermore, inspired by our manual study, we design a pre-trained
+transformer-based model to automatically predict post intentions. The best
+variant of our intention prediction framework, which achieves a Micro F1-score
+of 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787,
+outperforms the state-of-the-art baseline approach. Our characterization and
+automated classification of forum posts regarding their intentions may help
+forum maintainers or third-party tool developers improve the organization and
+retrieval of posts on technical forums. We have released our annotated dataset
+and codes in our supplementary material package.
+
+
+
+ comment: 39 pages
+
+
+
+
+
+
+ ☆ Deep de Finetti: Recovering Topic Distributions from Large Language
+ Models
+
+
+
+
+
+
+
+
+ Liyi Zhang, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths
+
+
+ Large language models (LLMs) can produce long, coherent passages of text,
+suggesting that LLMs, although trained on next-word prediction, must represent
+the latent structure that characterizes a document. Prior work has found that
+internal representations of LLMs encode one aspect of latent structure, namely
+syntax; here we investigate a complementary aspect, namely the document's topic
+structure. We motivate the hypothesis that LLMs capture topic structure by
+connecting LLM optimization to implicit Bayesian inference. De Finetti's
+theorem shows that exchangeable probability distributions can be represented as
+a mixture with respect to a latent generating distribution. Although text is
+not exchangeable at the level of syntax, exchangeability is a reasonable
+starting assumption for topic structure. We thus hypothesize that predicting
+the next token in text will lead LLMs to recover latent topic distributions. We
+examine this hypothesis using Latent Dirichlet Allocation (LDA), an
+exchangeable probabilistic topic model, as a target, and we show that the
+representations formed by LLMs encode both the topics used to generate
+synthetic data and those used to explain natural corpus data.
+
+
+
+ comment: 13 pages, 4 figures
+
+
+
+
+
+
+ ☆ SimLM: Can Language Models Infer Parameters of Physical Systems?
+
+
+ Recent developments in large-scale machine learning models for
+general-purpose understanding, translation and generation of language are
+driving impact across a variety of sectors including medicine, robotics, and
+scientific discovery. The strength of such Large Language Models (LLMs) stems
+from the large corpora that they are trained with. While this imbues them with
+a breadth of capabilities, they have been found unsuitable for some specific
+types of problems such as advanced mathematics. In this paper, we highlight the
+inability of LLMs to reason about physics tasks. We demonstrate that their
+ability to infer parameters of physical systems can be improved, without
+retraining, by augmenting their context with feedback from physical simulation.
+
+
+
+
+
+
+
+ ☆ Experimenting with Large Language Models and vector embeddings in NASA
+ SciX
+
+
+
+
+
+
+
+
+ Sergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi, Michael J. Kurtz, Edwin A. Henneken, Kelly E. Lockhart, Felix Grezes, Thomas Allen, Golnaz Shapurian, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Daniel Chivvis, Fernanda de Macedo Alves, Jean-Claude Paquin, Jennifer Bartlett, Mugdha Polimera, Stephanie Jarmak
+
+
+ Open-source Large Language Models enable projects such as NASA SciX (i.e.,
+NASA ADS) to think out of the box and try alternative approaches for
+information retrieval and data augmentation, while respecting data copyright
+and users' privacy. However, when large language models are directly prompted
+with questions without any context, they are prone to hallucination. At NASA
+SciX we have developed an experiment where we created semantic vectors for our
+large collection of abstracts and full-text content, and we designed a prompt
+system to ask questions using contextual chunks from our system. Based on a
+non-systematic human evaluation, the experiment shows a lower degree of
+hallucination and better responses when using Retrieval Augmented Generation.
+Further exploration is required to design new features and data augmentation
+processes at NASA SciX that leverages this technology while respecting the high
+level of trust and quality that the project holds.
+
+
+
+ comment: To appear in the proceedings of the 33th annual international
+ Astronomical Data Analysis Software & Systems (ADASS XXXIII)
+
+
+
+
+
+
+ ☆ Shai: A large language model for asset management
+
+
+ This paper introduces "Shai" a 10B level large language model specifically
+designed for the asset management industry, built upon an open-source
+foundational model. With continuous pre-training and fine-tuning using a
+targeted corpus, Shai demonstrates enhanced performance in tasks relevant to
+its domain, outperforming baseline models. Our research includes the
+development of an innovative evaluation framework, which integrates
+professional qualification exams, tailored tasks, open-ended question
+answering, and safety assessments, to comprehensively assess Shai's
+capabilities. Furthermore, we discuss the challenges and implications of
+utilizing large language models like GPT-4 for performance assessment in asset
+management, suggesting a combination of automated evaluation and human
+judgment. Shai's development, showcasing the potential and versatility of
+10B-level large language models in the financial sector with significant
+performance and modest computational requirements, hopes to provide practical
+insights and methodologies to assist industry peers in their similar endeavors.
+
+
+
+
+
+
+
+ ☆ Illuminating the Black Box: A Psychometric Investigation into the
+ Multifaceted Nature of Large Language Models
+
+
+
+
+
+
+
+
+ Yang Lu, Jordan Yu, Shou-Hsuan Stephen Huang
+
+
+ This study explores the idea of AI Personality or AInality suggesting that
+Large Language Models (LLMs) exhibit patterns similar to human personalities.
+Assuming that LLMs share these patterns with humans, we investigate using
+human-centered psychometric tests such as the Myers-Briggs Type Indicator
+(MBTI), Big Five Inventory (BFI), and Short Dark Triad (SD3) to identify and
+confirm LLM personality types. By introducing role-play prompts, we demonstrate
+the adaptability of LLMs, showing their ability to switch dynamically between
+different personality types. Using projective tests, such as the Washington
+University Sentence Completion Test (WUSCT), we uncover hidden aspects of LLM
+personalities that are not easily accessible through direct questioning.
+Projective tests allowed for a deep exploration of LLMs cognitive processes and
+thought patterns and gave us a multidimensional view of AInality. Our machine
+learning analysis revealed that LLMs exhibit distinct AInality traits and
+manifest diverse personality types, demonstrating dynamic shifts in response to
+external instructions. This study pioneers the application of projective tests
+on LLMs, shedding light on their diverse and adaptable AInality traits.
+
+
+
+
+
+
+
+ ☆ Benchmarking and Defending Against Indirect Prompt Injection Attacks on
+ Large Language Models
+
+
+ Recent remarkable advancements in large language models (LLMs) have led to
+their widespread adoption in various applications. A key feature of these
+applications is the combination of LLMs with external content, where user
+instructions and third-party content are combined to create prompts for LLM
+processing. These applications, however, are vulnerable to indirect prompt
+injection attacks, where malicious instructions embedded within external
+content compromise LLM's output, causing their responses to deviate from user
+expectations. Despite the discovery of this security issue, no comprehensive
+analysis of indirect prompt injection attacks on different LLMs is available
+due to the lack of a benchmark. Furthermore, no effective defense has been
+proposed.
+ In this work, we introduce the first benchmark, BIPIA, to measure the
+robustness of various LLMs and defenses against indirect prompt injection
+attacks. Our experiments reveal that LLMs with greater capabilities exhibit
+more vulnerable to indirect prompt injection attacks for text tasks, resulting
+in a higher ASR. We hypothesize that indirect prompt injection attacks are
+mainly due to the LLMs' inability to distinguish between instructions and
+external content. Based on this conjecture, we propose four black-box methods
+based on prompt learning and a white-box defense methods based on fine-tuning
+with adversarial training to enable LLMs to distinguish between instructions
+and external content and ignore instructions in the external content. Our
+experimental results show that our black-box defense methods can effectively
+reduce ASR but cannot completely thwart indirect prompt injection attacks,
+while our white-box defense method can reduce ASR to nearly zero with little
+adverse impact on the LLM's performance on general tasks. We hope that our
+benchmark and defenses can inspire future work in this important area.
+
+
+
+
+
+
+
+ ♻ ☆ Cascade Speculative Drafting for Even Faster LLM Inference
+
+
+ Speculative decoding enhances the efficiency of large language models (LLMs)
+by leveraging a draft model to draft for a larger target model to review.
+However, drafting in speculative decoding involves slow autoregressive
+generation and generating tokens of different importance with the same time
+allocation. These two inefficiencies lead to its suboptimal performance. To
+address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a
+novel approach that employs two types of cascades. The Vertical Cascade
+eliminates autoregressive generation from neural models. The Horizontal Cascade
+constitutes efficient time allocation in drafting with its optimality supported
+by our theoretical analysis. Combining both cascades, our CS. Drafting
+algorithm has achieved up to 72 percent additional speedup over speculative
+decoding in our experiments while keeping the same output distribution.
+
+
+ Predicting turn-taking in multiparty conversations has many practical
+applications in human-computer/robot interaction. However, the complexity of
+human communication makes it a challenging task. Recent advances have shown
+that synchronous multi-perspective egocentric data can significantly improve
+turn-taking prediction compared to asynchronous, single-perspective
+transcriptions. Building on this research, we propose a new multimodal
+transformer-based architecture for predicting turn-taking in embodied,
+synchronized multi-perspective data. Our experimental results on the recently
+introduced EgoCom dataset show a substantial performance improvement of up to
+14.01% on average compared to existing baselines and alternative
+transformer-based approaches. The source code, and the pre-trained models of
+our 3M-Transformer will be available upon acceptance.
+
+
+
+ comment: Accepted to ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Prot2Text: Multimodal Protein's Function Generation with GNNs and
+ Transformers
+
+
+ The complex nature of big biological systems pushed some scientists to
+classify its understanding under the inconceivable missions. Different leveled
+challenges complicated this task, one of is the prediction of a protein's
+function. In recent years, significant progress has been made in this field
+through the development of various machine learning approaches. However, most
+existing methods formulate the task as a multi-classification problem, i.e
+assigning predefined labels to proteins. In this work, we propose a novel
+approach, \textbf{Prot2Text}, which predicts a protein function's in a free
+text style, moving beyond the conventional binary or categorical
+classifications. By combining Graph Neural Networks(GNNs) and Large Language
+Models(LLMs), in an encoder-decoder framework, our model effectively integrates
+diverse data types including proteins' sequences, structures, and textual
+annotations. This multimodal approach allows for a holistic representation of
+proteins' functions, enabling the generation of detailed and accurate
+descriptions. To evaluate our model, we extracted a multimodal protein dataset
+from SwissProt, and demonstrate empirically the effectiveness of Prot2Text.
+These results highlight the transformative impact of multimodal models,
+specifically the fusion of GNNs and LLMs, empowering researchers with powerful
+tools for more accurate prediction of proteins' functions. The code, the models
+and a demo will be publicly released.
+
+
+
+
+
+
+
+ ♻ ☆ DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4
+
+
+ The digitization of healthcare has facilitated the sharing and re-using of
+medical data but has also raised concerns about confidentiality and privacy.
+HIPAA (Health Insurance Portability and Accountability Act) mandates removing
+re-identifying information before the dissemination of medical records. Thus,
+effective and efficient solutions for de-identifying medical data, especially
+those in free-text forms, are highly needed. While various computer-assisted
+de-identification methods, including both rule-based and learning-based, have
+been developed and used in prior practice, such solutions still lack
+generalizability or need to be fine-tuned according to different scenarios,
+significantly imposing restrictions in wider use. The advancement of large
+language models (LLM), such as ChatGPT and GPT-4, have shown great potential in
+processing text data in the medical domain with zero-shot in-context learning,
+especially in the task of privacy protection, as these models can identify
+confidential information by their powerful named entity recognition (NER)
+capability. In this work, we developed a novel GPT4-enabled de-identification
+framework (``DeID-GPT") to automatically identify and remove the identifying
+information. Compared to existing commonly used medical text data
+de-identification methods, our developed DeID-GPT showed the highest accuracy
+and remarkable reliability in masking private information from the unstructured
+medical text while preserving the original structure and meaning of the text.
+This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text
+data processing and de-identification, which provides insights for further
+research and solution development on the use of LLMs such as ChatGPT/GPT-4 in
+healthcare. Codes and benchmarking data information are available at
+https://github.com/yhydhx/ChatGPT-API.
+
+
+
+
+
+
+
+ ♻ ☆ Are ChatGPT and GPT-4 Good Poker Players? -- A Pre-Flop Analysis
+
+
+ Since the introduction of ChatGPT and GPT-4, these models have been tested
+across a large number of tasks. Their adeptness across domains is evident, but
+their aptitude in playing games, and specifically their aptitude in the realm
+of poker has remained unexplored. Poker is a game that requires decision making
+under uncertainty and incomplete information. In this paper, we put ChatGPT and
+GPT-4 through the poker test and evaluate their poker skills. Our findings
+reveal that while both models display an advanced understanding of poker,
+encompassing concepts like the valuation of starting hands, playing positions
+and other intricacies of game theory optimal (GTO) poker, both ChatGPT and
+GPT-4 are NOT game theory optimal poker players.
+ Profitable strategies in poker are evaluated in expectations over large
+samples. Through a series of experiments, we first discover the characteristics
+of optimal prompts and model parameters for playing poker with these models.
+Our observations then unveil the distinct playing personas of the two models.
+We first conclude that GPT-4 is a more advanced poker player than ChatGPT. This
+exploration then sheds light on the divergent poker tactics of the two models:
+ChatGPT's conservativeness juxtaposed against GPT-4's aggression. In poker
+vernacular, when tasked to play GTO poker, ChatGPT plays like a nit, which
+means that it has a propensity to only engage with premium hands and folds a
+majority of hands. When subjected to the same directive, GPT-4 plays like a
+maniac, showcasing a loose and aggressive style of play. Both strategies,
+although relatively advanced, are not game theory optimal.
+
+
+
+
+
+
+
+ ♻ ☆ A Survey of Reasoning with Foundation Models: Concepts, Methodologies,
+ and Outlook
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+ A large body of NLP research has documented the ways gender biases manifest
+and amplify within large language models (LLMs), though this research has
+predominantly operated within a gender binary-centric context. A growing body
+of work has identified the harmful limitations of this gender-exclusive
+framing; many LLMs cannot correctly and consistently refer to persons outside
+the gender binary, especially if they use neopronouns. While data scarcity has
+been identified as a possible culprit, the precise mechanisms through which it
+influences LLM misgendering remain underexplored. Our work addresses this gap
+by studying data scarcity's role in subword tokenization and, consequently, the
+formation of LLM word representations. We uncover how the Byte-Pair Encoding
+(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun
+misgendering through out-of-vocabulary behavior. We introduce pronoun
+tokenization parity (PTP), a novel approach to reduce LLM neopronoun
+misgendering by preserving a token's functional structure. We evaluate PTP's
+efficacy using pronoun consistency-based metrics and a novel syntax-based
+metric. Through several controlled experiments, finetuning LLMs with PTP
+improves neopronoun consistency from 14.5% to 58.4%, highlighting the
+significant role tokenization plays in LLM pronoun consistency.
+
+
+
+ comment: Accepted to 2023 Neurips Queer in AI workshop
+
+ Keyphrase extraction is a fundamental task in natural language processing and
+information retrieval that aims to extract a set of phrases with important
+information from a source document. Identifying important keyphrase is the
+central component of the keyphrase extraction task, and its main challenge is
+how to represent information comprehensively and discriminate importance
+accurately. In this paper, to address these issues, we design a new hyperbolic
+matching model (HyperMatch) to represent phrases and documents in the same
+hyperbolic space and explicitly estimate the phrase-document relevance via the
+Poincar\'e distance as the important score of each phrase. Specifically, to
+capture the hierarchical syntactic and semantic structure information,
+HyperMatch takes advantage of the hidden representations in multiple layers of
+RoBERTa and integrates them as the word embeddings via an adaptive mixing
+layer. Meanwhile, considering the hierarchical structure hidden in the
+document, HyperMatch embeds both phrases and documents in the same hyperbolic
+space via a hyperbolic phrase encoder and a hyperbolic document encoder. This
+strategy can further enhance the estimation of phrase-document relevance due to
+the good properties of hyperbolic space. In this setting, the keyphrase
+extraction can be taken as a matching problem and effectively implemented by
+minimizing a hyperbolic margin-based triplet loss. Extensive experiments are
+conducted on six benchmarks and demonstrate that HyperMatch outperforms the
+state-of-the-art baselines.
+
+
+
+ comment: 12 pages, 3 figures, Accepted by NAACL2022
+
+
+
+
+
+
+ ♻ ☆ Importance Estimation from Multiple Perspectives for Keyphrase
+ Extraction EMNLP2021
+
+
+ Keyphrase extraction is a fundamental task in Natural Language Processing,
+which usually contains two main parts: candidate keyphrase extraction and
+keyphrase importance estimation. From the view of human understanding
+documents, we typically measure the importance of phrase according to its
+syntactic accuracy, information saliency, and concept consistency
+simultaneously. However, most existing keyphrase extraction approaches only
+focus on the part of them, which leads to biased results. In this paper, we
+propose a new approach to estimate the importance of keyphrase from multiple
+perspectives (called as \textit{KIEMP}) and further improve the performance of
+keyphrase extraction. Specifically, \textit{KIEMP} estimates the importance of
+phrase with three modules: a chunking module to measure its syntactic accuracy,
+a ranking module to check its information saliency, and a matching module to
+judge the concept (i.e., topic) consistency between phrase and the whole
+document. These three modules are seamlessly jointed together via an end-to-end
+multi-task learning model, which is helpful for three parts to enhance each
+other and balance the effects of three perspectives. Experimental results on
+six benchmark datasets show that \textit{KIEMP} outperforms the existing
+state-of-the-art keyphrase extraction approaches in most cases.
+
+
+
+ comment: 11 pages, 2 figures, Accepted by EMNLP2021
+
+ Recently, instruction-following audio-language models have received broad
+attention for audio interaction with humans. However, the absence of
+pre-trained audio models capable of handling diverse audio types and tasks has
+hindered progress in this field. Consequently, most existing works have only
+been able to support a limited range of interaction capabilities. In this
+paper, we develop the Qwen-Audio model and address this limitation by scaling
+up audio-language pre-training to cover over 30 tasks and various audio types,
+such as human speech, natural sounds, music, and songs, to facilitate universal
+audio understanding abilities. However, directly co-training all tasks and
+datasets can lead to interference issues, as the textual labels associated with
+different datasets exhibit considerable variations due to differences in task
+focus, language, granularity of annotation, and text structure. To overcome the
+one-to-many interference, we carefully design a multi-task training framework
+by conditioning on a sequence of hierarchical tags to the decoder for
+encouraging knowledge sharing and avoiding interference through shared and
+specified tags respectively. Remarkably, Qwen-Audio achieves impressive
+performance across diverse benchmark tasks without requiring any task-specific
+fine-tuning, surpassing its counterparts. Building upon the capabilities of
+Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from
+various audios and text inputs, enabling multi-turn dialogues and supporting
+various audio-central scenarios.
+
+
+
+ comment: The code, checkpoints and demo are released at
+ https://github.com/QwenLM/Qwen-Audio
+
+
+
+
+
+
+ ♻ ☆ Context Matters: Data-Efficient Augmentation of Large Language Models
+ for Scientific Applications
+
+
+ In this paper, we explore the challenges inherent to Large Language Models
+(LLMs) like GPT-4, particularly their propensity for hallucinations, logic
+mistakes, and incorrect conclusions when tasked with answering complex
+questions. The capacity of LLMs to present erroneous answers in a coherent and
+semantically rigorous manner further complicates the detection of factual
+inaccuracies. This issue is especially pronounced in fields that require
+specialized expertise. Our work delves into these challenges, aiming to enhance
+the understanding and mitigation of such errors, thereby contributing to the
+improvement of LLM accuracy and reliability in scientific and other specialized
+domains. Our findings reveal a non-linear relationship between the context's
+relevancy and the answers' measured quality. In addition, we demonstrate that
+with the correct calibration, it is possible to automate the grading procedure
+-- a finding suggesting that, at least to some degree, the LLMs can be used to
+self-examine the quality of their own performance. Finally, we describe an
+experimental platform that can be seen as a proof-of-concept of the techniques
+described in this work.
+
+
+
+ comment: 11 pages, 6 figures, 4 tables, 3 pages of supplementary material
+
+
+
+
+
+
+ ♻ ☆ From Artificially Real to Real: Leveraging Pseudo Data from Large
+ Language Models for Low-Resource Molecule Discovery AAAI2024
+
+
+ Molecule discovery serves as a cornerstone in numerous scientific domains,
+fueling the development of new materials and innovative drug designs. Recent
+developments of in-silico molecule discovery have highlighted the promising
+results of cross-modal techniques, which bridge molecular structures with their
+descriptive annotations. However, these cross-modal methods frequently
+encounter the issue of data scarcity, hampering their performance and
+application. In this paper, we address the low-resource challenge by utilizing
+artificially-real data generated by Large Language Models (LLMs). We first
+introduce a retrieval-based prompting strategy to construct high-quality pseudo
+data, then explore the optimal method to effectively leverage this pseudo data.
+Experiments show that using pseudo data for domain adaptation outperforms all
+existing methods, while also requiring a smaller model scale, reduced data size
+and lower training cost, highlighting its efficiency. Furthermore, our method
+shows a sustained improvement as the volume of pseudo data increases, revealing
+the great potential of pseudo data in advancing low-resource cross-modal
+molecule discovery. Our code and data are available at
+https://github.com/SCIR-HI/ArtificiallyR2R.
+
+
+
+ comment: Accepted to AAAI2024
+
+
+
+
+
+
+ ♻ ☆ FedJudge: Federated Legal Large Language Model DASFAA 2024
+
+
+
+
+
+
+
+
+ Linan Yue, Qi Liu, Yichao Du, Weibo Gao, Ye Liu, Fangzhou Yao
+
+
+ Large Language Models (LLMs) have gained prominence in the field of Legal
+Intelligence, offering potential applications in assisting legal professionals
+and laymen. However, the centralized training of these Legal LLMs raises data
+privacy concerns, as legal data is distributed among various institutions
+containing sensitive individual information. This paper addresses this
+challenge by exploring the integration of Legal LLMs with Federated Learning
+(FL) methodologies. By employing FL, Legal LLMs can be fine-tuned locally on
+devices or clients, and their parameters are aggregated and distributed on a
+central server, ensuring data privacy without directly sharing raw data.
+However, computation and communication overheads hinder the full fine-tuning of
+LLMs under the FL setting. Moreover, the distribution shift of legal data
+reduces the effectiveness of FL methods. To this end, in this paper, we propose
+the first Federated Legal Large Language Model (FedJudge) framework, which
+fine-tunes Legal LLMs efficiently and effectively. Specifically, FedJudge
+utilizes parameter-efficient fine-tuning methods to update only a few
+additional parameters during the FL training. Besides, we explore the continual
+learning methods to preserve the global model's important parameters when
+training local clients to mitigate the problem of data shifts. Extensive
+experimental results on three real-world datasets clearly validate the
+effectiveness of FedJudge. Code is released at
+https://github.com/yuelinan/FedJudge.
+
+
+
+ comment: Submitted to DASFAA 2024
+
+
+
+
+
+
+ ♻ ☆ Exploring Large Language Model for Graph Data Understanding in Online
+ Job Recommendations
+
+
+ Large Language Models (LLMs) have revolutionized natural language processing
+tasks, demonstrating their exceptional capabilities in various domains.
+However, their potential for behavior graph understanding in job
+recommendations remains largely unexplored. This paper focuses on unveiling the
+capability of large language models in understanding behavior graphs and
+leveraging this understanding to enhance recommendations in online recruitment,
+including the promotion of out-of-distribution (OOD) application. We present a
+novel framework that harnesses the rich contextual information and semantic
+representations provided by large language models to analyze behavior graphs
+and uncover underlying patterns and relationships. Specifically, we propose a
+meta-path prompt constructor that leverages LLM recommender to understand
+behavior graphs for the first time and design a corresponding path augmentation
+module to alleviate the prompt bias introduced by path-based sequence input. By
+leveraging this capability, our framework enables personalized and accurate job
+recommendations for individual users. We evaluate the effectiveness of our
+approach on a comprehensive dataset and demonstrate its ability to improve the
+relevance and quality of recommended quality. This research not only sheds
+light on the untapped potential of large language models but also provides
+valuable insights for developing advanced recommendation systems in the
+recruitment market. The findings contribute to the growing field of natural
+language processing and offer practical implications for enhancing job search
+experiences. We release the code at https://github.com/WLiK/GLRec.
+
+
+
+
+
+
+
+ ♻ ☆ Machine Mindset: An MBTI Exploration of Large Language Models
+
+
+ We present a novel approach for integrating Myers-Briggs Type Indicator
+(MBTI) personality traits into large language models (LLMs), addressing the
+challenges of personality consistency in personalized AI. Our method, "Machine
+Mindset," involves a two-phase fine-tuning and Direct Preference Optimization
+(DPO) to embed MBTI traits into LLMs. This approach ensures that models
+internalize these traits, offering a stable and consistent personality profile.
+We demonstrate the effectiveness of our models across various domains, showing
+alignment between model performance and their respective MBTI traits. The paper
+highlights significant contributions in the development of personality datasets
+and a new training methodology for personality integration in LLMs, enhancing
+the potential for personalized AI applications. We also open-sourced our model
+and part of the data at \url{https://github.com/PKU-YuanGroup/Machine-Mindset}.
+
+
+ Existing neural models are demonstrated to struggle with compositional
+generalization (CG), i.e., the ability to systematically generalize to unseen
+compositions of seen components. A key reason for failure on CG is that the
+syntactic and semantic representations of sequences in both the uppermost layer
+of the encoder and decoder are entangled. However, previous work concentrates
+on separating the learning of syntax and semantics instead of exploring the
+reasons behind the representation entanglement (RE) problem to solve it. We
+explain why it exists by analyzing the representation evolving mechanism from
+the bottom to the top of the Transformer layers. We find that the ``shallow''
+residual connections within each layer fail to fuse previous layers'
+information effectively, leading to information forgetting between layers and
+further the RE problems. Inspired by this, we propose LRF, a novel
+\textbf{L}ayer-wise \textbf{R}epresentation \textbf{F}usion framework for CG,
+which learns to fuse previous layers' information back into the encoding and
+decoding process effectively through introducing a \emph{fuse-attention module}
+at each encoder and decoder layer. LRF achieves promising results on two
+realistic benchmarks, empirically demonstrating the effectiveness of our
+proposal.
+
+
+
+ comment: accepted by aaai24. arXiv admin note: substantial text overlap with
+ arXiv:2305.12169
+
+
+
+
+
+
+ ♻ ☆ Contrastive variational information bottleneck for aspect-based
+ sentiment analysis
+
+
+ Deep learning techniques have dominated the literature on aspect-based
+sentiment analysis (ABSA), achieving state-of-the-art performance. However,
+deep models generally suffer from spurious correlations between input features
+and output labels, which hurts the robustness and generalization capability by
+a large margin. In this paper, we propose to reduce spurious correlations for
+ABSA, via a novel Contrastive Variational Information Bottleneck framework
+(called CVIB). The proposed CVIB framework is composed of an original network
+and a self-pruned network, and these two networks are optimized simultaneously
+via contrastive learning. Concretely, we employ the Variational Information
+Bottleneck (VIB) principle to learn an informative and compressed network
+(self-pruned network) from the original network, which discards the superfluous
+patterns or spurious correlations between input features and prediction labels.
+Then, self-pruning contrastive learning is devised to pull together
+semantically similar positive pairs and push away dissimilar pairs, where the
+representations of the anchor learned by the original and self-pruned networks
+respectively are regarded as a positive pair while the representations of two
+different sentences within a mini-batch are treated as a negative pair. To
+verify the effectiveness of our CVIB method, we conduct extensive experiments
+on five benchmark ABSA datasets and the experimental results show that our
+approach achieves better performance than the strong competitors in terms of
+overall prediction performance, robustness, and generalization. Code and data
+to reproduce the results in this paper is available at:
+https://github.com/shesshan/CVIB.
+
+
+
+ comment: Accepted by Knowledge-Based Systems (KBS)
+
+ Augmenting large language models (LLMs) with external tools has emerged as a
+promising approach to extending the capability of LLMs. Although some works
+employ open-source LLMs for the tool learning task, most of them are trained in
+a controlled environment in which LLMs only learn to execute the human-provided
+tools. However, selecting proper tools from the large toolset is also a crucial
+ability for the tool learning model to be applied in real-world applications.
+Existing methods usually directly employ self-instruction methods to train the
+model, which ignores differences in tool complexity. In this paper, we propose
+the Confucius, a novel tool learning framework to train LLM to use complicated
+tools in real-world scenarios, which contains two main phases: (1) We first
+propose a multi-stage learning method to teach the LLM to use various tools
+from an easy-to-difficult curriculum; (2) thenceforth, we propose the Iterative
+Self-instruct from Introspective Feedback (ISIF) to dynamically construct the
+dataset to improve the ability to use the complicated tool. Extensive
+experiments conducted on both controlled and real-world settings demonstrate
+the superiority of our tool learning framework in the real-world application
+scenarios compared to both tuning-free (e.g. ChatGPT, Claude) and tuning-based
+baselines (e.g. GPT4Tools).
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ BloombergGPT: A Large Language Model for Finance
+
+
+
+
+
+
+
+
+ Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
+
+
+ The use of NLP in the realm of financial technology is broad and complex,
+with applications ranging from sentiment analysis and named entity recognition
+to question answering. Large Language Models (LLMs) have been shown to be
+effective on a variety of tasks; however, no LLM specialized for the financial
+domain has been reported in literature. In this work, we present BloombergGPT,
+a 50 billion parameter language model that is trained on a wide range of
+financial data. We construct a 363 billion token dataset based on Bloomberg's
+extensive data sources, perhaps the largest domain-specific dataset yet,
+augmented with 345 billion tokens from general purpose datasets. We validate
+BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite
+of internal benchmarks that most accurately reflect our intended usage. Our
+mixed dataset training leads to a model that outperforms existing models on
+financial tasks by significant margins without sacrificing performance on
+general LLM benchmarks. Additionally, we explain our modeling choices, training
+process, and evaluation methodology. We release Training Chronicles (Appendix
+C) detailing our experience in training BloombergGPT.
+
+
+
+ comment: Updated to include Training Chronicles (Appendix C)
+
+
+
+
+
+
+ ♻ ☆ Can Transformers Learn Sequential Function Classes In Context?
+
+
+
+
+
+
+
+
+ Ryan Campbell, Emma Guo, Evan Hu, Reya Vir, Ethan Hsiao
+
+
+ In-context learning (ICL) has revolutionized the capabilities of transformer
+models in NLP. In our project, we extend the understanding of the mechanisms
+underpinning ICL by exploring whether transformers can learn from sequential,
+non-textual function class data distributions. We introduce a novel sliding
+window sequential function class and employ toy-sized transformers with a GPT-2
+architecture to conduct our experiments. Our analysis indicates that these
+models can indeed leverage ICL when trained on non-textual sequential function
+classes. Additionally, our experiments with randomized y-label sequences
+highlights that transformers retain some ICL capabilities even when the label
+associations are obfuscated. We provide evidence that transformers can reason
+with and understand sequentiality encoded within function classes, as reflected
+by the effective learning of our proposed tasks. Our results also show that the
+performance deteriorated with increasing randomness in the labels, though not
+to the extent one might expect, implying a potential robustness of learned
+sequentiality against label noise. Future research may want to look into how
+previous explanations of transformers, such as induction heads and task
+vectors, relate to sequentiality in ICL in these toy examples. Our
+investigation lays the groundwork for further research into how transformers
+process and perceive sequential data.
+
+
+
+ comment: 8 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ An Empirical Study of CLIP for Text-based Person Search AAAI 2024
+
+
+
+
+
+
+
+
+ Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
+
+
+ Text-based Person Search (TBPS) aims to retrieve the person images using
+natural language descriptions. Recently, Contrastive Language Image Pretraining
+(CLIP), a universal large cross-modal vision-language pre-training model, has
+remarkably performed over various cross-modal downstream tasks due to its
+powerful cross-modal semantic learning capacity. TPBS, as a fine-grained
+cross-modal retrieval task, is also facing the rise of research on the
+CLIP-based TBPS. In order to explore the potential of the visual-language
+pre-training model for downstream TBPS tasks, this paper makes the first
+attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus
+contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the
+TBPS community. We revisit critical design considerations under CLIP, including
+data augmentation and loss function. The model, with the aforementioned designs
+and practical training tricks, can attain satisfactory performance without any
+sophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in
+model generalization and model compression, demonstrating the effectiveness of
+TBPS-CLIP from various aspects. This work is expected to provide empirical
+insights and highlight future CLIP-based TBPS research.
+
+
+
+ comment: Accepted by AAAI 2024. Code is available at
+ https://github.com/Flame-Chasers/TBPS-CLIP
+
+
+
+
+
+
+ ♻ ☆ Towards Better Serialization of Tabular Data for Few-shot Classification
+ with Large Language Models
+
+
+ We present a study on the integration of Large Language Models (LLMs) in
+tabular data classification, emphasizing an efficient framework. Building upon
+existing work done in TabLLM (arXiv:2210.10723), we introduce three novel
+serialization techniques, including the standout LaTeX serialization method.
+This method significantly boosts the performance of LLMs in processing
+domain-specific datasets, Our method stands out for its memory efficiency and
+ability to fully utilize complex data structures. Through extensive
+experimentation, including various serialization approaches like feature
+combination and importance, we demonstrate our work's superiority in accuracy
+and efficiency over traditional models.
+
+
+
+ comment: 4 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors AAAI 2024
+
+
+
+
+
+
+
+
+ Yi-Fan Zhang, Zhang Zhang, Liang Wang, Tieniu Tan, Rong Jin
+
+
+ To combat the potential misuse of Natural Language Generation (NLG)
+technology, a variety of algorithms have been developed for the detection of
+AI-generated texts. Traditionally, this task is treated as a binary
+classification problem. Although supervised learning has demonstrated promising
+results, acquiring labeled data for detection purposes poses real-world
+challenges and the risk of overfitting. In an effort to address these issues,
+we delve into the realm of zero-shot machine-generated text detection. Existing
+zero-shot detectors, typically designed for specific tasks or topics, often
+assume uniform testing scenarios, limiting their practicality. In our research,
+we explore various advanced Large Language Models (LLMs) and their specialized
+variants, contributing to this field in several ways. In empirical studies, we
+uncover a significant correlation between topics and detection performance.
+Secondly, we delve into the influence of topic shifts on zero-shot detectors.
+These investigations shed light on the adaptability and robustness of these
+detection methods across diverse topics. The code is available at
+\url{https://github.com/yfzhang114/robustness-detection}.
+
+
+
+ comment: 8 pages, 3 figures, AAAI 2024 Workshop on Responsible Language Models
+
+
+
+
+
+
+
+ Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
+
+
+ Existing algorithms for reinforcement learning from human feedback (RLHF) can
+incentivize responses at odds with preferences because they are based on models
+that assume independence of irrelevant alternatives (IIA). The perverse
+incentives induced by IIA give rise to egregious behavior when innovating on
+query formats or learning algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ Shall We Pretrain Autoregressive Language Models with Retrieval? A
+ Comprehensive Study EMNLP 2023
+
+
+
+
+
+
+
+
+ Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro
+
+
+ Large decoder-only language models (LMs) can be largely improved in terms of
+perplexity by retrieval (e.g., RETRO), but its impact on text generation
+quality and downstream task accuracy is unclear. Thus, it is still an open
+question: shall we pretrain large autoregressive LMs with retrieval? To answer
+it, we perform a comprehensive study on a scalable pre-trained
+retrieval-augmented LM (i.e., RETRO) compared with standard GPT and
+retrieval-augmented GPT incorporated at fine-tuning or inference stages. We
+first provide the recipe to reproduce RETRO up to 9.5B parameters while
+retrieving a text corpus with 330B tokens. Based on that, we have the following
+novel findings: i) RETRO outperforms GPT on text generation with much less
+degeneration (i.e., repetition), moderately higher factual accuracy, and
+slightly lower toxicity with a nontoxic retrieval database. ii) On the LM
+Evaluation Harness benchmark, RETRO largely outperforms GPT on
+knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,
+we introduce a simple variant of the model, RETRO++, which largely improves
+open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural
+Question) and significantly outperforms retrieval-augmented GPT in both
+fine-tuning and zero-shot evaluation settings. Our findings highlight the
+promising direction of pretraining autoregressive LMs with retrieval as future
+foundation models. We release our code and model at:
+https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
+
+
+
+ comment: EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
+
+
+
+
+
+
+
+
+ Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
+
+
+ Existing open-vocabulary image segmentation methods require a fine-tuning
+step on mask annotations and/or image-text datasets. Mask labels are
+labor-intensive, which limits the number of categories in segmentation
+datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is
+severely reduced after fine-tuning. However, without fine-tuning, VLMs trained
+under weak image-text supervision tend to make suboptimal mask predictions when
+there are text queries referring to non-existing concepts in the image. To
+alleviate these issues, we introduce a novel recurrent framework that
+progressively filters out irrelevant texts and enhances mask quality without
+training efforts. The recurrent unit is a two-stage segmenter built upon a VLM
+with frozen weights. Thus, our model retains the VLM's broad vocabulary space
+and strengthens its segmentation capability. Experimental results show that our
+method outperforms not only the training-free counterparts, but also those
+fine-tuned with millions of additional data samples, and sets new
+state-of-the-art records for both zero-shot semantic and referring image
+segmentation tasks. Specifically, we improve the current record by 28.8, 16.0,
+and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
+
+
+ We introduce MMMU: a new benchmark designed to evaluate multimodal models on
+massive multi-discipline tasks demanding college-level subject knowledge and
+deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal
+questions from college exams, quizzes, and textbooks, covering six core
+disciplines: Art & Design, Business, Science, Health & Medicine, Humanities &
+Social Science, and Tech & Engineering. These questions span 30 subjects and
+183 subfields, comprising 30 highly heterogeneous image types, such as charts,
+diagrams, maps, tables, music sheets, and chemical structures. Unlike existing
+benchmarks, MMMU focuses on advanced perception and reasoning with
+domain-specific knowledge, challenging models to perform tasks akin to those
+faced by experts. The evaluation of 14 open-source LMMs as well as the
+proprietary GPT-4V(ision) and Gemini highlights the substantial challenges
+posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve
+accuracies of 56% and 59% respectively, indicating significant room for
+improvement. We believe MMMU will stimulate the community to build
+next-generation multimodal foundation models towards expert artificial general
+intelligence.
+
+
+
+
+
+ ☆ 3D Pose Estimation of Two Interacting Hands from a Monocular Event
+ Camera
+
+
+
+
+
+
+
+
+ Christen Millerdurai, Diogo Luvizon, Viktor Rudnev, André Jonas, Jiayi Wang, Christian Theobalt, Vladislav Golyanik
+
+
+ 3D hand tracking from a monocular video is a very challenging problem due to
+hand interactions, occlusions, left-right hand ambiguity, and fast motion. Most
+existing methods rely on RGB inputs, which have severe limitations under
+low-light conditions and suffer from motion blur. In contrast, event cameras
+capture local brightness changes instead of full image frames and do not suffer
+from the described effects. Unfortunately, existing image-based techniques
+cannot be directly applied to events due to significant differences in the data
+modalities. In response to these challenges, this paper introduces the first
+framework for 3D tracking of two fast-moving and interacting hands from a
+single monocular event camera. Our approach tackles the left-right hand
+ambiguity with a novel semi-supervised feature-wise attention mechanism and
+integrates an intersection loss to fix hand collisions. To facilitate advances
+in this research domain, we release a new synthetic large-scale dataset of two
+interacting hands, Ev2Hands-S, and a new real benchmark with real event streams
+and ground-truth 3D annotations, Ev2Hands-R. Our approach outperforms existing
+methods in terms of the 3D reconstruction accuracy and generalises to real data
+under severe light conditions.
+
+
+ Toward unlocking the potential of generative models in immersive 4D
+experiences, we introduce Virtual Pet, a novel pipeline to model realistic and
+diverse motions for target animal species within a 3D environment. To
+circumvent the limited availability of 3D motion data aligned with
+environmental geometry, we leverage monocular internet videos and extract
+deformable NeRF representations for the foreground and static NeRF
+representations for the background. For this, we develop a reconstruction
+strategy, encompassing species-level shared template learning and per-video
+fine-tuning. Utilizing the reconstructed data, we then train a conditional 3D
+motion model to learn the trajectory and articulation of foreground animals in
+the context of 3D backgrounds. We showcase the efficacy of our pipeline with
+comprehensive qualitative and quantitative evaluations using cat videos. We
+also demonstrate versatility across unseen cats and indoor environments,
+producing temporally coherent 4D outputs for enriched virtual experiences.
+
+
+
+
+
+
+
+
+ Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li
+
+
+ We study how vision-language models (VLMs) trained on web-scale data can be
+integrated into end-to-end driving systems to boost generalization and enable
+interactivity with human users. While recent approaches adapt VLMs to driving
+via single-round visual question answering (VQA), human drivers reason about
+decisions in multiple steps. Starting from the localization of key objects,
+humans estimate object interactions before taking actions. The key insight is
+that with our proposed task, Graph VQA, where we model graph-structured
+reasoning through perception, prediction and planning question-answer pairs, we
+obtain a suitable proxy task to mimic the human reasoning process. We
+instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose
+a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA
+and end-to-end driving. The experiments demonstrate that Graph VQA provides a
+simple, principled framework for reasoning about a driving scene, and
+DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent
+baseline performs end-to-end autonomous driving competitively in comparison to
+state-of-the-art driving-specific architectures. Notably, its benefits are
+pronounced when it is evaluated zero-shot on unseen objects or sensor
+configurations. We hope this work can be the starting point to shed new light
+on how to apply VLMs for autonomous driving. To facilitate future research, all
+code, data, and models are available to the public.
+
+
+ The crux of learning vision-language models is to extract semantically
+aligned information from visual and linguistic data. Existing attempts usually
+face the problem of coarse alignment, \textit{e.g.}, the vision encoder
+struggles in localizing an attribute-specified object. In this work, we propose
+an embarrassingly simple approach to better align image and text features with
+no need of additional data formats other than image-text pairs. Concretely,
+given an image and its paired text, we manage to parse objects (\textit{e.g.},
+cat) and attributes (\textit{e.g.}, black) from the description, which are
+highly likely to exist in the image. It is noteworthy that the parsing pipeline
+is fully automatic and thus enjoys good scalability. With these parsed
+semantics as supervision signals, we can complement the commonly used
+image-text contrastive loss with the multi-tag classification loss. Extensive
+experimental results on a broad suite of semantic segmentation datasets
+substantiate the average 3.65\% improvement of our framework over existing
+alternatives. Furthermore, the visualization results indicate that attribute
+supervision makes vision-language models accurately localize
+attribute-specified objects. Project page can be found at
+https://qinying-liu.github.io/Tag-Align/
+
+
+ Current advances in human head modeling allow to generate plausible-looking
+3D head models via neural representations. Nevertheless, constructing complete
+high-fidelity head models with explicitly controlled animation remains an
+issue. Furthermore, completing the head geometry based on a partial
+observation, e.g. coming from a depth sensor, while preserving details is often
+problematic for the existing methods. We introduce a generative model for
+detailed 3D head meshes on top of an articulated 3DMM which allows explicit
+animation and high-detail preservation at the same time. Our method is trained
+in two stages. First, we register a parametric head model with vertex
+displacements to each mesh of the recently introduced NPHM dataset of accurate
+3D head scans. The estimated displacements are baked into a hand-crafted UV
+layout. Second, we train a StyleGAN model in order to generalize over the UV
+maps of displacements. The decomposition of the parametric model and
+high-quality vertex displacements allows us to animate the model and modify it
+semantically. We demonstrate the results of unconditional generation and
+fitting to the full or partial observation. The project page is available at
+https://seva100.github.io/headcraft.
+
+
+ Weakly-supervised temporal action localization aims to localize action
+instances in videos with only video-level action labels. Existing methods
+mainly embrace a localization-by-classification pipeline that optimizes the
+snippet-level prediction with a video classification loss. However, this
+formulation suffers from the discrepancy between classification and detection,
+resulting in inaccurate separation of foreground and background (F\&B)
+snippets. To alleviate this problem, we propose to explore the underlying
+structure among the snippets by resorting to unsupervised snippet clustering,
+rather than heavily relying on the video classification loss. Specifically, we
+propose a novel clustering-based F\&B separation algorithm. It comprises two
+core components: a snippet clustering component that groups the snippets into
+multiple latent clusters and a cluster classification component that further
+classifies the cluster as foreground or background. As there are no
+ground-truth labels to train these two components, we introduce a unified
+self-labeling mechanism based on optimal transport to produce high-quality
+pseudo-labels that match several plausible prior distributions. This ensures
+that the cluster assignments of the snippets can be accurately associated with
+their F\&B labels, thereby boosting the F\&B separation. We evaluate our method
+on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves
+promising performance on all three benchmarks while being significantly more
+lightweight than previous methods. Code is available at
+https://github.com/Qinying-Liu/CASE
+
+
+
+ comment: ICCV2023
+
+
+
+
+
+
+ ☆ $\textit{V}^*$: Guided Visual Search as a Core Mechanism in Multimodal
+ LLMs
+
+
+ When we look around and perform complex tasks, how we see and selectively
+process what we see is crucial. However, the lack of this visual search
+mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on
+important visual details, especially when handling high-resolution and visually
+crowded images. To address this, we introduce $\textit{V}^*$, an LLM-guided
+visual search mechanism that employs the world knowledge in LLMs for efficient
+visual querying. When combined with an MLLM, this mechanism enhances
+collaborative reasoning, contextual understanding, and precise targeting of
+specific visual elements. This integration results in a new MLLM
+meta-architecture, named $\textbf{S}$how, s$\textbf{EA}$rch, and
+Tel$\textbf{L}$ (SEAL). We further create $\textit{V}^*$Bench, a benchmark
+specifically designed to evaluate MLLMs in their ability to process
+high-resolution images and focus on visual details. Our study highlights the
+necessity of incorporating visual search capabilities into multimodal systems.
+The code is available https://github.com/penghao-wu/vstar.
+
+
+ Learning rewards from expert videos offers an affordable and effective
+solution to specify the intended behaviors for reinforcement learning tasks. In
+this work, we propose Diffusion Reward, a novel framework that learns rewards
+from expert videos via conditional video diffusion models for solving complex
+visual RL problems. Our key insight is that lower generative diversity is
+observed when conditioned on expert trajectories. Diffusion Reward is
+accordingly formalized by the negative of conditional entropy that encourages
+productive exploration of expert-like behaviors. We show the efficacy of our
+method over 10 robotic manipulation tasks from MetaWorld and Adroit with visual
+input and sparse reward. Moreover, Diffusion Reward could even solve unseen
+tasks successfully and effectively, largely surpassing baseline methods.
+Project page and code: https://diffusion-reward.github.io/.
+
+
+
+ comment: Project page and code: https://diffusion-reward.github.io/
+
+
+
+
+
+
+ ☆ DUSt3R: Geometric 3D Vision Made Easy
+
+
+
+
+
+
+
+
+ Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud
+
+
+ Multi-view stereo reconstruction (MVS) in the wild requires to first estimate
+the camera parameters e.g. intrinsic and extrinsic parameters. These are
+usually tedious and cumbersome to obtain, yet they are mandatory to triangulate
+corresponding pixels in 3D space, which is the core of all best performing MVS
+algorithms. In this work, we take an opposite stance and introduce DUSt3R, a
+radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction
+of arbitrary image collections, i.e. operating without prior information about
+camera calibration nor viewpoint poses. We cast the pairwise reconstruction
+problem as a regression of pointmaps, relaxing the hard constraints of usual
+projective camera models. We show that this formulation smoothly unifies the
+monocular and binocular reconstruction cases. In the case where more than two
+images are provided, we further propose a simple yet effective global alignment
+strategy that expresses all pairwise pointmaps in a common reference frame. We
+base our network architecture on standard Transformer encoders and decoders,
+allowing us to leverage powerful pretrained models. Our formulation directly
+provides a 3D model of the scene as well as depth information, but
+interestingly, we can seamlessly recover from it, pixel matches, relative and
+absolute camera. Exhaustive experiments on all these tasks showcase that the
+proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on
+monocular/multi-view depth estimation as well as relative pose estimation. In
+summary, DUSt3R makes many geometric 3D vision tasks easy.
+
+
+
+
+
+
+
+
+ Bardia Safaei, Vibashan VS, Celso M. de Melo, Vishal M. Patel
+
+
+ Active Learning (AL) aims to enhance the performance of deep models by
+selecting the most informative samples for annotation from a pool of unlabeled
+data. Despite impressive performance in closed-set settings, most AL methods
+fail in real-world scenarios where the unlabeled data contains unknown
+categories. Recently, a few studies have attempted to tackle the AL problem for
+the open-set setting. However, these methods focus more on selecting known
+samples and do not efficiently utilize unknown samples obtained during AL
+rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which
+leverages both known and unknown distributions effectively to select
+informative samples during AL rounds. Specifically, our approach employs two
+different entropy scores. One measures the uncertainty of a sample with respect
+to the known-class distributions. The other measures the uncertainty of the
+sample with respect to the unknown-class distributions. By utilizing these two
+entropy scores we effectively separate the known and unknown samples from the
+unlabeled data resulting in better sampling. Through extensive experiments, we
+show that the proposed method outperforms existing state-of-the-art methods on
+CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at
+\url{https://github.com/bardisafa/EOAL}.
+
+
+
+ comment: Accepted in AAAI 2024
+
+
+
+
+
+
+ ☆ VideoPoet: A Large Language Model for Zero-Shot Video Generation
+
+
+
+
+
+
+
+
+ Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang
+
+
+ We present VideoPoet, a language model capable of synthesizing high-quality
+video, with matching audio, from a large variety of conditioning signals.
+VideoPoet employs a decoder-only transformer architecture that processes
+multimodal inputs -- including images, videos, text, and audio. The training
+protocol follows that of Large Language Models (LLMs), consisting of two
+stages: pretraining and task-specific adaptation. During pretraining, VideoPoet
+incorporates a mixture of multimodal generative objectives within an
+autoregressive Transformer framework. The pretrained LLM serves as a foundation
+that can be adapted for a range of video generation tasks. We present empirical
+results demonstrating the model's state-of-the-art capabilities in zero-shot
+video generation, specifically highlighting VideoPoet's ability to generate
+high-fidelity motions. Project page: http://sites.research.google/videopoet/
+
+
+
+
+
+
+
+ ☆ Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance
+ Generation
+
+
+
+
+
+
+
+
+ Philipp Schröppel, Christopher Wewer, Jan Eric Lenssen, Eddy Ilg, Thomas Brox
+
+
+ Controllable generation of 3D assets is important for many practical
+applications like content creation in movies, games and engineering, as well as
+in AR/VR. Recently, diffusion models have shown remarkable results in
+generation quality of 3D objects. However, none of the existing models enable
+disentangled generation to control the shape and appearance separately. For the
+first time, we present a suitable representation for 3D diffusion models to
+enable such disentanglement by introducing a hybrid point cloud and neural
+radiance field approach. We model a diffusion process over point positions
+jointly with a high-dimensional feature space for a local density and radiance
+decoder. While the point positions represent the coarse shape of the object,
+the point features allow modeling the geometry and appearance details. This
+disentanglement enables us to sample both independently and therefore to
+control both separately. Our approach sets a new state of the art in generation
+compared to previous disentanglement-capable methods by reduced FID scores of
+30-90% and is on-par with other non disentanglement-capable state-of-the art
+methods.
+
+
+
+
+
+
+
+ ☆ LingoQA: Video Question Answering for Autonomous Driving
+
+
+
+
+
+
+
+
+ Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Oleg Sinavski
+
+
+ Autonomous driving has long faced a challenge with public acceptance due to
+the lack of explainability in the decision-making process. Video
+question-answering (QA) in natural language provides the opportunity for
+bridging this gap. Nonetheless, evaluating the performance of Video QA models
+has proved particularly tough due to the absence of comprehensive benchmarks.
+To fill this gap, we introduce LingoQA, a benchmark specifically for autonomous
+driving Video QA. The LingoQA trainable metric demonstrates a 0.95 Spearman
+correlation coefficient with human evaluations. We introduce a Video QA dataset
+of central London consisting of 419k samples that we release with the paper. We
+establish a baseline vision-language model and run extensive ablation studies
+to understand its performance.
+
+
+
+ comment: Benchmark and dataset are available at
+ https://github.com/wayveai/LingoQA/
+
+
+
+
+
+
+ ☆ HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image
+ Inpainting with Diffusion Models
+
+
+ Recent progress in text-guided image inpainting, based on the unprecedented
+success of text-to-image diffusion models, has led to exceptionally realistic
+and visually plausible results. However, there is still significant potential
+for improvement in current text-to-image inpainting models, particularly in
+better aligning the inpainted area with user prompts and performing
+high-resolution inpainting. Therefore, in this paper we introduce HD-Painter, a
+completely training-free approach that accurately follows to prompts and
+coherently scales to high-resolution image inpainting. To this end, we design
+the Prompt-Aware Introverted Attention (PAIntA) layer enhancing self-attention
+scores by prompt information and resulting in better text alignment
+generations. To further improve the prompt coherence we introduce the
+Reweighting Attention Score Guidance (RASG) mechanism seamlessly integrating a
+post-hoc sampling strategy into general form of DDIM to prevent
+out-of-distribution latent shifts. Moreover, HD-Painter allows extension to
+larger scales by introducing a specialized super-resolution technique
+customized for inpainting, enabling the completion of missing regions in images
+of up to 2K resolution. Our experiments demonstrate that HD-Painter surpasses
+existing state-of-the-art approaches qualitatively and quantitatively,
+achieving an impressive generation accuracy improvement of 61.4% vs 51.9%. We
+will make the codes publicly available at:
+https://github.com/Picsart-AI-Research/HD-Painter
+
+
+
+
+
+
+
+ ☆ LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
+ Understanding
+
+
+ Recently, Large Language Models (LLMs) and Multimodal Large Language Models
+(MLLMs) have shown promise in instruction following and 2D image understanding.
+While these models are powerful, they have not yet been developed to comprehend
+the more challenging 3D physical scenes, especially when it comes to the sparse
+outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw
+LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs
+to gain a comprehensive understanding of outdoor 3D scenes. The central insight
+of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a
+language modeling problem, encompassing tasks such as 3D captioning, 3D
+grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D
+LiDAR-text pairing data, we introduce a three-stage training strategy and
+generate relevant datasets, progressively aligning the 3D modality with the
+language embedding space of LLM. Furthermore, we design a View-Aware
+Transformer (VAT) to connect the 3D encoder with the LLM, which effectively
+bridges the modality gap and enhances the LLM's spatial orientation
+comprehension of visual features. Our experiments show that LiDAR-LLM possesses
+favorable capabilities to comprehend various instructions regarding 3D scenes
+and engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the
+3D captioning task and achieves a 63.1\% classification accuracy and a 14.3\%
+BEV mIoU on the 3D grounding task. Web page:
+https://sites.google.com/view/lidar-llm
+
+
+
+
+
+
+
+ ☆ A Strong Baseline for Temporal Video-Text Alignment
+
+
+ In this paper, we consider the problem of temporally aligning the video and
+texts from instructional videos, specifically, given a long-term video, and
+associated text sentences, our goal is to determine their corresponding
+timestamps in the video. To this end, we establish a simple, yet strong model
+that adopts a Transformer-based architecture with all texts as queries,
+iteratively attending to the visual features, to infer the optimal timestamp.
+We conduct thorough experiments to investigate: (i) the effect of upgrading ASR
+systems to reduce errors from speech recognition, (ii) the effect of various
+visual-textual backbones, ranging from CLIP to S3D, to the more recent
+InternVideo, (iii) the effect of transforming noisy ASR transcripts into
+descriptive steps by prompting a large language model (LLM), to summarize the
+core activities within the ASR transcript as a new training dataset. As a
+result, our proposed simple model demonstrates superior performance on both
+narration alignment and procedural step grounding tasks, surpassing existing
+state-of-the-art methods by a significant margin on three public benchmarks,
+namely, 9.3% on HT-Step, 3.4% on HTM-Align and 4.7% on CrossTask. We believe
+the proposed model and dataset with descriptive steps can be treated as a
+strong baseline for future research in temporal video-text alignment. All
+codes, models, and the resulting dataset will be publicly released to the
+research community.
+
+
+
+
+
+
+
+ ☆ Dual Attention U-Net with Feature Infusion: Pushing the Boundaries of
+ Multiclass Defect Segmentation
+
+
+
+
+
+
+
+
+ Rasha Alshawi, Md Tamjidul Hoque, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Kendall Niles, Ken Prathak, Joe Tom, Jordan Klein, Murtada Mousa, Johny Javier Lopez
+
+
+ The proposed architecture, Dual Attentive U-Net with Feature Infusion (DAU-FI
+Net), addresses challenges in semantic segmentation, particularly on multiclass
+imbalanced datasets with limited samples. DAU-FI Net integrates multiscale
+spatial-channel attention mechanisms and feature injection to enhance precision
+in object localization. The core employs a multiscale depth-separable
+convolution block, capturing localized patterns across scales. This block is
+complemented by a spatial-channel squeeze and excitation (scSE) attention unit,
+modeling inter-dependencies between channels and spatial regions in feature
+maps. Additionally, additive attention gates refine segmentation by connecting
+encoder-decoder pathways.
+ To augment the model, engineered features using Gabor filters for textural
+analysis, Sobel and Canny filters for edge detection are injected guided by
+semantic masks to expand the feature space strategically. Comprehensive
+experiments on a challenging sewer pipe and culvert defect dataset and a
+benchmark dataset validate DAU-FI Net's capabilities. Ablation studies
+highlight incremental benefits from attention blocks and feature injection.
+DAU-FI Net achieves state-of-the-art mean Intersection over Union (IoU) of
+95.6% and 98.8% on the defect test set and benchmark respectively, surpassing
+prior methods by 8.9% and 12.6%, respectively. Ablation studies highlight
+incremental benefits from attention blocks and feature injection. The proposed
+architecture provides a robust solution, advancing semantic segmentation for
+multiclass problems with limited training data. Our sewer-culvert defects
+dataset, featuring pixel-level annotations, opens avenues for further research
+in this crucial domain. Overall, this work delivers key innovations in
+architecture, attention, and feature engineering to elevate semantic
+segmentation efficacy.
+
+
+
+ comment: under review in IEEE Transactions on Artificial Intelligence
+
+
+
+
+
+
+ ☆ Geometric Awareness in Neural Fields for 3D Human Registration
+
+
+ Aligning a template to 3D human point clouds is a long-standing problem
+crucial for tasks like animation, reconstruction, and enabling supervised
+learning pipelines. Recent data-driven methods leverage predicted surface
+correspondences; however, they are not robust to varied poses or distributions.
+In contrast, industrial solutions often rely on expensive manual annotations or
+multi-view capturing systems. Recently, neural fields have shown promising
+results, but their purely data-driven nature lacks geometric awareness, often
+resulting in a trivial misalignment of the template registration. In this work,
+we propose two solutions: LoVD, a novel neural field model that predicts the
+direction towards the localized SMPL vertices on the target surface; and INT,
+the first self-supervised task dedicated to neural fields that, at test time,
+refines the backbone, exploiting the target geometry. We combine them into
+INLoVD, a robust 3D Human body registration pipeline trained on a large MoCap
+dataset. INLoVD is efficient (takes less than a minute), solidly achieves the
+state of the art over public benchmarks, and provides unprecedented
+generalization on out-of-distribution data. We will release code and
+checkpoints in \url{url}.
+
+
+
+
+
+
+
+ ☆ Deep Learning Based Face Recognition Method using Siamese Network
+
+
+ Achieving state-of-the-art results in face verification systems typically
+hinges on the availability of labeled face training data, a resource that often
+proves challenging to acquire in substantial quantities. In this research
+endeavor, we proposed employing Siamese networks for face recognition,
+eliminating the need for labeled face images. We achieve this by strategically
+leveraging negative samples alongside nearest neighbor counterparts, thereby
+establishing positive and negative pairs through an unsupervised methodology.
+The architectural framework adopts a VGG encoder, trained as a double branch
+siamese network. Our primary aim is to circumvent the necessity for labeled
+face image data, thus proposing the generation of training pairs in an entirely
+unsupervised manner. Positive training data are selected within a dataset based
+on their highest cosine similarity scores with a designated anchor, while
+negative training data are culled in a parallel fashion, though drawn from an
+alternate dataset. During training, the proposed siamese network conducts
+binary classification via cross-entropy loss. Subsequently, during the testing
+phase, we directly extract face verification scores from the network's output
+layer. Experimental results reveal that the proposed unsupervised system
+delivers a performance on par with a similar but fully supervised baseline.
+
+
+
+
+
+
+
+ ☆ Open-Set: ID Card Presentation Attack Detection using Neural Transfer
+ Style
+
+
+
+
+
+
+
+
+ Reuben Markham, Juan M. Espin, Mario Nieto-Hidalgo, Juan E. Tapia
+
+
+ The accurate detection of ID card Presentation Attacks (PA) is becoming
+increasingly important due to the rising number of online/remote services that
+require the presentation of digital photographs of ID cards for digital
+onboarding or authentication. Furthermore, cybercriminals are continuously
+searching for innovative ways to fool authentication systems to gain
+unauthorized access to these services. Although advances in neural network
+design and training have pushed image classification to the state of the art,
+one of the main challenges faced by the development of fraud detection systems
+is the curation of representative datasets for training and evaluation. The
+handcrafted creation of representative presentation attack samples often
+requires expertise and is very time-consuming, thus an automatic process of
+obtaining high-quality data is highly desirable. This work explores ID card
+Presentation Attack Instruments (PAI) in order to improve the generation of
+samples with four Generative Adversarial Networks (GANs) based image
+translation models and analyses the effectiveness of the generated data for
+training fraud detection systems. Using open-source data, we show that
+synthetic attack presentations are an adequate complement for additional real
+attack presentations, where we obtain an EER performance increase of 0.63%
+points for print attacks and a loss of 0.29% for screen capture attacks.
+
+
+
+
+
+
+
+
+ Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman
+
+
+ Recent advancements in the text-to-3D task leverage finetuned text-to-image
+diffusion models to generate multi-view images, followed by NeRF
+reconstruction. Yet, existing supervised finetuned (SFT) diffusion models still
+suffer from multi-view inconsistency and the resulting NeRF artifacts. Although
+training longer with SFT improves consistency, it also causes distribution
+shift, which reduces diversity and realistic details. We argue that the SFT of
+multi-view diffusion models resembles the instruction finetuning stage of the
+LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods.
+Essentially, RLFT methods optimize models beyond their SFT data distribution by
+using their own outputs, effectively mitigating distribution shift. To this
+end, we introduce Carve3D, a RLFT method coupled with the Multi-view
+Reconstruction Consistency (MRC) metric, to improve the consistency of
+multi-view diffusion models. To compute MRC on a set of multi-view images, we
+compare them with their corresponding renderings of the reconstructed NeRF at
+the same viewpoints. We validate the robustness of MRC with extensive
+experiments conducted under controlled inconsistency levels. We enhance the
+base RLFT algorithm to stabilize the training process, reduce distribution
+shift, and identify scaling laws. Through qualitative and quantitative
+experiments, along with a user study, we demonstrate Carve3D's improved
+multi-view consistency, the resulting superior NeRF reconstruction quality, and
+minimal distribution shift compared to longer SFT. Project webpage:
+https://desaixie.github.io/carve-3d.
+
+
+
+
+
+
+
+
+ Han Huang, Yulun Wu, Junsheng Zhou, Ge Gao, Ming Gu, Yushen Liu
+
+
+ Recently, neural implicit functions have demonstrated remarkable results in
+the field of multi-view reconstruction. However, most existing methods are
+tailored for dense views and exhibit unsatisfactory performance when dealing
+with sparse views. Several latest methods have been proposed for generalizing
+implicit reconstruction to address the sparse view reconstruction task, but
+they still suffer from high training costs and are merely valid under carefully
+selected perspectives. In this paper, we propose a novel sparse view
+reconstruction framework that leverages on-surface priors to achieve highly
+faithful surface reconstruction. Specifically, we design several constraints on
+global geometry alignment and local geometry refinement for jointly optimizing
+coarse shapes and fine details. To achieve this, we train a neural network to
+learn a global implicit field from the on-surface points obtained from SfM and
+then leverage it as a coarse geometric constraint. To exploit local geometric
+consistency, we project on-surface points onto seen and unseen views, treating
+the consistent loss of projected features as a fine geometric constraint. The
+experimental results with DTU and BlendedMVS datasets in two prevalent sparse
+settings demonstrate significant improvements over the state-of-the-art
+methods.
+
+
+ Recent advancements in personalized text-to-image (T2I) models have
+revolutionized content creation, empowering non-experts to generate stunning
+images with unique styles. While promising, adding realistic motions into these
+personalized images by text poses significant challenges in preserving distinct
+styles, high-fidelity details, and achieving motion controllability by text. In
+this paper, we present PIA, a Personalized Image Animator that excels in
+aligning with condition images, achieving motion controllability by text, and
+the compatibility with various personalized T2I models without specific tuning.
+To achieve these goals, PIA builds upon a base T2I model with well-trained
+temporal alignment layers, allowing for the seamless transformation of any
+personalized T2I model into an image animation model. A key component of PIA is
+the introduction of the condition module, which utilizes the condition frame
+and inter-frame affinity as input to transfer appearance information guided by
+the affinity hint for individual frame synthesis in the latent space. This
+design mitigates the challenges of appearance-related image alignment within
+and allows for a stronger focus on aligning with motion-related guidance.
+
+
+ Generating photorealistic 3D faces from given conditions is a challenging
+task. Existing methods often rely on time-consuming one-by-one optimization
+approaches, which are not efficient for modeling the same distribution content,
+e.g., faces. Additionally, an ideal controllable 3D face generation model
+should consider both facial attributes and expressions. Thus we propose a novel
+approach called TEx-Face(TExt & Expression-to-Face) that addresses these
+challenges by dividing the task into three components, i.e., 3D GAN Inversion,
+Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion,
+we introduce two methods which aim to enhance the representation of style codes
+and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser
+to incorporate multiple conditions into the style code and propose a data
+augmentation strategy to address the issue of insufficient paired
+visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and
+CelebA-Dialog demonstrate the promising performance of our TEx-Face in
+achieving the efficient and controllable generation of photorealistic 3D faces.
+The code will be available at https://github.com/sxl142/TEx-Face.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models
+
+
+ This paper presents Paint3D, a novel coarse-to-fine generative framework that
+is capable of producing high-resolution, lighting-less, and diverse 2K UV
+texture maps for untextured 3D meshes conditioned on text or image inputs. The
+key challenge addressed is generating high-quality textures without embedded
+illumination information, which allows the textures to be re-lighted or
+re-edited within modern graphics pipelines. To achieve this, our method first
+leverages a pre-trained depth-aware 2D diffusion model to generate
+view-conditional images and perform multi-view texture fusion, producing an
+initial coarse texture map. However, as 2D models cannot fully represent 3D
+shapes and disable lighting effects, the coarse texture map exhibits incomplete
+areas and illumination artifacts. To resolve this, we train separate UV
+Inpainting and UVHD diffusion models specialized for the shape-aware refinement
+of incomplete areas and the removal of illumination artifacts. Through this
+coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that
+maintain semantic consistency while being lighting-less, significantly
+advancing the state-of-the-art in texturing 3D objects.
+
+
+
+
+
+
+
+ ☆ EfficientPPS: Part-aware Panoptic Segmentation of Transparent Objects
+ for Robotic Manipulation
+
+
+
+
+
+
+
+
+ Benjamin Alt, Minh Dang Nguyen, Andreas Hermann, Darko Katic, Rainer Jäkel, Rüdiger Dillmann, Eric Sax
+
+
+ The use of autonomous robots for assistance tasks in hospitals has the
+potential to free up qualified staff and im-prove patient care. However, the
+ubiquity of deformable and transparent objects in hospital settings poses
+signif-icant challenges to vision-based perception systems. We present
+EfficientPPS, a neural architecture for part-aware panoptic segmentation that
+provides robots with semantically rich visual information for grasping and
+ma-nipulation tasks. We also present an unsupervised data collection and
+labelling method to reduce the need for human involvement in the training
+process. EfficientPPS is evaluated on a dataset containing real-world hospital
+objects and demonstrated to be robust and efficient in grasping transparent
+transfusion bags with a collaborative robot arm.
+
+
+
+ comment: 8 pages, 8 figures, presented at the 56th International Symposium on
+ Robotics (ISR Europe)
+
+ The zero-shot performance of visual question answering (VQA) models relies
+heavily on prompts. For example, a zero-shot VQA for disaster scenarios could
+leverage well-designed Chain of Thought (CoT) prompts to stimulate the model's
+potential. However, using CoT prompts has some problems, such as causing an
+incorrect answer in the end due to the hallucination in the thought process. In
+this paper, we propose a zero-shot VQA named Flood Disaster VQA with Two-Stage
+Prompt (VQA-TSP). The model generates the thought process in the first stage
+and then uses the thought process to generate the final answer in the second
+stage. In particular, visual context is added in the second stage to relieve
+the hallucination problem that exists in the thought process. Experimental
+results show that our method exceeds the performance of state-of-the-art
+zero-shot VQA models for flood disaster scenarios in total. Our study provides
+a research basis for improving the performance of CoT-based zero-shot VQA.
+
+
+
+ comment: already be accepted by 2024 3rd International Conference on Computer,
+ Artificial Intelligence and Control Engineering (CAICE 2024)
+
+ In various verification systems, Restricted Boltzmann Machines (RBMs) have
+demonstrated their efficacy in both front-end and back-end processes. In this
+work, we propose the use of RBMs to the image clustering tasks. RBMs are
+trained to convert images into image embeddings. We employ the conventional
+bottom-up Agglomerative Hierarchical Clustering (AHC) technique. To address the
+challenge of limited test face image data, we introduce Agglomerative
+Hierarchical Clustering based Method for Image Clustering using Restricted
+Boltzmann Machine (AHC-RBM) with two major steps. Initially, a universal RBM
+model is trained using all available training dataset. Subsequently, we train
+an adapted RBM model using the data from each test image. Finally, RBM vectors
+which is the embedding vector is generated by concatenating the
+visible-to-hidden weight matrices of these adapted models, and the bias
+vectors. These vectors effectively preserve class-specific information and are
+utilized in image clustering tasks. Our experimental results, conducted on two
+benchmark image datasets (MS-Celeb-1M and DeepFashion), demonstrate that our
+proposed approach surpasses well-known clustering algorithms such as k-means,
+spectral clustering, and approximate Rank-order.
+
+
+
+
+
+
+
+ ☆ Towards Efficient Time Stepping for Numerical Shape Correspondence
+
+
+ The computation of correspondences between shapes is a principal task in
+shape analysis. To this end, methods based on partial differential equations
+(PDEs) have been established, encompassing e.g. the classic heat kernel
+signature as well as numerical solution schemes for geometric PDEs. In this
+work we focus on the latter approach.
+ We consider here several time stepping schemes. The goal of this
+investigation is to assess, if one may identify a useful property of methods
+for time integration for the shape analysis context. Thereby we investigate the
+dependence on time step size, since the class of implicit schemes that are
+useful candidates in this context should ideally yield an invariant behaviour
+with respect to this parameter.
+ To this end we study integration of heat and wave equation on a manifold. In
+order to facilitate this study, we propose an efficient, unified model order
+reduction framework for these models. We show that specific $l_0$ stable
+schemes are favourable for numerical shape analysis. We give an experimental
+evaluation of the methods at hand of classical TOSCA data sets.
+
+
+
+
+
+
+
+
+ Thomas Norrenbrock, Marco Rudolph, Bodo Rosenhahn
+
+
+ Explanations in Computer Vision are often desired, but most Deep Neural
+Networks can only provide saliency maps with questionable faithfulness.
+Self-Explaining Neural Networks (SENN) extract interpretable concepts with
+fidelity, diversity, and grounding to combine them linearly for
+decision-making. While they can explain what was recognized, initial
+realizations lack accuracy and general applicability. We propose the
+Quantized-Self-Explaining Neural Network Q-SENN. Q-SENN satisfies or exceeds
+the desiderata of SENN while being applicable to more complex datasets and
+maintaining most or all of the accuracy of an uninterpretable baseline model,
+out-performing previous work in all considered metrics. Q-SENN describes the
+relationship between every class and feature as either positive, negative or
+neutral instead of an arbitrary number of possible relations, enforcing more
+binary human-friendly features. Since every class is assigned just 5
+interpretable features on average, Q-SENN shows convincing local and global
+interpretability. Additionally, we propose a feature alignment method, capable
+of aligning learned features with human language-based concepts without
+additional supervision. Thus, what is learned can be more easily verbalized.
+The code is published: https://github.com/ThomasNorr/Q-SENN
+
+
+
+ comment: Accepted to AAAI 2024, SRRAI
+
+
+
+
+
+
+ ☆ SyncDreamer for 3D Reconstruction of Endangered Animal Species with NeRF
+ and NeuS
+
+
+
+
+
+
+
+
+ Ahmet Haydar Ornek, Deniz Sen, Esmanur Civil
+
+
+ The main aim of this study is to demonstrate how innovative view synthesis
+and 3D reconstruction techniques can be used to create models of endangered
+species using monocular RGB images. To achieve this, we employed SyncDreamer to
+produce unique perspectives and NeuS and NeRF to reconstruct 3D
+representations. We chose four different animals, including the oriental stork,
+frog, dragonfly, and tiger, as our subjects for this study. Our results show
+that the combination of SyncDreamer, NeRF, and NeuS techniques can successfully
+create 3D models of endangered animals. However, we also observed that NeuS
+produced blurry images, while NeRF generated sharper but noisier images. This
+study highlights the potential of modeling endangered animals and offers a new
+direction for future research in this field. By showcasing the effectiveness of
+these advanced techniques, we hope to encourage further exploration and
+development of techniques for preserving and studying endangered species.
+
+
+
+ comment: 8 figures
+
+
+
+
+
+
+ ☆ Universal Noise Annotation: Unveiling the Impact of Noisy annotation on
+ Object Detection
+
+
+
+
+
+
+
+
+ Kwangrok Ryoo, Yeonsik Jo, Seungjun Lee, Mira Kim, Ahra Jo, Seung Hwan Kim, Seungryong Kim, Soonyoung Lee
+
+
+ For object detection task with noisy labels, it is important to consider not
+only categorization noise, as in image classification, but also localization
+noise, missing annotations, and bogus bounding boxes. However, previous studies
+have only addressed certain types of noise (e.g., localization or
+categorization). In this paper, we propose Universal-Noise Annotation (UNA), a
+more practical setting that encompasses all types of noise that can occur in
+object detection, and analyze how UNA affects the performance of the detector.
+We analyzed the development direction of previous works of detection algorithms
+and examined the factors that impact the robustness of detection model learning
+method. We open-source the code for injecting UNA into the dataset and all the
+training log and weight are also shared.
+
+
+
+ comment: appendix and code : https://github.com/Ryoo72/UNA
+
+
+
+
+
+
+ ☆ Super-resolution of THz time-domain images based on low-rank
+ representation
+
+
+ Terahertz time-domain spectroscopy (THz-TDS) employs sub-picosecond pulses to
+probe dielectric properties of materials giving as a result a 3-dimensional
+hyperspectral data cube. The spatial resolution of THz images is primarily
+limited by two sources: a non-zero THz beam waist and the acquisition step
+size. Acquisition with a small step size allows for the visualisation of
+smaller details in images at the expense of acquisition time, but the
+frequency-dependent point-spread function remains the biggest bottleneck for
+THz imaging. This work presents a super-resolution approach to restore THz
+time-domain images acquired with medium-to-big step sizes. The results show the
+optimized and robust performance for different frequency bands (from 0.5 to 3.5
+THz) obtaining higher resolution and additionally removing effects of blur at
+lower frequencies and noise at higher frequencies.
+
+
+
+ comment: This work was presented at the Sixth International Workshop on Mobile
+ Terahertz Systems (IWMTS)
+
+
+
+
+
+
+ ☆ An Approach to Colour Morphological Supremum Formation using the
+ LogSumExp Approximation
+
+
+
+
+
+
+
+
+ Marvin Kahra, Michael Breuß, Andreas Kleefeld, Martin Welk
+
+
+ Mathematical morphology is a part of image processing that has proven to be
+fruitful for numerous applications. Two main operations in mathematical
+morphology are dilation and erosion. These are based on the construction of a
+supremum or infimum with respect to an order over the tonal range in a certain
+section of the image. The tonal ordering can easily be realised in grey-scale
+morphology, and some morphological methods have been proposed for colour
+morphology. However, all of these have certain limitations. In this paper we
+present a novel approach to colour morphology extending upon previous work in
+the field based on the Loewner order. We propose to consider an approximation
+of the supremum by means of a log-sum exponentiation introduced by Maslov. We
+apply this to the embedding of an RGB image in a field of symmetric $2\times2$
+matrices. In this way we obtain nearly isotropic matrices representing colours
+and the structural advantage of transitivity. In numerical experiments we
+highlight some remarkable properties of the proposed approach.
+
+
+
+ comment: 12 pages, 28 figures, submitted to IAPR Third International
+ Conference on Discrete Geometry and Mathematical Morphology
+
+
+
+
+
+
+ ☆ TinySAM: Pushing the Envelope for Efficient Segment Anything Model
+
+
+ Recently segment anything model (SAM) has shown powerful segmentation
+capability and has drawn great attention in computer vision fields. Massive
+following works have developed various applications based on the pretrained SAM
+and achieved impressive performance on downstream vision tasks. However, SAM
+consists of heavy architectures and requires massive computational capacity,
+which hinders the further application of SAM on computation constrained edge
+devices. To this end, in this paper we propose a framework to obtain a tiny
+segment anything model (TinySAM) while maintaining the strong zero-shot
+performance. We first propose a full-stage knowledge distillation method with
+online hard prompt sampling strategy to distill a lightweight student model. We
+also adapt the post-training quantization to the promptable segmentation task
+and further reduce the computational cost. Moreover, a hierarchical segmenting
+everything strategy is proposed to accelerate the everything inference by
+$2\times$ with almost no performance degradation. With all these proposed
+methods, our TinySAM leads to orders of magnitude computational reduction and
+pushes the envelope for efficient segment anything task. Extensive experiments
+on various zero-shot transfer tasks demonstrate the significantly advantageous
+performance of our TinySAM against counterpart methods. Pre-trained models and
+codes will be available at https://github.com/xinghaochen/TinySAM and
+https://gitee.com/mindspore/models/tree/master/research/cv/TinySAM.
+
+
+
+
+
+
+
+ ☆ Few Shot Part Segmentation Reveals Compositional Logic for Industrial
+ Anomaly Detection AAAI2024
+
+
+
+
+
+
+
+
+ Soopil Kim, Sion An, Philip Chikontwe, Myeongkyun Kang, Ehsan Adeli, Kilian M. Pohl, Sanghyun Park
+
+
+ Logical anomalies (LA) refer to data violating underlying logical constraints
+e.g., the quantity, arrangement, or composition of components within an image.
+Detecting accurately such anomalies requires models to reason about various
+component types through segmentation. However, curation of pixel-level
+annotations for semantic segmentation is both time-consuming and expensive.
+Although there are some prior few-shot or unsupervised co-part segmentation
+algorithms, they often fail on images with industrial object. These images have
+components with similar textures and shapes, and a precise differentiation
+proves challenging. In this study, we introduce a novel component segmentation
+model for LA detection that leverages a few labeled samples and unlabeled
+images sharing logical constraints. To ensure consistent segmentation across
+unlabeled images, we employ a histogram matching loss in conjunction with an
+entropy loss. As segmentation predictions play a crucial role, we propose to
+enhance both local and global sample validity detection by capturing key
+aspects from visual semantics via three memory banks: class histograms,
+component composition embeddings and patch-level representations. For effective
+LA detection, we propose an adaptive scaling strategy to standardize anomaly
+scores from different memory banks in inference. Extensive experiments on the
+public benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA
+detection vs. 89.6% from competing methods.
+
+
+
+ comment: Accepted at AAAI2024
+
+
+
+
+
+
+ ☆ Progressive Evolution from Single-Point to Polygon for Scene Text
+
+
+ The advancement of text shape representations towards compactness has
+enhanced text detection and spotting performance, but at a high annotation
+cost. Current models use single-point annotations to reduce costs, yet they
+lack sufficient localization information for downstream applications. To
+overcome this limitation, we introduce Point2Polygon, which can efficiently
+transform single-points into compact polygons. Our method uses a coarse-to-fine
+process, starting with creating and selecting anchor points based on
+recognition confidence, then vertically and horizontally refining the polygon
+using recognition information to optimize its shape. We demonstrate the
+accuracy of the generated polygons through extensive experiments: 1) By
+creating polygons from ground truth points, we achieved an accuracy of 82.0% on
+ICDAR 2015; 2) In training detectors with polygons generated by our method, we
+attained 86% of the accuracy relative to training with ground truth (GT); 3)
+Additionally, the proposed Point2Polygon can be seamlessly integrated to
+empower single-point spotters to generate polygons. This integration led to an
+impressive 82.5% accuracy for the generated polygons. It is worth mentioning
+that our method relies solely on synthetic recognition information, eliminating
+the need for any manual annotation beyond single points.
+
+
+
+
+
+
+
+ ☆ Pose-based Tremor Type and Level Analysis for Parkinson's Disease from
+ Video
+
+
+
+
+
+
+
+
+ Haozheng Zhang, Edmond S. L. Ho, Xiatian Zhang, Silvia Del Din, Hubert P. H. Shum
+
+
+ Purpose:Current methods for diagnosis of PD rely on clinical examination. The
+accuracy of diagnosis ranges between 73% and 84%, and is influenced by the
+experience of the clinical assessor. Hence, an automatic, effective and
+interpretable supporting system for PD symptom identification would support
+clinicians in making more robust PD diagnostic decisions. Methods: We propose
+to analyze Parkinson's tremor (PT) to support the analysis of PD, since PT is
+one of the most typical symptoms of PD with broad generalizability. To realize
+the idea, we present SPA-PTA, a deep learning-based PT classification and
+severity estimation system that takes consumer-grade videos of front-facing
+humans as input. The core of the system is a novel attention module with a
+lightweight pyramidal channel-squeezing-fusion architecture that effectively
+extracts relevant PT information and filters noise. It enhances modeling
+performance while improving system interpretability. Results:We validate our
+system via individual-based leave-one-out cross-validation on two tasks: the PT
+classification task and the tremor severity rating estimation task. Our system
+presents a 91.3% accuracy and 80.0% F1-score in classifying PT with non-PT
+class, while providing a 76.4% accuracy and 76.7% F1-score in more complex
+multiclass tremor rating classification task. Conclusion: Our system offers a
+cost-effective PT classification and tremor severity estimation results as
+warning signs of PD for undiagnosed patients with PT symptoms. In addition, it
+provides a potential solution for supporting PD diagnosis in regions with
+limited clinical resources.
+
+
+ Recent advancements in large language models (LLMs) have led to the creation
+of intelligent agents capable of performing complex tasks. This paper
+introduces a novel LLM-based multimodal agent framework designed to operate
+smartphone applications. Our framework enables the agent to operate smartphone
+applications through a simplified action space, mimicking human-like
+interactions such as tapping and swiping. This novel approach bypasses the need
+for system back-end access, thereby broadening its applicability across diverse
+apps. Central to our agent's functionality is its innovative learning method.
+The agent learns to navigate and use new apps either through autonomous
+exploration or by observing human demonstrations. This process generates a
+knowledge base that the agent refers to for executing complex tasks across
+different applications. To demonstrate the practicality of our agent, we
+conducted extensive testing over 50 tasks in 10 different applications,
+including social media, email, maps, shopping, and sophisticated image editing
+tools. The results affirm our agent's proficiency in handling a diverse array
+of high-level tasks.
+
+
+
+ comment: 10 pages, 3 figures, 2 tables
+
+
+
+
+
+
+ ☆ 3D Points Splatting for Real-Time Dynamic Hand Reconstruction
+
+
+
+
+
+
+
+
+ Zheheng Jiang, Hossein Rahmani, Sue Black, Bryan M. Williams
+
+
+ We present 3D Points Splatting Hand Reconstruction (3D-PSHR), a real-time and
+photo-realistic hand reconstruction approach. We propose a self-adaptive
+canonical points upsampling strategy to achieve high-resolution hand geometry
+representation. This is followed by a self-adaptive deformation that deforms
+the hand from the canonical space to the target pose, adapting to the dynamic
+changing of canonical points which, in contrast to the common practice of
+subdividing the MANO model, offers greater flexibility and results in improved
+geometry fitting. To model texture, we disentangle the appearance color into
+the intrinsic albedo and pose-aware shading, which are learned through a
+Context-Attention module. Moreover, our approach allows the geometric and the
+appearance models to be trained simultaneously in an end-to-end manner. We
+demonstrate that our method is capable of producing animatable, photorealistic
+and relightable hand reconstructions using multiple datasets, including
+monocular videos captured with handheld smartphones and large-scale multi-view
+videos featuring various hand poses. We also demonstrate that our approach
+achieves real-time rendering speeds while simultaneously maintaining superior
+performance compared to existing state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ A Semantic Space is Worth 256 Language Descriptions: Make Stronger
+ Segmentation Models with Descriptive Properties
+
+
+ This paper introduces ProLab, a novel approach using property-level label
+space for creating strong interpretable segmentation models. Instead of relying
+solely on category-specific annotations, ProLab uses descriptive properties
+grounded in common sense knowledge for supervising segmentation models. It is
+based on two core designs. First, we employ Large Language Models (LLMs) and
+carefully crafted prompts to generate descriptions of all involved categories
+that carry meaningful common sense knowledge and follow a structured format.
+Second, we introduce a description embedding model preserving semantic
+correlation across descriptions and then cluster them into a set of descriptive
+properties (e.g., 256) using K-Means. These properties are based on
+interpretable common sense knowledge consistent with theories of human
+recognition. We empirically show that our approach makes segmentation models
+perform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal
+Context, Cityscapes, and BDD). Our method also shows better scalability with
+extended training steps than category-level supervision. Our interpretable
+segmentation framework also emerges with the generalization ability to segment
+out-of-domain or unknown categories using only in-domain descriptive
+properties. Code is available at https://github.com/lambert-x/ProLab.
+
+
+
+ comment: Preprint. Code is available at https://github.com/lambert-x/ProLab
+
+
+
+
+
+
+ ☆ Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed
+ Diffusion Models
+
+
+
+
+
+
+
+
+ Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, Karsten Kreis
+
+
+ Text-guided diffusion models have revolutionized image and video generation
+and have also been successfully used for optimization-based 3D object
+synthesis. Here, we instead focus on the underexplored text-to-4D setting and
+synthesize dynamic, animated 3D objects using score distillation methods with
+an additional temporal dimension. Compared to previous work, we pursue a novel
+compositional generation-based approach, and combine text-to-image,
+text-to-video, and 3D-aware multiview diffusion models to provide feedback
+during 4D object optimization, thereby simultaneously enforcing temporal
+consistency, high-quality visual appearance and realistic geometry. Our method,
+called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with
+deformation fields as 4D representation. Crucial to AYG is a novel method to
+regularize the distribution of the moving 3D Gaussians and thereby stabilize
+the optimization and induce motion. We also propose a motion amplification
+mechanism as well as a new autoregressive synthesis scheme to generate and
+combine multiple 4D sequences for longer generation. These techniques allow us
+to synthesize vivid dynamic scenes, outperform previous work qualitatively and
+quantitatively and achieve state-of-the-art text-to-4D performance. Due to the
+Gaussian 4D representation, different 4D animations can be seamlessly combined,
+as we demonstrate. AYG opens up promising avenues for animation, simulation and
+digital content creation as well as synthetic data generation.
+
+
+
+
+
+
+
+ ☆ Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the
+ AIIB23 challenge
+
+
+
+
+
+
+
+
+ Yang Nan, Xiaodan Xing, Shiyi Wang, Zeyu Tang, Federico N Felder, Sheng Zhang, Roberta Eufrasia Ledda, Xiaoliu Ding, Ruiqi Yu, Weiping Liu, Feng Shi, Tianyang Sun, Zehong Cao, Minghui Zhang, Yun Gu, Hanxiao Zhang, Jian Gao, Wen Tang, Pengxin Yu, Han Kang, Junqiang Chen, Xing Lu, Boyu Zhang, Michail Mamalakis, Francesco Prinzi, Gianluca Carlini, Lisa Cuneo, Abhirup Banerjee, Zhaohu Xing, Lei Zhu, Zacharia Mesbah, Dhruv Jain, Tsiry Mayet, Hongyu Yuan, Qing Lyu, Athol Wells, Simon LF Walsh, Guang Yang
+
+
+ Airway-related quantitative imaging biomarkers are crucial for examination,
+diagnosis, and prognosis in pulmonary diseases. However, the manual delineation
+of airway trees remains prohibitively time-consuming. While significant efforts
+have been made towards enhancing airway modelling, current public-available
+datasets concentrate on lung diseases with moderate morphological variations.
+The intricate honeycombing patterns present in the lung tissues of fibrotic
+lung disease patients exacerbate the challenges, often leading to various
+prediction errors. To address this issue, the 'Airway-Informed Quantitative CT
+Imaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was
+organized in conjunction with the official 2023 International Conference on
+Medical Image Computing and Computer Assisted Intervention (MICCAI). The airway
+structures were meticulously annotated by three experienced radiologists.
+Competitors were encouraged to develop automatic airway segmentation models
+with high robustness and generalization abilities, followed by exploring the
+most correlated QIB of mortality prediction. A training set of 120
+high-resolution computerised tomography (HRCT) scans were publicly released
+with expert annotations and mortality status. The online validation set
+incorporated 52 HRCT scans from patients with fibrotic lung disease and the
+offline test set included 140 cases from fibrosis and COVID-19 patients. The
+results have shown that the capacity of extracting airway trees from patients
+with fibrotic lung disease could be enhanced by introducing voxel-wise weighted
+general union loss and continuity loss. In addition to the competitive image
+biomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5,
+p<0.0001) was revealed for survival prognostication compared with existing
+clinical measurements, clinician assessment and AI-based biomarkers.
+
+
+
+ comment: 19 pages
+
+
+
+
+
+
+ ☆ Video Recognition in Portrait Mode
+
+
+ The creation of new datasets often presents new challenges for video
+recognition and can inspire novel ideas while addressing these challenges.
+While existing datasets mainly comprise landscape mode videos, our paper seeks
+to introduce portrait mode videos to the research community and highlight the
+unique challenges associated with this video format. With the growing
+popularity of smartphones and social media applications, recognizing portrait
+mode videos is becoming increasingly important. To this end, we have developed
+the first dataset dedicated to portrait mode video recognition, namely
+PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a
+data-driven manner, comprising 400 fine-grained categories, and rigorous
+quality assurance was implemented to ensure the accuracy of human annotations.
+In addition to the new dataset, we conducted a comprehensive analysis of the
+impact of video format (portrait mode versus landscape mode) on recognition
+accuracy and spatial bias due to the different formats. Furthermore, we
+designed extensive experiments to explore key aspects of portrait mode video
+recognition, including the choice of data augmentation, evaluation procedure,
+the importance of temporal information, and the role of audio modality.
+Building on the insights from our experimental results and the introduction of
+PortraitMode-400, our paper aims to inspire further research efforts in this
+emerging research area.
+
+
+
+ comment: See mingfei.info/PMV for data and code information
+
+ Detection Transformer (DETR) and its variants have shown great potential for
+accurate object detection in recent years. The mechanism of object query
+enables DETR family to directly obtain a fixed number of object predictions and
+streamlines the detection pipeline. Meanwhile, recent studies also reveal that
+with proper architecture design, convolution networks (ConvNets) also achieve
+competitive performance with transformers, \eg, ConvNeXt. To this end, in this
+paper we explore whether we could build a query-based end-to-end object
+detection framework with ConvNets instead of sophisticated transformer
+architecture. The proposed framework, \ie, Detection ConvNet (DECO), is
+composed of a backbone and convolutional encoder-decoder architecture. We
+carefully design the DECO encoder and propose a novel mechanism for our DECO
+decoder to perform interaction between object queries and image features via
+convolutional layers. We compare the proposed DECO against prior detectors on
+the challenging COCO benchmark. Despite its simplicity, our DECO achieves
+competitive performance in terms of detection accuracy and running speed.
+Specifically, with the ResNet-50 and ConvNeXt-Tiny backbone, DECO obtains
+$38.6\%$ and $40.8\%$ AP on COCO \textit{val} set with $35$ and $28$ FPS
+respectively and outperforms the DETR model. Incorporated with advanced
+multi-scale feature module, our DECO+ achieves $47.8\%$ AP with $34$ FPS. We
+hope the proposed DECO brings another perspective for designing object
+detection framework.
+
+
+
+
+
+
+
+ ☆ Gaussian Splitting Algorithm with Color and Opacity Depended on Viewing
+ Direction
+
+
+ Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of
+neural networks to capture the intricacies of 3D objects. By encoding the shape
+and color information within neural network weights, NeRFs excel at producing
+strikingly sharp novel views of 3D objects. Recently, numerous generalizations
+of NeRFs utilizing generative models have emerged, expanding its versatility.
+In contrast, Gaussian Splatting (GS) offers a similar renders quality with
+faster training and inference as it does not need neural networks to work. We
+encode information about the 3D objects in the set of Gaussian distributions
+that can be rendered in 3D similarly to classical meshes. Unfortunately, GS are
+difficult to condition since they usually require circa hundred thousand
+Gaussian components. To mitigate the caveats of both models, we propose a
+hybrid model that uses GS representation of the 3D object's shape and
+NeRF-based encoding of color and opacity. Our model uses Gaussian distributions
+with trainable positions (i.e. means of Gaussian), shape (i.e. covariance of
+Gaussian), color and opacity, and neural network, which takes parameters of
+Gaussian and viewing direction to produce changes in color and opacity.
+Consequently, our model better describes shadows, light reflections, and
+transparency of 3D objects.
+
+
+
+
+
+
+
+ ☆ Bootstrap Masked Visual Modeling via Hard Patches Mining
+
+
+ Masked visual modeling has attracted much attention due to its promising
+potential in learning generalizable representations. Typical approaches urge
+models to predict specific contents of masked tokens, which can be intuitively
+considered as teaching a student (the model) to solve given problems
+(predicting masked contents). Under such settings, the performance is highly
+correlated with mask strategies (the difficulty of provided problems). We argue
+that it is equally important for the model to stand in the shoes of a teacher
+to produce challenging problems by itself. Intuitively, patches with high
+values of reconstruction loss can be regarded as hard samples, and masking
+those hard patches naturally becomes a demanding reconstruction task. To
+empower the model as a teacher, we propose Hard Patches Mining (HPM),
+predicting patch-wise losses and subsequently determining where to mask.
+Technically, we introduce an auxiliary loss predictor, which is trained with a
+relative objective to prevent overfitting to exact loss values. Also, to
+gradually guide the training procedure, we propose an easy-to-hard mask
+strategy. Empirically, HPM brings significant improvements under both image and
+video benchmarks. Interestingly, solely incorporating the extra loss prediction
+objective leads to better representations, verifying the efficacy of
+determining where is hard to reconstruct. The code is available at
+https://github.com/Haochen-Wang409/HPM.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2304.05919
+
+
+
+
+
+
+ ☆ DreamTuner: Single Image is Enough for Subject-Driven Generation
+
+
+
+
+
+
+
+
+ Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, Qian He
+
+
+ Diffusion-based models have demonstrated impressive capabilities for
+text-to-image generation and are expected for personalized applications of
+subject-driven generation, which require the generation of customized concepts
+with one or a few reference images. However, existing methods based on
+fine-tuning fail to balance the trade-off between subject learning and the
+maintenance of the generation capabilities of pretrained models. Moreover,
+other methods that utilize additional image encoders tend to lose important
+details of the subject due to encoding compression. To address these
+challenges, we propose DreamTurner, a novel method that injects reference
+information from coarse to fine to achieve subject-driven image generation more
+effectively. DreamTurner introduces a subject-encoder for coarse subject
+identity preservation, where the compressed general subject features are
+introduced through an attention layer before visual-text cross-attention. We
+then modify the self-attention layers within pretrained text-to-image models to
+self-subject-attention layers to refine the details of the target subject. The
+generated image queries detailed features from both the reference image and
+itself in self-subject-attention. It is worth emphasizing that
+self-subject-attention is an effective, elegant, and training-free method for
+maintaining the detailed features of customized subjects and can serve as a
+plug-and-play solution during inference. Finally, with additional
+subject-driven fine-tuning, DreamTurner achieves remarkable performance in
+subject-driven image generation, which can be controlled by a text or other
+conditions such as pose. For further details, please visit the project page at
+https://dreamtuner-diffusion.github.io/.
+
+
+
+
+
+
+
+ ☆ Free-Editor: Zero-shot Text-driven 3D Scene Editing
+
+
+
+
+
+
+
+
+ Nazmul Karim, Umar Khalid, Hasan Iqbal, Jing Hua, Chen Chen
+
+
+ Text-to-Image (T2I) diffusion models have gained popularity recently due to
+their multipurpose and easy-to-use nature, e.g. image and video generation as
+well as editing. However, training a diffusion model specifically for 3D scene
+editing is not straightforward due to the lack of large-scale datasets. To
+date, editing 3D scenes requires either re-training the model to adapt to
+various 3D edited scenes or design-specific methods for each special editing
+type. Furthermore, state-of-the-art (SOTA) methods require multiple
+synchronized edited images from the same scene to facilitate the scene editing.
+Due to the current limitations of T2I models, it is very challenging to apply
+consistent editing effects to multiple images, i.e. multi-view inconsistency in
+editing. This in turn compromises the desired 3D scene editing performance if
+these images are used. In our work, we propose a novel training-free 3D scene
+editing technique, Free-Editor, which allows users to edit 3D scenes without
+further re-training the model during test time. Our proposed method
+successfully avoids the multi-view style inconsistency issue in SOTA methods
+with the help of a "single-view editing" scheme. Specifically, we show that
+editing a particular 3D scene can be performed by only modifying a single view.
+To this end, we introduce an Edit Transformer that enforces intra-view
+consistency and inter-view style transfer by utilizing self- and
+cross-attention, respectively. Since it is no longer required to re-train the
+model and edit every view in a scene, the editing time, as well as memory
+resources, are reduced significantly, e.g., the runtime being $\sim \textbf{20}
+\times$ faster than SOTA. We have conducted extensive experiments on a wide
+range of benchmark datasets and achieve diverse editing capabilities with our
+proposed technique.
+
+
+
+
+
+
+
+ ☆ Compositional Zero-Shot Learning for Attribute-Based Object Reference in
+ Human-Robot Interaction
+
+
+
+
+
+
+
+
+ Peng Gao, Ahmed Jaafar, Brian Reily, Christopher Reardon, Hao Zhang
+
+
+ Language-enabled robots have been widely studied over the past years to
+enable natural human-robot interaction and teaming in various real-world
+applications. Language-enabled robots must be able to comprehend referring
+expressions to identify a particular object from visual perception using a set
+of referring attributes extracted from natural language. However, visual
+observations of an object may not be available when it is referred to, and the
+number of objects and attributes may also be unbounded in open worlds. To
+address the challenges, we implement an attribute-based compositional zero-shot
+learning method that uses a list of attributes to perform referring expression
+comprehension in open worlds. We evaluate the approach on two datasets
+including the MIT-States and the Clothing 16K. The preliminary experimental
+results show that our implemented approach allows a robot to correctly identify
+the objects referred to by human commands.
+
+
+
+ comment: Equal contribution from the first two authors
+
+ State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS)
+using image-level labels exhibit severe performance degradation on driving
+scene datasets such as Cityscapes. To address this challenge, we develop a new
+WSSS framework tailored to driving scene datasets. Based on extensive analysis
+of dataset characteristics, we employ Contrastive Language-Image Pre-training
+(CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key
+challenges: (1) pseudo-masks from CLIP lack in representing small object
+classes, and (2) these masks contain notable noise. We propose solutions for
+each issue as follows. (1) We devise Global-Local View Training that seamlessly
+incorporates small-scale patches during model training, thereby enhancing the
+model's capability to handle small-sized yet critical objects in driving scenes
+(e.g., traffic light). (2) We introduce Consistency-Aware Region Balancing
+(CARB), a novel technique that discerns reliable and noisy regions through
+evaluating the consistency between CLIP masks and segmentation predictions. It
+prioritizes reliable pixels over noisy pixels via adaptive loss weighting.
+Notably, the proposed method achieves 51.8\% mIoU on the Cityscapes test
+dataset, showcasing its potential as a strong WSSS baseline on driving scene
+datasets. Experimental results on CamVid and WildDash2 demonstrate the
+effectiveness of our method across diverse datasets, even with small-scale
+datasets or visually challenging conditions. The code is available at
+https://github.com/k0u-id/CARB.
+
+
+
+
+
+
+
+ ☆ SPGroup3D: Superpoint Grouping Network for Indoor 3D Object Detection AAAI 2024
+
+
+
+
+
+
+
+
+ Yun Zhu, Le Hui, Yaqi Shen, Jin Xie
+
+
+ Current 3D object detection methods for indoor scenes mainly follow the
+voting-and-grouping strategy to generate proposals. However, most methods
+utilize instance-agnostic groupings, such as ball query, leading to
+inconsistent semantic information and inaccurate regression of the proposals.
+To this end, we propose a novel superpoint grouping network for indoor
+anchor-free one-stage 3D object detection. Specifically, we first adopt an
+unsupervised manner to partition raw point clouds into superpoints, areas with
+semantic consistency and spatial similarity. Then, we design a geometry-aware
+voting module that adapts to the centerness in anchor-free detection by
+constraining the spatial relationship between superpoints and object centers.
+Next, we present a superpoint-based grouping module to explore the consistent
+representation within proposals. This module includes a superpoint attention
+layer to learn feature interaction between neighboring superpoints, and a
+superpoint-voxel fusion layer to propagate the superpoint-level information to
+the voxel level. Finally, we employ effective multiple matching to capitalize
+on the dynamic receptive fields of proposals based on superpoints during the
+training. Experimental results demonstrate our method achieves state-of-the-art
+performance on ScanNet V2, SUN RGB-D, and S3DIS datasets in the indoor
+one-stage 3D object detection. Source code is available at
+https://github.com/zyrant/SPGroup3D.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video
+ Grounding
+
+
+
+
+
+
+
+
+ Haifeng Huang, Yang Zhao, Zehan Wang, Yan Xia, Zhou Zhao
+
+
+ Temporal Video Grounding (TVG) aims to localize the temporal boundary of a
+specific segment in an untrimmed video based on a given language query. Since
+datasets in this domain are often gathered from limited video scenes, models
+tend to overfit to scene-specific factors, which leads to suboptimal
+performance when encountering new scenes in real-world applications. In a new
+scene, the fine-grained annotations are often insufficient due to the expensive
+labor cost, while the coarse-grained video-query pairs are easier to obtain.
+Thus, to address this issue and enhance model performance on new scenes, we
+explore the TVG task in an unsupervised domain adaptation (UDA) setting across
+scenes for the first time, where the video-query pairs in the source scene
+(domain) are labeled with temporal boundaries, while those in the target scene
+are not. Under the UDA setting, we introduce a novel Adversarial Multi-modal
+Domain Adaptation (AMDA) method to adaptively adjust the model's scene-related
+knowledge by incorporating insights from the target data. Specifically, we
+tackle the domain gap by utilizing domain discriminators, which help identify
+valuable scene-related features effective across both domains. Concurrently, we
+mitigate the semantic gap between different modalities by aligning video-query
+pairs with related semantics. Furthermore, we employ a mask-reconstruction
+approach to enhance the understanding of temporal semantics within a scene.
+Extensive experiments on Charades-STA, ActivityNet Captions, and YouCook2
+demonstrate the effectiveness of our proposed method.
+
+
+
+
+
+
+
+ ☆ ProvFL: Client-Driven Interpretability of Global Model Predictions in
+ Federated Learning
+
+
+
+
+
+
+
+
+ Waris Gill, Ali Anwar, Muhammad Ali Gulzar
+
+
+ Federated Learning (FL) trains a collaborative machine learning model by
+aggregating multiple privately trained clients' models over several training
+rounds. Such a long, continuous action of model aggregations poses significant
+challenges in reasoning about the origin and composition of such a global
+model. Regardless of the quality of the global model or if it has a fault,
+understanding the model's origin is equally important for debugging,
+interpretability, and explainability in federated learning. FL application
+developers often question: (1) what clients contributed towards a global model
+and (2) if a global model predicts a label, which clients are responsible for
+it?
+ We introduce, neuron provenance, a fine-grained lineage capturing mechanism
+that tracks the flow of information between the individual participating
+clients in FL and the final global model. We operationalize this concept in
+ProvFL that functions on two key principles. First, recognizing that monitoring
+every neuron of every client's model statically is ineffective and noisy due to
+the uninterpretable nature of individual neurons, ProvFL dynamically isolates
+influential and sensitive neurons in the global model, significantly reducing
+the search space. Second, as multiple clients' models are fused in each round
+to form a global model, tracking each client's contribution becomes
+challenging. ProvFL leverages the invertible nature of fusion algorithms to
+precisely isolate each client's contribution derived from selected neurons.
+When asked to localize the clients responsible for the given behavior (i.e.,
+prediction) of the global model, ProvFL successfully localizes them with an
+average provenance accuracy of 97%. Additionally, ProvFL outperforms the
+state-of-the-art FL fault localization approach by an average margin of 50%.
+
+
+
+ comment: 22 pages. For access to the source code used in this study, please
+ contact the authors directly
+
+
+
+
+
+
+ ☆ Diff-Oracle: Diffusion Model for Oracle Character Generation with
+ Controllable Styles and Contents
+
+
+ Deciphering the oracle bone script plays a significant role in Chinese
+archaeology and philology. However, it is significantly challenging due to the
+scarcity of oracle character images. To overcome this issue, we propose
+Diff-Oracle, based on diffusion models (DMs), to generate sufficient
+controllable oracle characters. In contrast to most DMs that rely on text
+prompts, we incorporate a style encoder to control style information during the
+generation process. This encoder extracts style prompts from existing oracle
+character images, where style details are converted from a CLIP model into a
+text embedding format. Inspired by ControlNet, we introduce a content encoder
+to capture desired content information from content images, ensuring the
+fidelity of character glyphs. To train Diff-Oracle effectively, we propose to
+obtain pixel-level paired oracle character images (i.e., style and content
+images) by a pre-trained image-to-image translation model. Extensive
+qualitative and quantitative experiments conducted on two benchmark datasets,
+Oracle-241 and OBC306, demonstrate that our Diff-Oracle outperforms existing
+generative methods in terms of image generation, further enhancing recognition
+accuracy. Source codes will be available.
+
+
+
+
+
+
+
+ ☆ MFABA: A More Faithful and Accelerated Boundary-based Attribution Method
+ for Deep Neural Networks AAAI
+
+
+ To better understand the output of deep neural networks (DNN), attribution
+based methods have been an important approach for model interpretability, which
+assign a score for each input dimension to indicate its importance towards the
+model outcome. Notably, the attribution methods use the axioms of sensitivity
+and implementation invariance to ensure the validity and reliability of
+attribution results. Yet, the existing attribution methods present challenges
+for effective interpretation and efficient computation. In this work, we
+introduce MFABA, an attribution algorithm that adheres to axioms, as a novel
+method for interpreting DNN. Additionally, we provide the theoretical proof and
+in-depth analysis for MFABA algorithm, and conduct a large scale experiment.
+The results demonstrate its superiority by achieving over 101.5142 times faster
+speed than the state-of-the-art attribution algorithms. The effectiveness of
+MFABA is thoroughly evaluated through the statistical analysis in comparison to
+other methods, and the full implementation package is open-source at:
+https://github.com/LMBTough/MFABA
+
+
+
+ comment: Accepted by The 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI-24)
+
+
+
+
+
+
+ ☆ A Comprehensive End-to-End Computer Vision Framework for Restoration and
+ Recognition of Low-Quality Engineering Drawings
+
+
+ The digitization of engineering drawings is crucial for efficient reuse,
+distribution, and archiving. Existing computer vision approaches for digitizing
+engineering drawings typically assume the input drawings have high quality.
+However, in reality, engineering drawings are often blurred and distorted due
+to improper scanning, storage, and transmission, which may jeopardize the
+effectiveness of existing approaches. This paper focuses on restoring and
+recognizing low-quality engineering drawings, where an end-to-end framework is
+proposed to improve the quality of the drawings and identify the graphical
+symbols on them. The framework uses K-means clustering to classify different
+engineering drawing patches into simple and complex texture patches based on
+their gray level co-occurrence matrix statistics. Computer vision operations
+and a modified Enhanced Super-Resolution Generative Adversarial Network
+(ESRGAN) model are then used to improve the quality of the two types of
+patches, respectively. A modified Faster Region-based Convolutional Neural
+Network (Faster R-CNN) model is used to recognize the quality-enhanced
+graphical symbols. Additionally, a multi-stage task-driven collaborative
+learning strategy is proposed to train the modified ESRGAN and Faster R-CNN
+models to improve the resolution of engineering drawings in the direction that
+facilitates graphical symbol recognition, rather than human visual perception.
+A synthetic data generation method is also proposed to construct
+quality-degraded samples for training the framework. Experiments on real-world
+electrical diagrams show that the proposed framework achieves an accuracy of
+98.98% and a recall of 99.33%, demonstrating its superiority over previous
+approaches. Moreover, the framework is integrated into a widely-used power
+system software application to showcase its practicality.
+
+
+
+ comment: 20 pages, 13 figures, submitted to Engineering Applications of
+ Artificial Intelligence
+
+
+
+
+
+
+ ☆ Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos
+
+
+ We introduce Ponymation, a new method for learning a generative model of
+articulated 3D animal motions from raw, unlabeled online videos. Unlike
+existing approaches for motion synthesis, our model does not require any pose
+annotations or parametric shape models for training, and is learned purely from
+a collection of raw video clips obtained from the Internet. We build upon a
+recent work, MagicPony, which learns articulated 3D animal shapes purely from
+single image collections, and extend it on two fronts. First, instead of
+training on static images, we augment the framework with a video training
+pipeline that incorporates temporal regularizations, achieving more accurate
+and temporally consistent reconstructions. Second, we learn a generative model
+of the underlying articulated 3D motion sequences via a spatio-temporal
+transformer VAE, simply using 2D reconstruction losses without relying on any
+explicit pose annotations. At inference time, given a single 2D image of a new
+animal instance, our model reconstructs an articulated, textured 3D mesh, and
+generates plausible 3D animations by sampling from the learned motion latent
+space.
+
+
+
+ comment: Project page: https://keqiangsun.github.io/projects/ponymation. The
+ first two authors contributed equally to this work. The last two authors
+ contributed equally
+
+
+
+
+
+
+ ☆ Towards More Faithful Natural Language Explanation Using Multi-Level
+ Contrastive Learning in VQA AAAI 2024
+
+
+ Natural language explanation in visual question answer (VQA-NLE) aims to
+explain the decision-making process of models by generating natural language
+sentences to increase users' trust in the black-box systems. Existing post-hoc
+methods have achieved significant progress in obtaining a plausible
+explanation. However, such post-hoc explanations are not always aligned with
+human logical inference, suffering from the issues on: 1) Deductive
+unsatisfiability, the generated explanations do not logically lead to the
+answer; 2) Factual inconsistency, the model falsifies its counterfactual
+explanation for answers without considering the facts in images; and 3)
+Semantic perturbation insensitivity, the model can not recognize the semantic
+changes caused by small perturbations. These problems reduce the faithfulness
+of explanations generated by models. To address the above issues, we propose a
+novel self-supervised \textbf{M}ulti-level \textbf{C}ontrastive
+\textbf{L}earning based natural language \textbf{E}xplanation model (MCLE) for
+VQA with semantic-level, image-level, and instance-level factual and
+counterfactual samples. MCLE extracts discriminative features and aligns the
+feature spaces from explanations with visual question and answer to generate
+more consistent explanations. We conduct extensive experiments, ablation
+analysis, and case study to demonstrate the effectiveness of our method on two
+VQA-NLE benchmarks.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for
+ Single Image Talking Face Generation
+
+
+ The generation of emotional talking faces from a single portrait image
+remains a significant challenge. The simultaneous achievement of expressive
+emotional talking and accurate lip-sync is particularly difficult, as
+expressiveness is often compromised for the accuracy of lip-sync. As widely
+adopted by many prior works, the LSTM network often fails to capture the
+subtleties and variations of emotional expressions. To address these
+challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven
+framework, tailored for generating diverse expressions and accurate lip-sync
+concurrently. In the first stage, we propose EmoDiff, a novel diffusion module
+that generates diverse highly dynamic emotional expressions and head poses in
+accordance with the audio and the referenced emotion style. Given the strong
+correlation between lip motion and audio, we then refine the dynamics with
+enhanced lip-sync accuracy using audio features and emotion style. To this end,
+we deploy a video-to-video rendering module to transfer the expressions and lip
+motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively
+and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of
+expressiveness, lip-sync accuracy and perceptual quality.
+
+
+
+ comment: Project Page at https://magic-research.github.io/dream-talk/
+
+ Network binarization exhibits great potential for deployment on
+resource-constrained devices due to its low computational cost. Despite the
+critical importance, the security of binarized neural networks (BNNs) is rarely
+investigated. In this paper, we present ARBiBench, a comprehensive benchmark to
+evaluate the robustness of BNNs against adversarial perturbations on CIFAR-10
+and ImageNet. We first evaluate the robustness of seven influential BNNs on
+various white-box and black-box attacks. The results reveal that 1) The
+adversarial robustness of BNNs exhibits a completely opposite performance on
+the two datasets under white-box attacks. 2) BNNs consistently exhibit better
+adversarial robustness under black-box attacks. 3) Different BNNs exhibit
+certain similarities in their robustness performance. Then, we conduct
+experiments to analyze the adversarial robustness of BNNs based on these
+insights. Our research contributes to inspiring future research on enhancing
+the robustness of BNNs and advancing their application in real-world scenarios.
+
+
+
+
+
+
+
+ ☆ The Truth is in There: Improving Reasoning in Language Models with
+ Layer-Selective Rank Reduction
+
+
+
+
+
+
+
+
+ Pratyusha Sharma, Jordan T. Ash, Dipendra Misra
+
+
+ Transformer-based Large Language Models (LLMs) have become a fixture in
+modern machine learning. Correspondingly, significant resources are allocated
+towards research that aims to further advance this technology, typically
+resulting in models of increasing size that are trained on increasing amounts
+of data. This work, however, demonstrates the surprising result that it is
+often possible to significantly improve the performance of LLMs by selectively
+removing higher-order components of their weight matrices. This simple
+intervention, which we call LAyer-SElective Rank reduction (LASER), can be done
+on a model after training has completed, and requires no additional parameters
+or data. We show extensive experiments demonstrating the generality of this
+finding across language models and datasets, and provide in-depth analyses
+offering insights into both when LASER is effective and the mechanism by which
+it operates.
+
+
+ The capacity to generalize to future unseen data stands as one of the utmost
+crucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM)
+aims to enhance the generalizability by minimizing worst-case loss using
+one-step gradient ascent as an approximation. However, as training progresses,
+the non-linearity of the loss landscape increases, rendering one-step gradient
+ascent less effective. On the other hand, multi-step gradient ascent will incur
+higher training cost. In this paper, we introduce a normalized Hessian trace to
+accurately measure the curvature of loss landscape on {\em both} training and
+test sets. In particular, to counter excessive non-linearity of loss landscape,
+we propose Curvature Regularized SAM (CR-SAM), integrating the normalized
+Hessian trace as a SAM regularizer. Additionally, we present an efficient way
+to compute the trace via finite differences with parallelism. Our theoretical
+analysis based on PAC-Bayes bounds establishes the regularizer's efficacy in
+reducing generalization error. Empirical evaluation on CIFAR and ImageNet
+datasets shows that CR-SAM consistently enhances classification performance for
+ResNet and Vision Transformer (ViT) models across various datasets. Our code is
+available at https://github.com/TrustAIoT/CR-SAM.
+
+
+
+ comment: AAAI 2024, main track
+
+
+
+
+
+
+ ☆ HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in
+ Image Editing via Hypernetworks AAAI2024
+
+
+
+
+
+
+
+
+ Hai Zhang, Chunwei Wu, Guitao Cao, Hailing Wang, Wenming Cao
+
+
+ Editing real images authentically while also achieving cross-domain editing
+remains a challenge. Recent studies have focused on converting real images into
+latent codes and accomplishing image editing by manipulating these codes.
+However, merely manipulating the latent codes would constrain the edited images
+to the generator's image domain, hindering the attainment of diverse editing
+goals. In response, we propose an innovative image editing method called
+HyperEditor, which utilizes weight factors generated by hypernetworks to
+reassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's
+cross-modal image-text semantic alignment, this innovative approach enables us
+to simultaneously accomplish authentic attribute editing and cross-domain style
+transfer, a capability not realized in previous methods. Additionally, we
+ascertain that modifying only the weights of specific layers in the generator
+can yield an equivalent editing result. Therefore, we introduce an adaptive
+layer selector, enabling our hypernetworks to autonomously identify the layers
+requiring output weight factors, which can further improve our hypernetworks'
+efficiency. Extensive experiments on abundant challenging datasets demonstrate
+the effectiveness of our method.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ SE(3)-Equivariant and Noise-Invariant 3D Motion Tracking in Medical
+ Images
+
+
+
+
+
+
+
+
+ Benjamin Billot, Daniel Moyer, Neel Dey, Malte Hoffmann, Esra Abaci Turk, Borjan Gagoski, Ellen Grant, Polina Golland
+
+
+ Rigid motion tracking is paramount in many medical imaging applications where
+movements need to be detected, corrected, or accounted for. Modern strategies
+rely on convolutional neural networks (CNN) and pose this problem as rigid
+registration. Yet, CNNs do not exploit natural symmetries in this task, as they
+are equivariant to translations (their outputs shift with their inputs) but not
+to rotations. Here we propose EquiTrack, the first method that uses recent
+steerable SE(3)-equivariant CNNs (E-CNN) for motion tracking. While steerable
+E-CNNs can extract corresponding features across different poses, testing them
+on noisy medical images reveals that they do not have enough learning capacity
+to learn noise invariance. Thus, we introduce a hybrid architecture that pairs
+a denoiser with an E-CNN to decouple the processing of anatomically irrelevant
+intensity features from the extraction of equivariant spatial features. Rigid
+transforms are then estimated in closed-form. EquiTrack outperforms
+state-of-the-art learning and optimisation methods for motion tracking in adult
+brain MRI and fetal MRI time series. Our code is available at
+github.com/BBillot/equitrack.
+
+
+
+
+
+
+
+ ☆ DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular
+ Video
+
+
+
+
+
+
+
+
+ Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, Munchurl Kim
+
+
+ Video view synthesis, allowing for the creation of visually appealing frames
+from arbitrary viewpoints and times, offers immersive viewing experiences.
+Neural radiance fields, particularly NeRF, initially developed for static
+scenes, have spurred the creation of various methods for video view synthesis.
+However, the challenge for video view synthesis arises from motion blur, a
+consequence of object or camera movement during exposure, which hinders the
+precise synthesis of sharp spatio-temporal views. In response, we propose a
+novel dynamic deblurring NeRF framework for blurry monocular video, called
+DyBluRF, consisting of an Interleave Ray Refinement (IRR) stage and a Motion
+Decomposition-based Deblurring (MDD) stage. Our DyBluRF is the first that
+addresses and handles the novel view synthesis for blurry monocular video. The
+IRR stage jointly reconstructs dynamic 3D scenes and refines the inaccurate
+camera pose information to combat imprecise pose information extracted from the
+given blurry frames. The MDD stage is a novel incremental latent sharp-rays
+prediction (ILSP) approach for the blurry monocular video frames by decomposing
+the latent sharp rays into global camera motion and local object motion
+components. Extensive experimental results demonstrate that our DyBluRF
+outperforms qualitatively and quantitatively the very recent state-of-the-art
+methods. Our project page including source codes and pretrained model are
+publicly available at https://kaist-viclab.github.io/dyblurf-site/.
+
+
+
+ comment: The first three authors contributed equally to this work. Please
+ visit our project page at https://kaist-viclab.github.io/dyblurf-site/
+
+
+
+
+
+
+ ☆ Rethinking of Feature Interaction for Multi-task Learning on Dense
+ Prediction
+
+
+
+
+
+
+
+
+ Jingdong Zhang, Jiayuan Fan, Peng Ye, Bo Zhang, Hancheng Ye, Baopu Li, Yancheng Cai, Tao Chen
+
+
+ Existing works generally adopt the encoder-decoder structure for Multi-task
+Dense Prediction, where the encoder extracts the task-generic features, and
+multiple decoders generate task-specific features for predictions. We observe
+that low-level representations with rich details and high-level representations
+with abundant task information are not both involved in the multi-task
+interaction process. Additionally, low-quality and low-efficiency issues also
+exist in current multi-task learning architectures. In this work, we propose to
+learn a comprehensive intermediate feature globally from both task-generic and
+task-specific features, we reveal an important fact that this intermediate
+feature, namely the bridge feature, is a good solution to the above issues.
+Based on this, we propose a novel Bridge-Feature-Centirc Interaction (BRFI)
+method. A Bridge Feature Extractor (BFE) is designed for the generation of
+strong bridge features and Task Pattern Propagation (TPP) is applied to ensure
+high-quality task interaction participants. Then a Task-Feature Refiner (TFR)
+is developed to refine final task predictions with the well-learned knowledge
+from the bridge features. Extensive experiments are conducted on NYUD-v2 and
+PASCAL Context benchmarks, and the superior performance shows the proposed
+architecture is effective and powerful in promoting different dense prediction
+tasks simultaneously.
+
+
+ Accurate assessment of patient actions plays a crucial role in healthcare as
+it contributes significantly to disease progression monitoring and treatment
+effectiveness. However, traditional approaches to assess patient actions often
+rely on manual observation and scoring, which are subjective and
+time-consuming. In this paper, we propose an automated approach for patient
+action assessment using a Multi-Residual Spatio Temporal Graph Network
+(MR-STGN) that incorporates both angular and positional 3D skeletons. The
+MR-STGN is specifically designed to capture the spatio-temporal dynamics of
+patient actions. It achieves this by integrating information from multiple
+residual layers, with each layer extracting features at distinct levels of
+abstraction. Furthermore, we integrate an attention fusion mechanism into the
+network, which facilitates the adaptive weighting of various features. This
+empowers the model to concentrate on the most pertinent aspects of the
+patient's movements, offering precise instructions regarding specific body
+parts or movements that require attention. Ablation studies are conducted to
+analyze the impact of individual components within the proposed model. We
+evaluate our model on the UI-PRMD dataset demonstrating its performance in
+accurately predicting real-time patient action scores, surpassing
+state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ SPDGAN: A Generative Adversarial Network based on SPD Manifold Learning
+ for Automatic Image Colorization
+
+
+ This paper addresses the automatic colorization problem, which converts a
+gray-scale image to a colorized one. Recent deep-learning approaches can
+colorize automatically grayscale images. However, when it comes to different
+scenes which contain distinct color styles, it is difficult to accurately
+capture the color characteristics. In this work, we propose a fully automatic
+colorization approach based on Symmetric Positive Definite (SPD) Manifold
+Learning with a generative adversarial network (SPDGAN) that improves the
+quality of the colorization results. Our SPDGAN model establishes an
+adversarial game between two discriminators and a generator. The latter is
+based on ResNet architecture with few alterations. Its goal is to generate fake
+colorized images without losing color information across layers through
+residual connections. Then, we employ two discriminators from different
+domains. The first one is devoted to the image pixel domain, while the second
+one is to the Riemann manifold domain which helps to avoid color misalignment.
+Extensive experiments are conducted on the Places365 and COCO-stuff databases
+to test the effect of each component of our SPDGAN. In addition, quantitative
+and qualitative comparisons with state-of-the-art methods demonstrate the
+effectiveness of our model by achieving more realistic colorized images with
+less artifacts visually, and good results of PSNR, SSIM, and FID values.
+
+
+
+
+
+
+
+ ☆ InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large
+ Multimodal and Language Models
+
+
+
+
+
+
+
+
+ Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang
+
+
+ In this paper, we build a visual dialogue dataset, named InfoVisDial, which
+provides rich informative answers in each round even with external knowledge
+related to the visual content. Different from existing datasets where the
+answer is compact and short, InfoVisDial contains long free-form answers with
+rich information in each round of dialogue. For effective data collection, the
+key idea is to bridge the large-scale multimodal model (e.g., GIT) and the
+language models (e.g., GPT-3). GIT can describe the image content even with
+scene text, while GPT-3 can generate informative dialogue based on the image
+description and appropriate prompting techniques. With such automatic pipeline,
+we can readily generate informative visual dialogue data at scale. Then, we ask
+human annotators to rate the generated dialogues to filter the low-quality
+conversations.Human analyses show that InfoVisDial covers informative and
+diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image
+scene texts, and $36.7\%$ require external knowledge. Each round's answer is
+also long and open-ended: $87.3\%$ of answers are unique with an average length
+of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Last, we propose a
+strong baseline by adapting the GIT model for the visual dialogue task and
+fine-tune the model on InfoVisDial. Hopefully, our work can motivate more
+effort on this direction.
+
+
+ In a privacy-focused era, Federated Learning (FL) has emerged as a promising
+machine learning technique. However, most existing FL studies assume that the
+data distribution remains nearly fixed over time, while real-world scenarios
+often involve dynamic and continual changes. To equip FL systems with continual
+model evolution capabilities, we focus on an important problem called Federated
+Continual Novel Class Learning (FedCN) in this work. The biggest challenge in
+FedCN is to merge and align novel classes that are discovered and learned by
+different clients without compromising privacy. To address this, we propose a
+Global Alignment Learning (GAL) framework that can accurately estimate the
+global novel class number and provide effective guidance for local training
+from a global perspective, all while maintaining privacy protection.
+Specifically, GAL first locates high-density regions in the representation
+space through a bi-level clustering mechanism to estimate the novel class
+number, with which the global prototypes corresponding to novel classes can be
+constructed. Then, GAL uses a novel semantic weighted loss to capture all
+possible correlations between these prototypes and the training data for
+mitigating the impact of pseudo-label noise and data heterogeneity. Extensive
+experiments on various datasets demonstrate GAL's superior performance over
+state-of-the-art novel class discovery methods. In particular, GAL achieves
+significant improvements in novel-class performance, increasing the accuracy by
+5.1% to 10.6% in the case of one novel class learning stage and by 7.8% to
+17.9% in the case of two novel class learning stages, without sacrificing
+known-class performance. Moreover, GAL is shown to be effective in equipping a
+variety of different mainstream FL algorithms with novel class discovery and
+learning capability, highlighting its potential for many real-world
+applications.
+
+
+
+
+
+
+
+
+ David Nakath, Xiangyu Weng, Mengkun She, Kevin Köser
+
+
+ When created faithfully from real-world data, Digital 3D representations of
+objects can be useful for human or computer-assisted analysis. Such models can
+also serve for generating training data for machine learning approaches in
+settings where data is difficult to obtain or where too few training data
+exists, e.g. by providing novel views or images in varying conditions. While
+the vast amount of visual 3D reconstruction approaches focus on non-physical
+models, textured object surfaces or shapes, in this contribution we propose a
+volumetric reconstruction approach that obtains a physical model including the
+interior of partially translucent objects such as plankton or insects. Our
+technique photographs the object under different poses in front of a bright
+white light source and computes absorption and scattering per voxel. It can be
+interpreted as visual tomography that we solve by inverse raytracing. We
+additionally suggest a method to convert non-physical NeRF media into a
+physically-based volumetric grid for initialization and illustrate the
+usefulness of the approach using two real-world plankton validation sets, the
+lab-scanned models being finally also relighted and virtually submerged in a
+scenario with augmented medium and illumination conditions. Please visit the
+project homepage at www.marine.informatik.uni-kiel.de/go/vito
+
+
+
+ comment: Accepted for publication at 3DV '24
+
+
+
+
+
+
+ ☆ Autoencoder Based Face Verification System
+
+
+ The primary objective of this work is to present an alternative approach
+aimed at reducing the dependency on labeled data. Our proposed method involves
+utilizing autoencoder pre-training within a face image recognition task with
+two step processes. Initially, an autoencoder is trained in an unsupervised
+manner using a substantial amount of unlabeled training dataset. Subsequently,
+a deep learning model is trained with initialized parameters from the
+pre-trained autoencoder. This deep learning training process is conducted in a
+supervised manner, employing relatively limited labeled training dataset.
+During evaluation phase, face image embeddings is generated as the output of
+deep neural network layer. Our training is executed on the CelebA dataset,
+while evaluation is performed using benchmark face recognition datasets such as
+Labeled Faces in the Wild (LFW) and YouTube Faces (YTF). Experimental results
+demonstrate that by initializing the deep neural network with pre-trained
+autoencoder parameters achieve comparable results to state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ Fine-grained Forecasting Models Via Gaussian Process Blurring Effect
+
+
+ Time series forecasting is a challenging task due to the existence of complex
+and dynamic temporal dependencies. This can lead to incorrect predictions by
+even the best forecasting models. Using more training data is one way to
+improve the accuracy, but this source is often limited. In contrast, we are
+building on successful denoising approaches for image generation by advocating
+for an end-to-end forecasting and denoising paradigm.
+ We propose an end-to-end forecast-blur-denoise forecasting framework by
+encouraging a division of labors between the forecasting and the denoising
+models. The initial forecasting model is directed to focus on accurately
+predicting the coarse-grained behavior, while the denoiser model focuses on
+capturing the fine-grained behavior that is locally blurred by integrating a
+Gaussian Process model. All three parts are interacting for the best end-to-end
+performance. Our extensive experiments demonstrate that our proposed approach
+is able to improve the forecasting accuracy of several state-of-the-art
+forecasting models as well as several other denoising approaches.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce
+ Lidar
+
+
+ 3D reconstruction from a single-view is challenging because of the ambiguity
+from monocular cues and lack of information about occluded regions. Neural
+radiance fields (NeRF), while popular for view synthesis and 3D reconstruction,
+are typically reliant on multi-view images. Existing methods for single-view 3D
+reconstruction with NeRF rely on either data priors to hallucinate views of
+occluded regions, which may not be physically accurate, or shadows observed by
+RGB cameras, which are difficult to detect in ambient light and low albedo
+backgrounds. We propose using time-of-flight data captured by a single-photon
+avalanche diode to overcome these limitations. Our method models two-bounce
+optical paths with NeRF, using lidar transient data for supervision. By
+leveraging the advantages of both NeRF and two-bounce light measured by lidar,
+we demonstrate that we can reconstruct visible and occluded geometry without
+data priors or reliance on controlled ambient lighting or scene albedo. In
+addition, we demonstrate improved generalization under practical constraints on
+sensor spatial- and temporal-resolution. We believe our method is a promising
+direction as single-photon lidars become ubiquitous on consumer devices, such
+as phones, tablets, and headsets.
+
+
+
+
+
+
+
+ ☆ InternVL: Scaling up Vision Foundation Models and Aligning for Generic
+ Visual-Linguistic Tasks
+
+
+
+
+
+
+
+
+ Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
+
+
+ The exponential growth of large language models (LLMs) has opened up numerous
+possibilities for multi-modal AGI systems. However, the progress in vision and
+vision-language foundation models, which are also critical elements of
+multi-modal AGI, has not kept pace with LLMs. In this work, we design a
+large-scale vision-language foundation model (InternVL), which scales up the
+vision foundation model to 6 billion parameters and progressively aligns it
+with the large language model, using web-scale image-text data from various
+sources. This model can be broadly applied to and achieve state-of-the-art
+performance on visual perception tasks such as image-level or pixel-level
+recognition, vision-language tasks such as zero-shot image/video
+classification, zero-shot image/video-text retrieval, and link with LLMs to
+create multi-modal dialogue systems. We hope that our research could contribute
+to the development of multi-modal large models. Code and models are available
+at https://github.com/OpenGVLab/InternVL.
+
+
+
+ comment: 25 pages, 5 figures, 28 tables
+
+
+
+
+
+
+ ☆ Neural Spline Fields for Burst Image Fusion and Layer Separation
+
+
+
+
+
+
+
+
+ Ilya Chugunov, David Shustin, Ruyu Yan, Chenyang Lei, Felix Heide
+
+
+ Each photo in an image burst can be considered a sample of a complex 3D
+scene: the product of parallax, diffuse and specular materials, scene motion,
+and illuminant variation. While decomposing all of these effects from a stack
+of misaligned images is a highly ill-conditioned task, the conventional
+align-and-merge burst pipeline takes the other extreme: blending them into a
+single image. In this work, we propose a versatile intermediate representation:
+a two-layer alpha-composited image plus flow model constructed with neural
+spline fields -- networks trained to map input coordinates to spline control
+points. Our method is able to, during test-time optimization, jointly fuse a
+burst image capture into one high-resolution reconstruction and decompose it
+into transmission and obstruction layers. Then, by discarding the obstruction
+layer, we can perform a range of tasks including seeing through occlusions,
+reflection suppression, and shadow removal. Validated on complex synthetic and
+in-the-wild captures we find that, with no post-processing steps or learned
+priors, our generalizable model is able to outperform existing dedicated
+single-image and multi-view obstruction removal approaches.
+
+
+ Humans possess the remarkable skill of Visual Perception, the ability to see
+and understand the seen, helping them make sense of the visual world and, in
+turn, reason. Multimodal Large Language Models (MLLM) have recently achieved
+impressive performance on vision-language tasks ranging from visual
+question-answering and image captioning to visual reasoning and image
+generation. However, when prompted to identify or count (perceive) the entities
+in a given image, existing MLLM systems fail. Working towards developing an
+accurate MLLM system for perception and reasoning, we propose using Versatile
+vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the
+VCoder with perception modalities such as segmentation or depth maps, improving
+the MLLM's perception abilities. Secondly, we leverage the images from COCO and
+outputs from off-the-shelf vision perception models to create our COCO
+Segmentation Text (COST) dataset for training and evaluating MLLMs on the
+object perception task. Thirdly, we introduce metrics to assess the object
+perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive
+experimental evidence proving the VCoder's improved object-level perception
+skills over existing Multimodal LLMs, including GPT-4V. We open-source our
+dataset, code, and models to promote research. We open-source our code at
+https://github.com/SHI-Labs/VCoder
+
+
+
+
+
+
+
+ ☆ Parrot Captions Teach CLIP to Spot Text
+
+
+
+
+
+
+
+
+ Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
+
+
+ Despite CLIP being the foundation model in numerous vision-language
+applications, the CLIP suffers from a severe text spotting bias. Such bias
+causes CLIP models to `Parrot' the visual text embedded within images while
+disregarding the authentic visual semantics. We uncover that in the most
+popular image-text dataset LAION-2B, the captions also densely parrot (spell)
+the text embedded in images. Our analysis shows that around \textbf{50\%} of
+images are embedded with visual text content, and \textbf{90\%} of their
+captions more or less parrot the visual text. Based on such observation, we
+thoroughly inspect the different release d versions of CLIP models and verify
+that the visual text is the dominant factor in measuring the LAION-style
+image-text similarity for these models. To examine whether these parrot
+captions shape the text spotting bias, we train a series of CLIP models with
+LAION subsets curated by different parrot-caption-oriented criteria. We show
+that training with parrot captions easily shapes such bias but harms the
+expected visual-language representation learning in CLIP models. This suggests
+that it is urgent to revisit either the design of CLIP-like models or the
+existing image-text dataset curation pipeline built on CLIP score filtering.
+
+
+ Shortcut learning is when a model -- e.g. a cardiac disease classifier --
+exploits correlations between the target label and a spurious shortcut feature,
+e.g. a pacemaker, to predict the target label based on the shortcut rather than
+real discriminative features. This is common in medical imaging, where
+treatment and clinical annotations correlate with disease labels, making them
+easy shortcuts to predict disease. We propose a novel detection and
+quantification of the impact of potential shortcut features via a fast
+diffusion-based counterfactual image generation that can synthetically remove
+or add shortcuts. Via a novel inpainting-based modification we spatially limit
+the changes made with no extra inference step, encouraging the removal of
+spatially constrained shortcut features while ensuring that the shortcut-free
+counterfactuals preserve their remaining image features to a high degree. Using
+these, we assess how shortcut features influence model predictions.
+ This is enabled by our second contribution: An efficient diffusion-based
+counterfactual explanation method with significant inference speed-up at
+comparable image quality as state-of-the-art. We confirm this on two large
+chest X-ray datasets, a skin lesion dataset, and CelebA.
+
+
+ Deep Neural Networks (DNNs) are widely acknowledged to be susceptible to
+adversarial examples, wherein imperceptible perturbations are added to clean
+examples through diverse input transformation attacks. However, these methods
+originally designed for non-targeted attacks exhibit low success rates in
+targeted attacks. Recent targeted adversarial attacks mainly pay attention to
+gradient optimization, attempting to find the suitable perturbation direction.
+However, few of them are dedicated to input transformation.In this work, we
+observe a positive correlation between the logit/probability of the target
+class and diverse input transformation methods in targeted attacks. To this
+end, we propose a novel targeted adversarial attack called AutoAugment Input
+Transformation (AAIT). Instead of relying on hand-made strategies, AAIT
+searches for the optimal transformation policy from a transformation space
+comprising various operations. Then, AAIT crafts adversarial examples using the
+found optimal transformation policy to boost the adversarial transferability in
+targeted attacks. Extensive experiments conducted on CIFAR-10 and
+ImageNet-Compatible datasets demonstrate that the proposed AAIT surpasses other
+transfer-based targeted attacks significantly.
+
+
+ Open-vocabulary image segmentation aims to partition an image into semantic
+regions according to arbitrary text descriptions. However, complex visual
+scenes can be naturally decomposed into simpler parts and abstracted at
+multiple levels of granularity, introducing inherent segmentation ambiguity.
+Unlike existing methods that typically sidestep this ambiguity and treat it as
+an external factor, our approach actively incorporates a hierarchical
+representation encompassing different semantic-levels into the learning
+process. We propose a decoupled text-image fusion mechanism and representation
+learning modules for both "things" and "stuff". Additionally, we systematically
+examine the differences that exist in the textual and visual features between
+these types of categories. Our resulting model, named HIPIE, tackles
+HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a
+unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,
+Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the
+state-of-the-art results at various levels of image comprehension, including
+semantic-level (e.g., semantic segmentation), instance-level (e.g.,
+panoptic/referring segmentation and object detection), as well as part-level
+(e.g., part/subpart segmentation) tasks. Our code is released at
+https://github.com/berkeley-hipie/HIPIE.
+
+
+
+
+
+
+
+
+ Yuming Gu, Hongyi Xu, You Xie, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Linjie Luo
+
+
+ We present DiffPortrait3D, a conditional diffusion model that is capable of
+synthesizing 3D-consistent photo-realistic novel views from as few as a single
+in-the-wild portrait. Specifically, given a single RGB input, we aim to
+synthesize plausible but consistent facial details rendered from novel camera
+views with retained both identity and facial expression. In lieu of
+time-consuming optimization and fine-tuning, our zero-shot method generalizes
+well to arbitrary face portraits with unposed camera views, extreme facial
+expressions, and diverse artistic depictions. At its core, we leverage the
+generative prior of 2D diffusion models pre-trained on large-scale image
+datasets as our rendering backbone, while the denoising is guided with
+disentangled attentive control of appearance and camera pose. To achieve this,
+we first inject the appearance context from the reference image into the
+self-attention layers of the frozen UNets. The rendering view is then
+manipulated with a novel conditional control module that interprets the camera
+pose by watching a condition image of a crossed subject from the same view.
+Furthermore, we insert a trainable cross-view attention module to enhance view
+consistency, which is further strengthened with a novel 3D-aware noise
+generation process during inference. We demonstrate state-of-the-art results
+both qualitatively and quantitatively on our challenging in-the-wild and
+multi-view benchmarks.
+
+
+
+
+
+
+
+ ♻ ☆ Image Captioners Are Scalable Vision Learners Too NeurIPS 2023
+
+
+
+
+
+
+
+
+ Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
+
+
+ Contrastive pretraining on image-text pairs from the web is one of the most
+popular large-scale pretraining strategies for vision backbones, especially in
+the context of large multimodal models. At the same time, image captioning on
+this type of data is commonly considered an inferior pretraining strategy. In
+this paper, we perform a fair comparison of these two pretraining strategies,
+carefully matching training data, compute, and model capacity. Using a standard
+encoder-decoder transformer, we find that captioning alone is surprisingly
+effective: on classification tasks, captioning produces vision encoders
+competitive with contrastively pretrained encoders, while surpassing them on
+vision & language tasks. We further analyze the effect of the model
+architecture and scale, as well as the pretraining data on the representation
+quality, and find that captioning exhibits the same or better scaling behavior
+along these axes. Overall our results show that plain image captioning is a
+more powerful pretraining strategy than was previously believed.
+
+
+
+ comment: Accepted at NeurIPS 2023. v2 adds SugarCrepe results and more
+ ablations, v3 has minor fixes. v4 adds a code link (
+ https://github.com/google-research/big_vision ). v5 has minor fixes
+
+ Predicting turn-taking in multiparty conversations has many practical
+applications in human-computer/robot interaction. However, the complexity of
+human communication makes it a challenging task. Recent advances have shown
+that synchronous multi-perspective egocentric data can significantly improve
+turn-taking prediction compared to asynchronous, single-perspective
+transcriptions. Building on this research, we propose a new multimodal
+transformer-based architecture for predicting turn-taking in embodied,
+synchronized multi-perspective data. Our experimental results on the recently
+introduced EgoCom dataset show a substantial performance improvement of up to
+14.01% on average compared to existing baselines and alternative
+transformer-based approaches. The source code, and the pre-trained models of
+our 3M-Transformer will be available upon acceptance.
+
+
+
+ comment: Accepted to ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Unifying GANs and Score-Based Diffusion as Generative Particle Models
+
+
+
+
+
+
+
+
+ Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy
+
+
+ Particle-based deep generative models, such as gradient flows and score-based
+diffusion models, have recently gained traction thanks to their striking
+performance. Their principle of displacing particle distributions using
+differential equations is conventionally seen as opposed to the previously
+widespread generative adversarial networks (GANs), which involve training a
+pushforward generator network. In this paper we challenge this interpretation,
+and propose a novel framework that unifies particle and adversarial generative
+models by framing generator training as a generalization of particle models.
+This suggests that a generator is an optional addition to any such generative
+model. Consequently, integrating a generator into a score-based diffusion model
+and training a GAN without a generator naturally emerge from our framework. We
+empirically test the viability of these original models as proofs of concepts
+of potential applications of our framework.
+
+
+
+
+
+
+
+ ♻ ☆ ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical
+ Prior Probability Maps for Thoracic Disease Classification
+
+
+
+
+
+
+
+
+ Md. Iqbal Hossain, Mohammad Zunaed, Md. Kawsar Ahmed, S. M. Jawwad Hossain, Anwarul Hasan, Taufiq Hasan
+
+
+ Objective: Computer-aided disease diagnosis and prognosis based on medical
+images is a rapidly emerging field. Many Convolutional Neural Network (CNN)
+architectures have been developed by researchers for disease classification and
+localization from chest X-ray images. It is known that different thoracic
+disease lesions are more likely to occur in specific anatomical regions
+compared to others. This article aims to incorporate this disease and
+region-dependent prior probability distribution within a deep learning
+framework. Methods: We present the ThoraX-PriorNet, a novel attention-based CNN
+model for thoracic disease classification. We first estimate a
+disease-dependent spatial probability, i.e., an anatomical prior, that
+indicates the probability of occurrence of a disease in a specific region in a
+chest X-ray image. Next, we develop a novel attention-based classification
+model that combines information from the estimated anatomical prior and
+automatically extracted chest region of interest (ROI) masks to provide
+attention to the feature maps generated from a deep convolution network. Unlike
+previous works that utilize various self-attention mechanisms, the proposed
+method leverages the extracted chest ROI masks along with the probabilistic
+anatomical prior information, which selects the region of interest for
+different diseases to provide attention. Results: The proposed method shows
+superior performance in disease classification on the NIH ChestX-ray14 dataset
+compared to existing state-of-the-art methods while reaching an area under the
+ROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior
+attention method shows competitive performance compared to state-of-the-art
+methods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04
+with an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5,
+0.6, and 0.7, respectively.
+
+
+
+ comment: Accepted to IEEE ACCESS
+
+
+
+
+
+
+ ♻ ☆ Estimating Generic 3D Room Structures from 2D Annotations NeurIPS 2023
+
+
+
+
+
+
+
+
+ Denys Rozumnyi, Stefan Popov, Kevis-Kokitsi Maninis, Matthias Nießner, Vittorio Ferrari
+
+
+ Indoor rooms are among the most common use cases in 3D scene understanding.
+Current state-of-the-art methods for this task are driven by large annotated
+datasets. Room layouts are especially important, consisting of structural
+elements in 3D, such as wall, floor, and ceiling. However, they are difficult
+to annotate, especially on pure RGB video. We propose a novel method to produce
+generic 3D room layouts just from 2D segmentation masks, which are easy to
+annotate for humans. Based on these 2D annotations, we automatically
+reconstruct 3D plane equations for the structural elements and their spatial
+extent in the scene, and connect adjacent elements at the appropriate contact
+edges. We annotate and publicly release 2246 3D room layouts on the
+RealEstate10k dataset, containing YouTube videos. We demonstrate the high
+quality of these 3D layouts annotations with extensive experiments.
+
+
+
+ comment: https://github.com/google-research/cad-estate Accepted at 37th
+ Conference on Neural Information Processing Systems (NeurIPS 2023) Track on
+ Datasets and Benchmarks
+
+
+
+
+
+
+
+ Luan Wei, Anna Hilsmann, Peter Eisert
+
+
+ Piece-wise planar 3D reconstruction simultaneously segments plane instances
+and recovers their 3D plane parameters from an image, which is particularly
+useful for indoor or man-made environments. Efficient reconstruction of 3D
+planes coupled with semantic predictions offers advantages for a wide range of
+applications requiring scene understanding and concurrent spatial mapping.
+However, most existing planar reconstruction models either neglect semantic
+predictions or do not run efficiently enough for real-time applications. We
+introduce SOLOPlanes, a real-time planar reconstruction model based on a
+modified instance segmentation architecture which simultaneously predicts
+semantics for each plane instance, along with plane parameters and piece-wise
+plane instance masks. We achieve an improvement in instance mask segmentation
+by including multi-view guidance for plane predictions in the training process.
+This cross-task improvement, training for plane prediction but improving the
+mask segmentation, is due to the nature of feature sharing in multi-task
+learning. Our model simultaneously predicts semantics using single images at
+inference time, while achieving real-time predictions at 43 FPS.
+
+
+
+ comment: For code, see https://github.com/fraunhoferhhi/SOLOPlanes
+
+
+
+
+
+
+ ♻ ☆ Invariant Learning via Probability of Sufficient and Necessary Causes
+
+
+
+
+
+
+
+
+ Mengyue Yang, Zhen Fang, Yonggang Zhang, Yali Du, Furui Liu, Jean-Francois Ton, Jianhong Wang, Jun Wang
+
+
+ Out-of-distribution (OOD) generalization is indispensable for learning models
+in the wild, where testing distribution typically unknown and different from
+the training. Recent methods derived from causality have shown great potential
+in achieving OOD generalization. However, existing methods mainly focus on the
+invariance property of causes, while largely overlooking the property of
+\textit{sufficiency} and \textit{necessity} conditions. Namely, a necessary but
+insufficient cause (feature) is invariant to distribution shift, yet it may not
+have required accuracy. By contrast, a sufficient yet unnecessary cause
+(feature) tends to fit specific data well but may have a risk of adapting to a
+new domain. To capture the information of sufficient and necessary causes, we
+employ a classical concept, the probability of sufficiency and necessary causes
+(PNS), which indicates the probability of whether one is the necessary and
+sufficient cause. To associate PNS with OOD generalization, we propose PNS risk
+and formulate an algorithm to learn representation with a high PNS value. We
+theoretically analyze and prove the generalizability of the PNS risk.
+Experiments on both synthetic and real-world benchmarks demonstrate the
+effectiveness of the proposed method. The details of the implementation can be
+found at the GitHub repository: https://github.com/ymy4323460/CaSN.
+
+
+
+
+
+
+
+ ♻ ☆ Fair GANs through model rebalancing for extremely imbalanced class
+ distributions
+
+
+ Deep generative models require large amounts of training data. This often
+poses a problem as the collection of datasets can be expensive and difficult,
+in particular datasets that are representative of the appropriate underlying
+distribution (e.g. demographic). This introduces biases in datasets which are
+further propagated in the models. We present an approach to construct an
+unbiased generative adversarial network (GAN) from an existing biased GAN by
+rebalancing the model distribution. We do so by generating balanced data from
+an existing imbalanced deep generative model using an evolutionary algorithm
+and then using this data to train a balanced generative model. Additionally, we
+propose a bias mitigation loss function that minimizes the deviation of the
+learned class distribution from being equiprobable. We show results for the
+StyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset
+for racial fairness and see that the proposed approach improves on the fairness
+metric by almost 5 times, whilst maintaining image quality. We further validate
+our approach by applying it to an imbalanced CIFAR10 dataset where we show that
+we can obtain comparable fairness and image quality as when training on a
+balanced CIFAR10 dataset which is also twice as large. Lastly, we argue that
+the traditionally used image quality metrics such as Frechet inception distance
+(FID) are unsuitable for scenarios where the class distributions are imbalanced
+and a balanced reference set is not available.
+
+
+
+
+
+
+
+ ♻ ☆ Limitations of Face Image Generation AAAI
+
+
+
+
+
+
+
+
+ Harrison Rosenberg, Shimaa Ahmed, Guruprasad V Ramesh, Ramya Korlakai Vinayak, Kassem Fawaz
+
+
+ Text-to-image diffusion models have achieved widespread popularity due to
+their unprecedented image generation capability. In particular, their ability
+to synthesize and modify human faces has spurred research into using generated
+face images in both training data augmentation and model performance
+assessments. In this paper, we study the efficacy and shortcomings of
+generative models in the context of face generation. Utilizing a combination of
+qualitative and quantitative measures, including embedding-based metrics and
+user studies, we present a framework to audit the characteristics of generated
+faces conditioned on a set of social attributes. We applied our framework on
+faces generated through state-of-the-art text-to-image diffusion models. We
+identify several limitations of face image generation that include faithfulness
+to the text prompt, demographic disparities, and distributional shifts.
+Furthermore, we present an analytical model that provides insights into how
+training data selection contributes to the performance of generative models.
+
+
+
+ comment: Accepted to The 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI 2024)
+
+ Denoising Diffusion models have exhibited remarkable capabilities in image
+generation. However, generating high-quality samples requires a large number of
+iterations. Knowledge distillation for diffusion models is an effective method
+to address this limitation with a shortened sampling process but causes
+degraded generative quality. Based on our analysis with bias-variance
+decomposition and experimental observations, we attribute the degradation to
+the spatial fitting error occurring in the training of both the teacher and
+student model. Accordingly, we propose $\textbf{S}$patial
+$\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction
+$\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD utilizes attention
+guidance from the teacher model and a designed semantic gradient predictor to
+reduce the student's fitting error. Empirically, our proposed model facilitates
+high-quality sample generation in a few function evaluations. We achieve an FID
+of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step,
+outperforming existing diffusion methods. Our study provides a new perspective
+on diffusion distillation by highlighting the intrinsic denoising ability of
+models. Project link: \url{https://github.com/Sainzerjj/SFERD}.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Towards domain-invariant Self-Supervised Learning with Batch Styles
+ Standardization
+
+
+ In Self-Supervised Learning (SSL), models are typically pretrained,
+fine-tuned, and evaluated on the same domains. However, they tend to perform
+poorly when evaluated on unseen domains, a challenge that Unsupervised Domain
+Generalization (UDG) seeks to address. Current UDG methods rely on domain
+labels, which are often challenging to collect, and domain-specific
+architectures that lack scalability when confronted with numerous domains,
+making the current methodology impractical and rigid. Inspired by
+contrastive-based UDG methods that mitigate spurious correlations by
+restricting comparisons to examples from the same domain, we hypothesize that
+eliminating style variability within a batch could provide a more convenient
+and flexible way to reduce spurious correlations without requiring domain
+labels. To verify this hypothesis, we introduce Batch Styles Standardization
+(BSS), a relatively simple yet powerful Fourier-based method to standardize the
+style of images in a batch specifically designed for integration with SSL
+methods to tackle UDG. Combining BSS with existing SSL methods offers serious
+advantages over prior UDG methods: (1) It eliminates the need for domain labels
+or domain-specific network components to enhance domain-invariance in SSL
+representations, and (2) offers flexibility as BSS can be seamlessly integrated
+with diverse contrastive-based but also non-contrastive-based SSL methods.
+Experiments on several UDG datasets demonstrate that it significantly improves
+downstream task performances on unseen domains, often outperforming or rivaling
+with UDG methods. Finally, this work clarifies the underlying mechanisms
+contributing to BSS's effectiveness in improving domain-invariance in SSL
+representations and performances on unseen domain.
+
+
+
+
+
+
+
+
+ Vibhas K. Vats, Sripad Joshi, David J. Crandall, Md. Alimoor Reza, Soon-heung Jung
+
+
+ Traditional multi-view stereo (MVS) methods rely heavily on photometric and
+geometric consistency constraints, but newer machine learning-based MVS methods
+check geometric consistency across multiple source views only as a
+post-processing step. In this paper, we present a novel approach that
+explicitly encourages geometric consistency of reference view depth maps across
+multiple source views at different scales during learning (see Fig. 1). We find
+that adding this geometric consistency loss significantly accelerates learning
+by explicitly penalizing geometrically inconsistent pixels, reducing the
+training iteration requirements to nearly half that of other MVS methods. Our
+extensive experiments show that our approach achieves a new state-of-the-art on
+the DTU and BlendedMVS datasets, and competitive results on the Tanks and
+Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt
+to enforce multi-view, multi-scale geometric consistency during learning.
+
+
+
+ comment: Accepted in WACV 2024 Link:
+ https://openaccess.thecvf.com/content/WACV2024/html/Vats_GC-MVSNet_Multi-View_Multi-Scale_Geometrically-Consistent_Multi-View_Stereo_WACV_2024_paper.html
+
+
+
+
+
+
+
+ Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
+
+
+ Due to the high price and heavy energy consumption of GPUs, deploying deep
+models on IoT devices such as microcontrollers makes significant contributions
+for ecological AI. Conventional methods successfully enable convolutional
+neural network inference of high resolution images on microcontrollers, while
+the framework for vision transformers that achieve the state-of-the-art
+performance in many vision applications still remains unexplored. In this
+paper, we propose a hardware-algorithm co-optimizations method called MCUFormer
+to deploy vision transformers on microcontrollers with extremely limited
+memory, where we jointly design transformer architecture and construct the
+inference operator library to fit the memory resource constraint. More
+specifically, we generalize the one-shot network architecture search (NAS) to
+discover the optimal architecture with highest task performance given the
+memory budget from the microcontrollers, where we enlarge the existing search
+space of vision transformers by considering the low-rank decomposition
+dimensions and patch resolution for memory reduction. For the construction of
+the inference operator library of vision transformers, we schedule the memory
+buffer during inference through operator integration, patch embedding
+decomposition, and token overwriting, allowing the memory buffer to be fully
+utilized to adapt to the forward pass of the vision transformer. Experimental
+results demonstrate that our MCUFormer achieves 73.62\% top-1 accuracy on
+ImageNet for image classification with 320KB memory on STM32F746
+microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.
+
+
+
+ comment: Accepted by NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Repaint123: Fast and High-quality One Image to 3D Generation with
+ Progressive Controllable 2D Repainting
+
+
+ Recent one image to 3D generation methods commonly adopt Score Distillation
+Sampling (SDS). Despite the impressive results, there are multiple deficiencies
+including multi-view inconsistency, over-saturated and over-smoothed textures,
+as well as the slow generation speed. To address these deficiencies, we present
+Repaint123 to alleviate multi-view bias as well as texture degradation and
+speed up the generation process. The core idea is to combine the powerful image
+generation capability of the 2D diffusion model and the texture alignment
+ability of the repainting strategy for generating high-quality multi-view
+images with consistency. We further propose visibility-aware adaptive
+repainting strength for overlap regions to enhance the generated image quality
+in the repainting process. The generated high-quality and multi-view consistent
+images enable the use of simple Mean Square Error (MSE) loss for fast 3D
+content generation. We conduct extensive experiments and show that our method
+has a superior ability to generate high-quality 3D content with multi-view
+consistency and fine textures in 2 minutes from scratch. Our webpage is
+available at https://junwuzhang19.github.io/repaint123/.
+
+
+ The past decade has witnessed the rapid development of ML and DL
+methodologies in agricultural systems, showcased by great successes in variety
+of agricultural applications. However, these conventional ML/DL models have
+certain limitations: They heavily rely on large, costly-to-acquire labeled
+datasets for training, require specialized expertise for development and
+maintenance, and are mostly tailored for specific tasks, thus lacking
+generalizability. Recently, foundation models have demonstrated remarkable
+successes in language and vision tasks across various domains. These models are
+trained on a vast amount of data from multiple domains and modalities. Once
+trained, they can accomplish versatile tasks with just minor fine-tuning and
+minimal task-specific labeled data. Despite their proven effectiveness and huge
+potential, there has been little exploration of applying FMs to agriculture
+fields. Therefore, this study aims to explore the potential of FMs in the field
+of smart agriculture. In particular, we present conceptual tools and technical
+background to facilitate the understanding of the problem space and uncover new
+research directions in this field. To this end, we first review recent FMs in
+the general computer science domain and categorize them into four categories:
+language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs.
+Subsequently, we outline the process of developing agriculture FMs and discuss
+their potential applications in smart agriculture. We also discuss the unique
+challenges associated with developing AFMs, including model training,
+validation, and deployment. Through this study, we contribute to the
+advancement of AI in agriculture by introducing AFMs as a promising paradigm
+that can significantly mitigate the reliance on extensive labeled datasets and
+enhance the efficiency, effectiveness, and generalization of agricultural AI
+systems.
+
+
+
+ comment: 16 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ Classification of Single Tree Decay Stages from Combined Airborne LiDAR
+ Data and CIR Imagery
+
+
+ Understanding forest health is of great importance for the conservation of
+the integrity of forest ecosystems. In this regard, evaluating the amount and
+quality of dead wood is of utmost interest as they are favorable indicators of
+biodiversity. Apparently, remote sensing-based machine learning techniques have
+proven to be more efficient and sustainable with unprecedented accuracy in
+forest inventory. This study, for the first time, automatically categorizing
+individual coniferous trees (Norway spruce) into five decay stages (live,
+declining, dead, loose bark, and clean) from combined airborne laser scanning
+(ALS) point clouds and color infrared (CIR) images using three different
+Machine Learning methods - 3D point cloud-based deep learning (KPConv),
+Convolutional Neural Network (CNN), and Random Forest (RF). First, CIR
+colorized point clouds are created by fusing the ALS point clouds and color
+infrared images. Then, individual tree segmentation is conducted, after which
+the results are further projected onto four orthogonal planes. Finally, the
+classification is conducted on the two datasets (3D multispectral point clouds
+and 2D projected images) based on the three Machine Learning algorithms. All
+models achieved promising results, reaching overall accuracy (OA) of up to
+88.8%, 88.4% and 85.9% for KPConv, CNN and RF, respectively. The experimental
+results reveal that color information, 3D coordinates, and intensity of point
+clouds have significant impact on the promising classification performance. The
+performance of our models, therefore, shows the significance of machine/deep
+learning for individual tree decay stages classification and landscape-wide
+assessment of the dead wood amount and quality by using modern airborne remote
+sensing techniques. The proposed method can contribute as an important and
+reliable tool for monitoring biodiversity in forest ecosystems.
+
+
+
+
+
+
+
+ ♻ ☆ A Survey of Reasoning with Foundation Models: Concepts, Methodologies,
+ and Outlook
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+
+
+
+
+
+
+ Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
+
+
+ In this study, we aim to extend the capabilities of diffusion-based
+text-to-image (T2I) generation models by incorporating diverse modalities
+beyond textual description, such as sketch, box, color palette, and style
+embedding, within a single model. We thus design a multimodal T2I diffusion
+model, coined as DiffBlender, by separating the channels of conditions into
+three types, i.e., image forms, spatial tokens, and non-spatial tokens. The
+unique architecture of DiffBlender facilitates adding new input modalities,
+pioneering a scalable framework for conditional image generation. Notably, we
+achieve this without altering the parameters of the existing generative model,
+Stable Diffusion, only with updating partial components. Our study establishes
+new benchmarks in multimodal generation through quantitative and qualitative
+comparisons with existing conditional generation methods. We demonstrate that
+DiffBlender faithfully blends all the provided information and showcase its
+various applications in the detailed image synthesis.
+
+
+ Text-to-image generation has recently witnessed remarkable achievements. We
+introduce a text-conditional image diffusion model, termed RAPHAEL, to generate
+highly artistic images, which accurately portray the text prompts, encompassing
+multiple nouns, adjectives, and verbs. This is achieved by stacking tens of
+mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling
+billions of diffusion paths (routes) from the network input to the output. Each
+path intuitively functions as a "painter" for depicting a particular textual
+concept onto a specified image region at a diffusion timestep. Comprehensive
+experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as
+Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both
+image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior
+performance in switching images across diverse styles, such as Japanese comics,
+realism, cyberpunk, and ink illustration. Secondly, a single model with three
+billion parameters, trained on 1,000 A100 GPUs for two months, achieves a
+state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore,
+RAPHAEL significantly surpasses its counterparts in human evaluation on the
+ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the
+frontiers of image generation research in both academia and industry, paving
+the way for future breakthroughs in this rapidly evolving field. More details
+can be found on a webpage: https://raphael-painter.github.io/.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Even Small Correlation and Diversity Shifts Pose Dataset-Bias Issues
+
+
+ Distribution shifts are common in real-world datasets and can affect the
+performance and reliability of deep learning models. In this paper, we study
+two types of distribution shifts: diversity shifts, which occur when test
+samples exhibit patterns unseen during training, and correlation shifts, which
+occur when test data present a different correlation between seen invariant and
+spurious features. We propose an integrated protocol to analyze both types of
+shifts using datasets where they co-exist in a controllable manner. Finally, we
+apply our approach to a real-world classification problem of skin cancer
+analysis, using out-of-distribution datasets and specialized bias annotations.
+Our protocol reveals three findings: 1) Models learn and propagate correlation
+shifts even with low-bias training; this poses a risk of accumulating and
+combining unaccountable weak biases; 2) Models learn robust features in high-
+and low-bias scenarios but use spurious ones if test samples have them; this
+suggests that spurious correlations do not impair the learning of robust
+features; 3) Diversity shift can reduce the reliance on spurious correlations;
+this is counter intuitive since we expect biased models to depend more on
+biases when invariant features are missing. Our work has implications for
+distribution shift research and practice, providing new insights into how
+models learn and rely on spurious correlations under different types of shifts.
+
+
+
+ comment: Paper under consideration at Pattern Recognition Letters
+
+
+
+
+
+
+
+ Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang
+
+
+ The Segment Anything Model (SAM) is a powerful foundation model that has
+revolutionised image segmentation. To apply SAM to surgical instrument
+segmentation, a common approach is to locate precise points or boxes of
+instruments and then use them as prompts for SAM in a zero-shot manner.
+However, we observe two problems with this naive pipeline: (1) the domain gap
+between natural objects and surgical instruments leads to inferior
+generalisation of SAM; and (2) SAM relies on precise point or box locations for
+accurate segmentation, requiring either extensive manual guidance or a
+well-performing specialist detector for prompt preparation, which leads to a
+complex multi-stage pipeline. To address these problems, we introduce
+SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to
+effectively integrate surgical-specific information with SAM's pre-trained
+knowledge for improved generalisation. Specifically, we propose a lightweight
+prototype-based class prompt encoder for tuning, which directly generates
+prompt embeddings from class prototypes and eliminates the use of explicit
+prompts for improved robustness and a simpler pipeline. In addition, to address
+the low inter-class variance among surgical instrument categories, we propose
+contrastive prototype learning, further enhancing the discrimination of the
+class prototypes for more accurate class prompting. The results of extensive
+experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that
+SurgicalSAM achieves state-of-the-art performance while only requiring a small
+number of tunable parameters. The source code is available at
+https://github.com/wenxi-yue/SurgicalSAM.
+
+
+
+ comment: AAAI2024. The source code is available at
+ https://github.com/wenxi-yue/SurgicalSAM
+
+
+
+
+
+
+ ♻ ☆ Unleashing the Potential of Adjacent Snippets for Weakly-supervised
+ Temporal Action Localization ICME2023
+
+
+ Weakly-supervised temporal action localization (WTAL) intends to detect
+action instances with only weak supervision, \eg, video-level labels. The
+current~\textit{de facto} pipeline locates action instances by thresholding and
+grouping continuous high-score regions on temporal class activation sequences.
+In this route, the capacity of the model to recognize the relationships between
+adjacent snippets is of vital importance which determines the quality of the
+action boundaries. However, it is error-prone since the variations between
+adjacent snippets are typically subtle, and unfortunately this is overlooked in
+the literature. To tackle the issue, we propose a novel WTAL approach named
+Convex Combination Consistency between Neighbors (C$^3$BN). C$^3$BN consists of
+two key ingredients: a micro data augmentation strategy that increases the
+diversity in-between adjacent snippets by convex combination of adjacent
+snippets, and a macro-micro consistency regularization that enforces the model
+to be invariant to the transformations~\textit{w.r.t.} video semantics, snippet
+predictions, and snippet representations. Consequently, fine-grained patterns
+in-between adjacent snippets are enforced to be explored, thereby resulting in
+a more robust action boundary localization. Experimental results demonstrate
+the effectiveness of C$^3$BN on top of various baselines for WTAL with
+video-level and point-level supervisions. Code is at
+https://github.com/Qinying-Liu/C3BN.
+
+
+
+ comment: ICME2023
+
+
+
+
+
+
+ ♻ ☆ MARS: Mask Attention Refinement with Sequential Quadtree Nodes for Car
+ Damage Instance Segmentation
+
+
+ Evaluating car damages from misfortune is critical to the car insurance
+industry. However, the accuracy is still insufficient for real-world
+applications since the deep learning network is not designed for car damage
+images as inputs, and its segmented masks are still very coarse. This paper
+presents MARS (Mask Attention Refinement with Sequential quadtree nodes) for
+car damage instance segmentation. Our MARS represents self-attention mechanisms
+to draw global dependencies between the sequential quadtree nodes layer and
+quadtree transformer to recalibrate channel weights and predict highly accurate
+instance masks. Our extensive experiments demonstrate that MARS outperforms
+state-of-the-art (SOTA) instance segmentation methods on three popular
+benchmarks such as Mask R-CNN [9], PointRend [13], and Mask Transfiner [12], by
+a large margin of +1.3 maskAP-based R50-FPN backbone and +2.3 maskAP-based
+R101-FPN backbone on Thai car-damage dataset. Our demos are available at
+https://github.com/kaopanboonyuen/MARS.
+
+
+
+ comment: 12 pages. arXiv admin note: substantial text overlap with
+ arXiv:2111.13673 by other authors
+
+
+
+
+
+
+ ♻ ☆ 3D Shape Knowledge Graph for Cross-domain 3D Shape Retrieval
+
+
+ The surge in 3D modeling has led to a pronounced research emphasis on the
+field of 3D shape retrieval. Numerous contemporary approaches have been put
+forth to tackle this intricate challenge. Nevertheless, effectively addressing
+the intricacies of cross-modal 3D shape retrieval remains a formidable
+undertaking, owing to inherent modality-based disparities. This study presents
+an innovative notion, termed "geometric words", which functions as elemental
+constituents for representing entities through combinations. To establish the
+knowledge graph, we employ geometric words as nodes, connecting them via shape
+categories and geometry attributes. Subsequently, we devise a unique graph
+embedding method for knowledge acquisition. Finally, an effective similarity
+measure is introduced for retrieval purposes. Importantly, each 3D or 2D entity
+can anchor its geometric terms within the knowledge graph, thereby serving as a
+link between cross-domain data. As a result, our approach facilitates multiple
+cross-domain 3D shape retrieval tasks. We evaluate the proposed method's
+performance on the ModelNet40 and ShapeNetCore55 datasets, encompassing
+scenarios related to 3D shape retrieval and cross-domain retrieval.
+Furthermore, we employ the established cross-modal dataset (MI3DOR) to assess
+cross-modal 3D shape retrieval. The resulting experimental outcomes, in
+conjunction with comparisons against state-of-the-art techniques, clearly
+highlight the superiority of our approach.
+
+
+
+
+
+
+
+ ♻ ☆ Sustainable Transparency in Recommender Systems: Bayesian Ranking of
+ Images for Explainability
+
+
+ Recommender Systems have become crucial in the modern world, commonly guiding
+users towards relevant content or products, and having a large influence over
+the decisions of users and citizens. However, ensuring transparency and user
+trust in these systems remains a challenge; personalized explanations have
+emerged as a solution, offering justifications for recommendations. Among the
+existing approaches for generating personalized explanations, using existing
+visual content created by users is a promising option to maximize transparency
+and user trust. State-of-the-art models that follow this approach, despite
+leveraging highly optimized architectures, employ surrogate learning tasks that
+do not efficiently model the objective of ranking images as explanations for a
+given recommendation; this leads to a suboptimal training process with high
+computational costs that may not be reduced without affecting model
+performance. This work presents BRIE, a novel model where we leverage Bayesian
+Pairwise Ranking to enhance the training process, allowing us to consistently
+outperform state-of-the-art models in six real-world datasets while reducing
+its model size by up to 64 times and its CO${_2}$ emissions by up to 75% in
+training and inference.
+
+
+ Due to the scarcity of sampling data in reality, few-shot object detection
+(FSOD) has drawn more and more attention because of its ability to quickly
+train new detection concepts with less data. However, there are still failure
+identifications due to the difficulty in distinguishing confusable classes. We
+also notice that the high standard deviation of average precision reveals the
+inconsistent detection performance. To this end, we propose a novel FSOD method
+with Refined Contrastive Learning (FSRC). A pre-determination component is
+introduced to find out the Resemblance Group from novel classes which contains
+confusable classes. Afterwards, Refined Contrastive Learning (RCL) is pointedly
+performed on this group of classes in order to increase the inter-class
+distances among them. In the meantime, the detection results distribute more
+uniformly which further improve the performance. Experimental results based on
+PASCAL VOC and COCO datasets demonstrate our proposed method outperforms the
+current state-of-the-art research.
+
+
+
+
+
+
+
+ ♻ ☆ OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic
+ Segmentation
+
+
+ Light field cameras, by harnessing the power of micro-lens array, are capable
+of capturing intricate angular and spatial details. This allows for acquiring
+complex light patterns and details from multiple angles, significantly
+enhancing the precision of image semantic segmentation, a critical aspect of
+scene interpretation in vision intelligence. However, the extensive angular
+information of light field cameras contains a large amount of redundant data,
+which is overwhelming for the limited hardware resources of intelligent
+vehicles. Besides, inappropriate compression leads to information corruption
+and data loss. To excavate representative information, we propose a new
+paradigm, Omni-Aperture Fusion model (OAFuser), which leverages dense context
+from the central view and discovers the angular information from sub-aperture
+images to generate a semantically consistent result. To avoid feature loss
+during network propagation and simultaneously streamline the redundant
+information from the light field camera, we present a simple yet very effective
+Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular
+features without any additional memory cost. Furthermore, to address the
+mismatched spatial information across viewpoints, we present a Center Angular
+Rectification Module (CARM) to realize feature resorting and prevent feature
+occlusion caused by asymmetric information. Our proposed OAFuser achieves
+state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a
+new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain
+of +4.53%. The source code of OAFuser will be available at
+https://github.com/FeiBryantkit/OAFuser.
+
+
+
+ comment: The source code of OAFuser will be made publicly available at
+ https://github.com/FeiBryantkit/OAFuser
+
+
+
+
+
+
+ ♻ ☆ ParsNets: A Parsimonious Orthogonal and Low-Rank Linear Networks for
+ Zero-Shot Learning
+
+
+ This paper provides a novel parsimonious yet efficient design for zero-shot
+learning (ZSL), dubbed ParsNets, where we are interested in learning a
+composition of on-device friendly linear networks, each with orthogonality and
+low-rankness properties, to achieve equivalent or even better performance
+against existing deep models. Concretely, we first refactor the core module of
+ZSL, i.e., visual-semantics mapping function, into several base linear networks
+that correspond to diverse components of the semantic space, where the complex
+nonlinearity can be collapsed into simple local linearities. Then, to
+facilitate the generalization of local linearities, we construct a maximal
+margin geometry on the learned features by enforcing low-rank constraints on
+intra-class samples and high-rank constraints on inter-class samples, resulting
+in orthogonal subspaces for different classes and each subspace lies on a
+compact manifold. To enhance the model's adaptability and counterbalance
+over/under-fittings in ZSL, a set of sample-wise indicators is employed to
+select a sparse subset from these base linear networks to form a composite
+semantic predictor for each sample. Notably, maximal margin geometry can
+guarantee the diversity of features, and meanwhile, local linearities guarantee
+efficiency. Thus, our ParsNets can generalize better to unseen classes and can
+be deployed flexibly on resource-constrained devices. Theoretical explanations
+and extensive experiments are conducted to verify the effectiveness of the
+proposed method.
+
+
+
+ comment: 10 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ RealCraft: Attention Control as A Solution for Zero-shot Long Video
+ Editing
+
+
+ Although large-scale text-to-image generative models have shown promising
+performance in synthesizing high-quality images, directly applying these models
+to image editing remains a significant challenge. This challenge is further
+amplified in video editing due to the additional dimension of time. Especially
+for editing real videos as it necessitates maintaining a stable semantic layout
+across the frames while executing localized edits precisely without disrupting
+the existing backgrounds. In this paper, we propose RealCraft, an
+attention-control-based method for zero-shot editing in real videos. By
+employing the object-centric manipulation of cross-attention between prompts
+and frames and spatial-temporal attention within the frames, we achieve precise
+shape-wise editing along with enhanced consistency. Our model can be used
+directly with Stable Diffusion and operates without the need for additional
+localized information. We showcase our zero-shot attention-control-based method
+across a range of videos, demonstrating localized, high-fidelity, shape-precise
+and time-consistent editing in videos of various lengths, up to 64 frames.
+
+
+
+
+
+
+
+ ♻ ☆ Comparison of two data fusion approaches for land use classification
+
+
+
+
+
+
+
+
+ Martin Cubaud, Arnaud Le Bris, Laurence Jolivet, Ana-Maria Olteanu-Raimond
+
+
+ Accurate land use maps, describing the territory from an anthropic
+utilisation point of view, are useful tools for land management and planning.
+To produce them, the use of optical images alone remains limited. It is
+therefore necessary to make use of several heterogeneous sources, each carrying
+complementary or contradictory information due to their imperfections or their
+different specifications. This study compares two different approaches i.e. a
+pre-classification and a post-classification fusion approach for combining
+several sources of spatial data in the context of land use classification. The
+approaches are applied on authoritative land use data located in the Gers
+department in the southwest of France. Pre-classification fusion, while not
+explicitly modeling imperfections, has the best final results, reaching an
+overall accuracy of 97% and a macro-mean F1 score of 88%.
+
+
+
+
+
+
+
+ ♻ ☆ Improving Gradient-Trend Identification: Fast-Adaptive Moment Estimation
+ with Finance-Inspired Triple Exponential Moving Average
+
+
+ The performance improvement of deep networks significantly depends on their
+optimizers. With existing optimizers, precise and efficient recognition of the
+gradients trend remains a challenge. Existing optimizers predominantly adopt
+techniques based on the first-order exponential moving average (EMA), which
+results in noticeable delays that impede the real-time tracking of gradients
+trend and consequently yield sub-optimal performance. To overcome this
+limitation, we introduce a novel optimizer called fast-adaptive moment
+estimation (FAME). Inspired by the triple exponential moving average (TEMA)
+used in the financial domain, FAME leverages the potency of higher-order TEMA
+to improve the precision of identifying gradient trends. TEMA plays a central
+role in the learning process as it actively influences optimization dynamics;
+this role differs from its conventional passive role as a technical indicator
+in financial contexts. Because of the introduction of TEMA into the
+optimization process, FAME can identify gradient trends with higher accuracy
+and fewer lag issues, thereby offering smoother and more consistent responses
+to gradient fluctuations compared to conventional first-order EMA. To study the
+effectiveness of our novel FAME optimizer, we conducted comprehensive
+experiments encompassing six diverse computer-vision benchmarks and tasks,
+spanning detection, classification, and semantic comprehension. We integrated
+FAME into 15 learning architectures and compared its performance with those of
+six popular optimizers. Results clearly showed that FAME is more robust and
+accurate and provides superior performance stability by minimizing noise (i.e.,
+trend fluctuations). Notably, FAME achieves higher accuracy levels in
+remarkably fewer training epochs than its counterparts, clearly indicating its
+significance for optimizing deep networks in computer-vision tasks.
+
+
+
+
+
+
+
+ ♻ ☆ When SAM Meets Medical Images: An Investigation of Segment Anything
+ Model (SAM) on Multi-phase Liver Tumor Segmentation
+
+
+ Learning to segmentation without large-scale samples is an inherent
+capability of human. Recently, Segment Anything Model (SAM) performs the
+significant zero-shot image segmentation, attracting considerable attention
+from the computer vision community. Here, we investigate the capability of SAM
+for medical image analysis, especially for multi-phase liver tumor segmentation
+(MPLiTS), in terms of prompts, data resolution, phases. Experimental results
+demonstrate that there might be a large gap between SAM and expected
+performance. Fortunately, the qualitative results show that SAM is a powerful
+annotation tool for the community of interactive medical image segmentation.
+
+
+
+ comment: Preliminary investigation
+
+
+
+
+
+
+ ♻ ☆ Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged
+ Locomotion
+
+
+
+
+
+
+
+
+ Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang
+
+
+ Robust locomotion control depends on accurate state estimations. However, the
+sensors of most legged robots can only provide partial and noisy observations,
+making the estimation particularly challenging, especially for external states
+like terrain frictions and elevation maps. Inspired by the classical Internal
+Model Control principle, we consider these external states as disturbances and
+introduce Hybrid Internal Model (HIM) to estimate them according to the
+response of the robot. The response, which we refer to as the hybrid internal
+embedding, contains the robot's explicit velocity and implicit stability
+representation, corresponding to two primary goals for locomotion tasks:
+explicitly tracking velocity and implicitly maintaining stability. We use
+contrastive learning to optimize the embedding to be close to the robot's
+successor state, in which the response is naturally embedded. HIM has several
+appealing benefits: It only needs the robot's proprioceptions, i.e., those from
+joint encoders and IMU as observations. It innovatively maintains consistent
+observations between simulation reference and reality that avoids information
+loss in mimicking learning. It exploits batch-level information that is more
+robust to noises and keeps better sample efficiency. It only requires 1 hour of
+training on an RTX 4090 to enable a quadruped robot to traverse any terrain
+under any disturbances. A wealth of real-world experiments demonstrates its
+agility, even in high-difficulty tasks and cases never occurred during the
+training process, revealing remarkable open-world generalizability.
+
+
+
+ comment: Use 1 hour to train a quadruped robot capable of traversing any
+ terrain under any disturbances in the open world, Project Page:
+ https://github.com/OpenRobotLab/HIMLoco
+
+
+
+
+
+
+ ♻ ☆ Semantic Invariant Multi-view Clustering with Fully Incomplete
+ Information
+
+
+ Robust multi-view learning with incomplete information has received
+significant attention due to issues such as incomplete correspondences and
+incomplete instances that commonly affect real-world multi-view applications.
+Existing approaches heavily rely on paired samples to realign or impute
+defective ones, but such preconditions cannot always be satisfied in practice
+due to the complexity of data collection and transmission. To address this
+problem, we present a novel framework called SeMantic Invariance LEarning
+(SMILE) for multi-view clustering with incomplete information that does not
+require any paired samples. To be specific, we discover the existence of
+invariant semantic distribution across different views, which enables SMILE to
+alleviate the cross-view discrepancy to learn consensus semantics without
+requiring any paired samples. The resulting consensus semantics remain
+unaffected by cross-view distribution shifts, making them useful for
+realigning/imputing defective instances and forming clusters. We demonstrate
+the effectiveness of SMILE through extensive comparison experiments with 13
+state-of-the-art baselines on five benchmarks. Our approach improves the
+clustering accuracy of NoisyMNIST from 19.3\%/23.2\% to 82.7\%/69.0\% when the
+correspondences/instances are fully incomplete. The code could be accessed from
+https://pengxi.me.
+
+
+ Recognition of occluded objects in unseen and unstructured indoor
+environments is a challenging problem for mobile robots. To address this
+challenge, we propose a new descriptor, TOPS, for point clouds generated from
+depth images and an accompanying recognition framework, THOR, inspired by human
+reasoning. The descriptor employs a novel slicing-based approach to compute
+topological features from filtrations of simplicial complexes using persistent
+homology, and facilitates reasoning-based recognition using object unity. Apart
+from a benchmark dataset, we report performance on a new dataset, the UW Indoor
+Scenes (UW-IS) Occluded dataset, curated using commodity hardware to reflect
+real-world scenarios with different environmental conditions and degrees of
+object occlusion. THOR outperforms state-of-the-art methods on both the
+datasets and achieves substantially higher recognition accuracy for all the
+scenarios of the UW-IS Occluded dataset. Therefore, THOR, is a promising step
+toward robust recognition in low-cost robots, meant for everyday use in indoor
+settings.
+
+
+
+ comment: This work has been accepted for publication in the IEEE Transactions
+ on Robotics
+
+
+
+
+
+
+ ♻ ☆ SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment
+ Anything Model AAAI 2024
+
+
+
+
+
+
+
+
+ Shili Zhou, Ruian He, Weimin Tan, Bo Yan
+
+
+ Optical Flow Estimation aims to find the 2D dense motion field between two
+frames. Due to the limitation of model structures and training datasets,
+existing methods often rely too much on local clues and ignore the integrity of
+objects, resulting in fragmented motion estimation. Through theoretical
+analysis, we find the pre-trained large vision models are helpful in optical
+flow estimation, and we notice that the recently famous Segment Anything Model
+(SAM) demonstrates a strong ability to segment complete objects, which is
+suitable for solving the fragmentation problem. We thus propose a solution to
+embed the frozen SAM image encoder into FlowFormer to enhance object
+perception. To address the challenge of in-depth utilizing SAM in
+non-segmentation tasks like optical flow estimation, we propose an Optical Flow
+Task-Specific Adaption scheme, including a Context Fusion Module to fuse the
+SAM encoder with the optical flow context encoder, and a Context Adaption
+Module to adapt the SAM features for optical flow task with Learned
+Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10
+clean/final EPE and 3.55/12.32 EPE/F1-all on Sintel and KITTI-15 training set,
+surpassing Flowformer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model
+achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks,
+ranking #1 among all two-frame methods on Sintel clean pass.
+
+
+ Image captioning models are known to perpetuate and amplify harmful societal
+bias in the training set. In this work, we aim to mitigate such gender bias in
+image captioning models. While prior work has addressed this problem by forcing
+models to focus on people to reduce gender misclassification, it conversely
+generates gender-stereotypical words at the expense of predicting the correct
+gender. From this observation, we hypothesize that there are two types of
+gender bias affecting image captioning models: 1) bias that exploits context to
+predict gender, and 2) bias in the probability of generating certain (often
+stereotypical) words because of gender. To mitigate both types of gender
+biases, we propose a framework, called LIBRA, that learns from synthetically
+biased samples to decrease both types of biases, correcting gender
+misclassification and changing gender-stereotypical words to more neutral ones.
+Code is available at https://github.com/rebnej/LIBRA.
+
+
+
+
+
+
+
+
+ Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, Mike Zheng Shou
+
+
+ Recent diffusion-based image editing approaches have exhibited impressive
+editing capabilities in images with simple compositions. However, localized
+editing in complex scenarios has not been well-studied in the literature,
+despite its growing real-world demands. Existing mask-based inpainting methods
+fall short of retaining the underlying structure within the edit region.
+Meanwhile, mask-free attention-based methods often exhibit editing leakage and
+misalignment in more complex compositions. In this work, we develop MAG-Edit, a
+training-free, inference-stage optimization method, which enables localized
+image editing in complex scenarios. In particular, MAG-Edit optimizes the noise
+latent feature in diffusion models by maximizing two mask-based cross-attention
+constraints of the edit token, which in turn gradually enhances the local
+alignment with the desired prompt. Extensive quantitative and qualitative
+experiments demonstrate the effectiveness of our method in achieving both text
+alignment and structure preservation for localized editing within complex
+scenarios.
+
+
+
+ comment: for project page, see https://mag-edit.github.io/
+
+
+
+
+
+
+ ♻ ☆ Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution --
+ a Non-Denoising Model
+
+
+ Large scale image super-resolution is a challenging computer vision task,
+since vast information is missing in a highly degraded image, say for example
+forscale x16 super-resolution. Diffusion models are used successfully in recent
+years in extreme super-resolution applications, in which Gaussian noise is used
+as a means to form a latent photo-realistic space, and acts as a link between
+the space of latent vectors and the latent photo-realistic space. There are
+quite a few sophisticated mathematical derivations on mapping the statistics of
+Gaussian noises making Diffusion Models successful. In this paper we propose a
+simple approach which gets away from using Gaussian noise but adopts some basic
+structures of diffusion models for efficient image super-resolution.
+Essentially, we propose a DNN to perform domain transfer between neighbor
+domains, which can learn the differences in statistical properties to
+facilitate gradual interpolation with results of reasonable quality. Further
+quality improvement is achieved by conditioning the domain transfer with
+reference to the input LR image. Experimental results show that our method
+outperforms not only state-of-the-art large scale super resolution models, but
+also the current diffusion models for image super-resolution. The approach can
+readily be extended to other image-to-image tasks, such as image enlightening,
+inpainting, denoising, etc.
+
+
+
+
+
+
+
+
+ Yanzuo Lu, Meng Shen, Andy J Ma, Xiaohua Xie, Jian-Huang Lai
+
+
+ Universal domain adaptation (UniDA) is a practical but challenging problem,
+in which information about the relation between the source and the target
+domains is not given for knowledge transfer. Existing UniDA methods may suffer
+from the problems of overlooking intra-domain variations in the target domain
+and difficulty in separating between the similar known and unknown class. To
+address these issues, we propose a novel Mutual Learning Network (MLNet) with
+neighborhood invariance for UniDA. In our method, confidence-guided invariant
+feature learning with self-adaptive neighbor selection is designed to reduce
+the intra-domain variations for more generalizable feature representation. By
+using the cross-domain mixup scheme for better unknown-class identification,
+the proposed method compensates for the misidentified known-class errors by
+mutual learning between the closed-set and open-set classifiers. Extensive
+experiments on three publicly available benchmarks demonstrate that our method
+achieves the best results compared to the state-of-the-arts in most cases and
+significantly outperforms the baseline across all the four settings in UniDA.
+Code is available at https://github.com/YanzuoLu/MLNet.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ LMC: Large Model Collaboration with Cross-assessment for Training-Free
+ Open-Set Object Recognition NeurIPS 2023
+
+
+
+
+
+
+
+
+ Haoxuan Qu, Xiaofei Hui, Yujun Cai, Jun Liu
+
+
+ Open-set object recognition aims to identify if an object is from a class
+that has been encountered during training or not. To perform open-set object
+recognition accurately, a key challenge is how to reduce the reliance on
+spurious-discriminative features. In this paper, motivated by that different
+large models pre-trained through different paradigms can possess very rich
+while distinct implicit knowledge, we propose a novel framework named Large
+Model Collaboration (LMC) to tackle the above challenge via collaborating
+different off-the-shelf large models in a training-free manner. Moreover, we
+also incorporate the proposed framework with several novel designs to
+effectively extract implicit knowledge from large models. Extensive experiments
+demonstrate the efficacy of our proposed framework. Code is available
+https://github.com/Harryqu123/LMC
+
+
+
+
+
+
+
+
+ Shangchao Su, Mingzhao Yang, Bin Li, Xiangyang Xue
+
+
+ Federated learning (FL) enables multiple clients to collaboratively train a
+global model without disclosing their data. Previous researches often require
+training the complete model parameters. However, the emergence of powerful
+pre-trained models makes it possible to achieve higher performance with fewer
+learnable parameters in FL. In this paper, we propose a federated adaptive
+prompt tuning algorithm, FedAPT, for multi-domain collaborative image
+classification with powerful foundation models, like CLIP. Compared with direct
+federated prompt tuning, our core idea is to adaptively unlock specific domain
+knowledge for each test sample in order to provide them with personalized
+prompts. To implement this idea, we design an adaptive prompt tuning module,
+which consists of a meta prompt, an adaptive network, and some keys. The server
+randomly generates a set of keys and assigns a unique key to each client. Then
+all clients cooperatively train the global adaptive network and meta prompt
+with the local datasets and the frozen keys. Ultimately, the global aggregation
+model can assign a personalized prompt to CLIP based on the domain features of
+each test sample. We perform extensive experiments on two multi-domain image
+classification datasets across two different settings -- supervised and
+unsupervised. The results show that FedAPT can achieve better performance with
+less than 10\% of the number of parameters of the fully trained model, and the
+global model can perform well in diverse client domains simultaneously.
+
+
+
+
+
+
+
+ ♻ ☆ Video-based Surgical Skill Assessment using Tree-based Gaussian Process
+ Classifier
+
+
+
+
+
+
+
+
+ Arefeh Rezaei, Mohammad Javad Ahmadi, Amir Molaei, Hamid. D. Taghirad
+
+
+ This paper aims to present a novel pipeline for automated surgical skill
+assessment using video data and to showcase the effectiveness of the proposed
+approach in evaluating surgeon proficiency, its potential for targeted training
+interventions, and quality assurance in surgical departments. The pipeline
+incorporates a representation flow convolutional neural network and a novel
+tree-based Gaussian process classifier, which is robust to noise, while being
+computationally efficient. Additionally, new kernels are introduced to enhance
+accuracy. The performance of the pipeline is evaluated using the JIGSAWS
+dataset. Comparative analysis with existing literature reveals significant
+improvement in accuracy and betterment in computation cost. The proposed
+pipeline contributes to computational efficiency and accuracy improvement in
+surgical skill assessment using video data. Results of our study based on
+comments of our colleague surgeons show that the proposed method has the
+potential to facilitate skill improvement among surgery fellows and enhance
+patient safety through targeted training interventions and quality assurance in
+surgical departments.
+
+
+
+ comment: 11 pages, 2 figures, journal
+
+
+
+
+
+
+ ♻ ☆ LMDrive: Closed-Loop End-to-End Driving with Large Language Models
+
+
+
+
+
+
+
+
+ Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, Hongsheng Li
+
+
+ Despite significant recent progress in the field of autonomous driving,
+modern methods still struggle and can incur serious accidents when encountering
+long-tail unforeseen events and challenging urban scenarios. On the one hand,
+large language models (LLM) have shown impressive reasoning capabilities that
+approach "Artificial General Intelligence". On the other hand, previous
+autonomous driving methods tend to rely on limited-format inputs (e.g. sensor
+data and navigation waypoints), restricting the vehicle's ability to understand
+language information and interact with humans. To this end, this paper
+introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous
+driving framework. LMDrive uniquely processes and integrates multi-modal sensor
+data with natural language instructions, enabling interaction with humans and
+navigation software in realistic instructional settings. To facilitate further
+research in language-based closed-loop autonomous driving, we also publicly
+release the corresponding dataset which includes approximately 64K
+instruction-following data clips, and the LangAuto benchmark that tests the
+system's ability to handle complex instructions and challenging driving
+scenarios. Extensive closed-loop experiments are conducted to demonstrate
+LMDrive's effectiveness. To the best of our knowledge, we're the very first
+work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes,
+models, and datasets can be found at https://github.com/opendilab/LMDrive
+
+
+
+
+
+
+
+ ♻ ☆ Unleashing Large-Scale Video Generative Pre-training for Visual Robot
+ Manipulation
+
+
+
+
+
+
+
+
+ Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong
+
+
+ Generative pre-trained models have demonstrated remarkable effectiveness in
+language and vision domains by learning useful representations. In this paper,
+we extend the scope of this effectiveness by showing that visual robot
+manipulation can significantly benefit from large-scale video generative
+pre-training. We introduce GR-1, a straightforward GPT-style model designed for
+multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs
+a language instruction, a sequence of observation images, and a sequence of
+robot states. It predicts robot actions as well as future images in an
+end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly
+finetuned on robot data after pre-trained on a large-scale video dataset. We
+perform extensive experiments on the challenging CALVIN benchmark and a real
+robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline
+methods and improves the success rate from 88.9% to 94.9%. In the setting of
+zero-shot unseen scene generalization, GR-1 improves the success rate from
+53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline
+methods and shows strong potentials in generalization to unseen scenes and
+objects. We provide inaugural evidence that a unified GPT-style transformer,
+augmented with large-scale video generative pre-training, exhibits remarkable
+generalization to multi-task visual robot manipulation. Project page:
+https://GR1-Manipulation.github.io
+
+
+ 3D object detection from images, one of the fundamental and challenging
+problems in autonomous driving, has received increasing attention from both
+industry and academia in recent years. Benefiting from the rapid development of
+deep learning technologies, image-based 3D detection has achieved remarkable
+progress. Particularly, more than 200 works have studied this problem from 2015
+to 2021, encompassing a broad spectrum of theories, algorithms, and
+applications. However, to date no recent survey exists to collect and organize
+this knowledge. In this paper, we fill this gap in the literature and provide
+the first comprehensive survey of this novel and continuously growing research
+field, summarizing the most commonly used pipelines for image-based 3D
+detection and deeply analyzing each of their components. Additionally, we also
+propose two new taxonomies to organize the state-of-the-art methods into
+different categories, with the intent of providing a more systematic review of
+existing methods and facilitating fair comparisons with future works. In
+retrospect of what has been achieved so far, we also analyze the current
+challenges in the field and discuss future directions for image-based 3D
+detection research.
+
+
+
+ comment: Accepted by T-PAMI
+
+
+
+
+
+
+ ♻ ☆ Dynamic Feature Pruning and Consolidation for Occluded Person
+ Re-Identification AAAI-24
+
+
+
+
+
+
+
+
+ YuTeng Ye, Hang Zhou, Jiale Cai, Chenxing Gao, Youjia Zhang, Junle Wang, Qiang Hu, Junqing Yu, Wei Yang
+
+
+ Occluded person re-identification (ReID) is a challenging problem due to
+contamination from occluders. Existing approaches address the issue with prior
+knowledge cues, such as human body key points and semantic segmentations, which
+easily fail in the presence of heavy occlusion and other humans as occluders.
+In this paper, we propose a feature pruning and consolidation (FPC) framework
+to circumvent explicit human structure parsing. The framework mainly consists
+of a sparse encoder, a multi-view feature mathcing module, and a feature
+consolidation decoder. Specifically, the sparse encoder drops less important
+image tokens, mostly related to background noise and occluders, solely based on
+correlation within the class token attention. Subsequently, the matching stage
+relies on the preserved tokens produced by the sparse encoder to identify
+k-nearest neighbors in the gallery by measuring the image and patch-level
+combined similarity. Finally, we use the feature consolidation module to
+compensate pruned features using identified neighbors for recovering essential
+information while disregarding disturbance from noise and occlusion.
+Experimental results demonstrate the effectiveness of our proposed framework on
+occluded, partial, and holistic Re-ID datasets. In particular, our method
+outperforms state-of-the-art results by at least 8.6\% mAP and 6.0\% Rank-1
+accuracy on the challenging Occluded-Duke dataset.
+
+
+
+ comment: Accepted by AAAI-24
+
+
+
+
+
+
+ ♻ ☆ An Empirical Study of CLIP for Text-based Person Search AAAI 2024
+
+
+
+
+
+
+
+
+ Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
+
+
+ Text-based Person Search (TBPS) aims to retrieve the person images using
+natural language descriptions. Recently, Contrastive Language Image Pretraining
+(CLIP), a universal large cross-modal vision-language pre-training model, has
+remarkably performed over various cross-modal downstream tasks due to its
+powerful cross-modal semantic learning capacity. TPBS, as a fine-grained
+cross-modal retrieval task, is also facing the rise of research on the
+CLIP-based TBPS. In order to explore the potential of the visual-language
+pre-training model for downstream TBPS tasks, this paper makes the first
+attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus
+contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the
+TBPS community. We revisit critical design considerations under CLIP, including
+data augmentation and loss function. The model, with the aforementioned designs
+and practical training tricks, can attain satisfactory performance without any
+sophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in
+model generalization and model compression, demonstrating the effectiveness of
+TBPS-CLIP from various aspects. This work is expected to provide empirical
+insights and highlight future CLIP-based TBPS research.
+
+
+
+ comment: Accepted by AAAI 2024. Code is available at
+ https://github.com/Flame-Chasers/TBPS-CLIP
+
+
+
+
+
+
+ ♻ ☆ Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
+
+
+ The core of cross-modal matching is to accurately measure the similarity
+between different modalities in a unified representation space. However,
+compared to textual descriptions of a certain perspective, the visual modality
+has more semantic variations. So, images are usually associated with multiple
+textual captions in databases. Although popular symmetric embedding methods
+have explored numerous modal interaction approaches, they often learn toward
+increasing the average expression probability of multiple semantic variations
+within image embeddings. Consequently, information entropy in embeddings is
+increased, resulting in redundancy and decreased accuracy. In this work, we
+propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the
+information entropy. Specifically, we obtain a set of heterogeneous visual
+sub-embeddings through dynamic orthogonal constraint loss. To encourage the
+generated candidate embeddings to capture various semantic variations, we
+construct a mixed distribution and employ a variance-aware weighting loss to
+assign different weights to the optimization process. In addition, we develop a
+Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and
+enhance the performance. We compare the performance with existing set-based
+method using four image feature encoders and two text feature encoders on three
+benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role
+of different components by ablation studies and perform a sensitivity analysis
+of the hyperparameters. The qualitative analysis of visualized bidirectional
+retrieval and attention maps further demonstrates the ability of our method to
+encode semantic variations.
+
+
+
+
+
+
+
+ ♻ ☆ Semantic segmentation of longitudinal thermal images for identification
+ of hot and cool spots in urban areas
+
+
+ This work presents the analysis of semantically segmented, longitudinally,
+and spatially rich thermal images collected at the neighborhood scale to
+identify hot and cool spots in urban areas. An infrared observatory was
+operated over a few months to collect thermal images of different types of
+buildings on the educational campus of the National University of Singapore. A
+subset of the thermal image dataset was used to train state-of-the-art deep
+learning models to segment various urban features such as buildings,
+vegetation, sky, and roads. It was observed that the U-Net segmentation model
+with `resnet34' CNN backbone has the highest mIoU score of 0.99 on the test
+dataset, compared to other models such as DeepLabV3, DeeplabV3+, FPN, and
+PSPnet. The masks generated using the segmentation models were then used to
+extract the temperature from thermal images and correct for differences in the
+emissivity of various urban features. Further, various statistical measure of
+the temperature extracted using the predicted segmentation masks is shown to
+closely match the temperature extracted using the ground truth masks. Finally,
+the masks were used to identify hot and cool spots in the urban feature at
+various instances of time. This forms one of the very few studies demonstrating
+the automated analysis of thermal images, which can be of potential use to
+urban planners for devising mitigation strategies for reducing the urban heat
+island (UHI) effect, improving building energy efficiency, and maximizing
+outdoor thermal comfort.
+
+
+
+ comment: 14 pages, 13 figures
+
+
+
+
+
+
+ ♻ ☆ CoSeR: Bridging Image and Language for Cognitive Super-Resolution
+
+
+ Existing super-resolution (SR) models primarily focus on restoring local
+texture details, often neglecting the global semantic information within the
+scene. This oversight can lead to the omission of crucial semantic details or
+the introduction of inaccurate textures during the recovery process. In our
+work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering
+SR models with the capacity to comprehend low-resolution images. We achieve
+this by marrying image appearance and language understanding to generate a
+cognitive embedding, which not only activates prior information from large
+text-to-image diffusion models but also facilitates the generation of
+high-quality reference images to optimize the SR process. To further improve
+image fidelity, we propose a novel condition injection scheme called
+"All-in-Attention", consolidating all conditional information into a single
+module. Consequently, our method successfully restores semantically correct and
+photorealistic details, demonstrating state-of-the-art performance across
+multiple benchmarks. Code: https://github.com/VINHYU/CoSeR
+
+
+ Generating realistic human motion sequences from text descriptions is a
+challenging task that requires capturing the rich expressiveness of both
+natural language and human motion.Recent advances in diffusion models have
+enabled significant progress in human motion synthesis.However, existing
+methods struggle to handle text inputs that describe complex or long motions.In
+this paper, we propose the Adaptable Motion Diffusion (AMD) model, which
+leverages a Large Language Model (LLM) to parse the input text into a sequence
+of concise and interpretable anatomical scripts that correspond to the target
+motion.This process exploits the LLM's ability to provide anatomical guidance
+for complex motion synthesis.We then devise a two-branch fusion scheme that
+balances the influence of the input text and the anatomical scripts on the
+inverse diffusion process, which adaptively ensures the semantic fidelity and
+diversity of the synthesized motion.Our method can effectively handle texts
+with complex or long motion descriptions, where existing methods often fail.
+Experiments on datasets with relatively more complex motions, such as CLCD1 and
+CLCD2, demonstrate that our AMD significantly outperforms existing
+state-of-the-art models.
+
+
+
+
+
+
+
+ ♻ ☆ Multiple Instance Learning Framework with Masked Hard Instance Mining
+ for Whole Slide Image Classification ICCV2023
+
+
+
+
+
+
+
+
+ Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, Bo Liu
+
+
+ The whole slide image (WSI) classification is often formulated as a multiple
+instance learning (MIL) problem. Since the positive tissue is only a small
+fraction of the gigapixel WSI, existing MIL methods intuitively focus on
+identifying salient instances via attention mechanisms. However, this leads to
+a bias towards easy-to-classify instances while neglecting hard-to-classify
+instances. Some literature has revealed that hard examples are beneficial for
+modeling a discriminative boundary accurately. By applying such an idea at the
+instance level, we elaborate a novel MIL framework with masked hard instance
+mining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a
+consistency constraint to explore the potential hard instances. With several
+instance masking strategies based on attention scores, MHIM-MIL employs a
+momentum teacher to implicitly mine hard instances for training the student
+model, which can be any attention-based MIL model. This counter-intuitive
+strategy essentially enables the student to learn a better discriminating
+boundary. Moreover, the student is used to update the teacher with an
+exponential moving average (EMA), which in turn identifies new hard instances
+for subsequent training iterations and stabilizes the optimization.
+Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets
+demonstrate that MHIM-MIL outperforms other latest methods in terms of
+performance and training cost. The code is available at:
+https://github.com/DearCaat/MHIM-MIL.
+
+
+
+ comment: Published on ICCV2023
+
+
+
+
+
+
+ ♻ ☆ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable
+ Generalizable 3D Reconstruction
+
+
+
+
+
+
+
+
+ David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
+
+
+ We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D
+radiance fields parameterized by 3D Gaussian primitives from pairs of images.
+Our model features real-time and memory-efficient rendering for scalable
+training as well as fast 3D reconstruction at inference time. To overcome local
+minima inherent to sparse and locally supported representations, we predict a
+dense probability distribution over 3D and sample Gaussian means from that
+probability distribution. We make this sampling operation differentiable via a
+reparameterization trick, allowing us to back-propagate gradients through the
+Gaussian splatting representation. We benchmark our method on wide-baseline
+novel view synthesis on the real-world RealEstate10k and ACID datasets, where
+we outperform state-of-the-art light field transformers and accelerate
+rendering by 2.5 orders of magnitude while reconstructing an interpretable and
+editable 3D radiance field.
+
+
+ Understanding object recognition patterns in mice is crucial for advancing
+behavioral neuroscience and has significant implications for human health,
+particularly in the realm of Alzheimer's research. This study is centered on
+the development, application, and evaluation of a state-of-the-art
+computational pipeline designed to analyze such behaviors, specifically
+focusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition
+(SLR) tasks. The pipeline integrates three advanced computational models:
+Any-Maze for initial data collection, DeepLabCut for detailed pose estimation,
+and Convolutional Neural Networks (CNNs) for nuanced behavioral classification.
+Employed across four distinct mouse groups, this pipeline demonstrated high
+levels of accuracy and robustness. Despite certain challenges like video
+quality limitations and the need for manual calculations, the results affirm
+the pipeline's efficacy and potential for scalability. The study serves as a
+proof of concept for a multidimensional computational approach to behavioral
+neuroscience, emphasizing the pipeline's versatility and readiness for future,
+more complex analyses.
+
+
+
+ comment: Aspects of the paper contain errors, and data in the pipeline must be
+ vetted one more time. More testing is necessary
+
+
+
+
+
+
+ ♻ ☆ Two Independent Teachers are Better Role Model
+
+
+
+
+
+
+
+
+ Afifa Khaled, Ahmed A. Mubarak, Kun He
+
+
+ Recent deep learning models have attracted substantial attention in infant
+brain analysis. These models have performed state-of-the-art performance, such
+as semi-supervised techniques (e.g., Temporal Ensembling, mean teacher).
+However, these models depend on an encoder-decoder structure with stacked local
+operators to gather long-range information, and the local operators limit the
+efficiency and effectiveness. Besides, the $MRI$ data contain different tissue
+properties ($TPs$) such as $T1$ and $T2$. One major limitation of these models
+is that they use both data as inputs to the segment process, i.e., the models
+are trained on the dataset once, and it requires much computational and memory
+requirements during inference. In this work, we address the above limitations
+by designing a new deep-learning model, called 3D-DenseUNet, which works as
+adaptable global aggregation blocks in down-sampling to solve the issue of
+spatial information loss. The self-attention module connects the down-sampling
+blocks to up-sampling blocks, and integrates the feature maps in three
+dimensions of spatial and channel, effectively improving the representation
+potential and discriminating ability of the model. Additionally, we propose a
+new method called Two Independent Teachers ($2IT$), that summarizes the model
+weights instead of label predictions. Each teacher model is trained on
+different types of brain data, $T1$ and $T2$, respectively. Then, a fuse model
+is added to improve test accuracy and enable training with fewer parameters and
+labels compared to the Temporal Ensembling method without modifying the network
+architecture. Empirical results demonstrate the effectiveness of the proposed
+method. The code is available at
+https://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.
+
+
+
+ comment: This manuscript contains 14 pages, 7 figures
+
+
+
+
+
+
+
+ Suman Saha, Lukas Hoyer, Anton Obukhov, Dengxin Dai, Luc Van Gool
+
+
+ With autonomous industries on the rise, domain adaptation of the visual
+perception stack is an important research direction due to the cost savings
+promise. Much prior art was dedicated to domain-adaptive semantic segmentation
+in the synthetic-to-real context. Despite being a crucial output of the
+perception stack, panoptic segmentation has been largely overlooked by the
+domain adaptation community. Therefore, we revisit well-performing domain
+adaptation strategies from other fields, adapt them to panoptic segmentation,
+and show that they can effectively enhance panoptic domain adaptation. Further,
+we study the panoptic network design and propose a novel architecture (EDAPS)
+designed explicitly for domain-adaptive panoptic segmentation. It uses a
+shared, domain-robust transformer encoder to facilitate the joint adaptation of
+semantic and instance features, but task-specific decoders tailored for the
+specific requirements of both domain-adaptive semantic and instance
+segmentation. As a result, the performance gap seen in challenging panoptic
+benchmarks is substantially narrowed. EDAPS significantly improves the
+state-of-the-art performance for panoptic segmentation UDA by a large margin of
+20% on SYNTHIA-to-Cityscapes and even 72% on the more challenging
+SYNTHIA-to-Mapillary Vistas. The implementation is available at
+https://github.com/susaha/edaps.
+
+
+ Dataset pruning aims to construct a coreset capable of achieving performance
+comparable to the original, full dataset. Most existing dataset pruning methods
+rely on snapshot-based criteria to identify representative samples, often
+resulting in poor generalization across various pruning and cross-architecture
+scenarios. Recent studies have addressed this issue by expanding the scope of
+training dynamics considered, including factors such as forgetting event and
+probability change, typically using an averaging approach. However, these works
+struggle to integrate a broader range of training dynamics without overlooking
+well-generalized samples, which may not be sufficiently highlighted in an
+averaging manner. In this study, we propose a novel dataset pruning method
+termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS
+utilizes a dual-depth strategy to achieve a balance between incorporating
+extensive training dynamics and identifying representative samples for dataset
+pruning. In the first depth, we estimate the series of each sample's individual
+contributions spanning the training progress, ensuring comprehensive
+integration of training dynamics. In the second depth, we focus on the
+variability of the sample-wise contributions identified in the first depth to
+highlight well-generalized samples. Extensive experiments conducted on CIFAR
+and ImageNet datasets verify the superiority of TDDS over previous SOTA
+methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with
+only 10% training data, surpassing random selection by 7.83% and other
+comparison methods by at least 12.69%.
+
+
+
+
+
+
+
+ ♻ ☆ AM-RADIO: Agglomerative Model -- Reduce All Domains Into One
+
+
+
+
+
+
+
+
+ Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov
+
+
+ A handful of visual foundation models (VFMs) have recently emerged as the
+backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are
+trained with distinct objectives, exhibiting unique characteristics for various
+downstream tasks. We find that despite their conceptual differences, these
+models can be effectively merged into a unified model through multi-teacher
+distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All
+Domains Into One). This integrative approach not only surpasses the performance
+of individual teacher models but also amalgamates their distinctive features,
+such as zero-shot vision-language comprehension, detailed pixel-level
+understanding, and open vocabulary segmentation capabilities. In pursuit of the
+most hardware-efficient backbone, we evaluated numerous architectures in our
+multi-teacher distillation pipeline using the same training recipe. This led to
+the development of a novel architecture (E-RADIO) that exceeds the performance
+of its predecessors and is at least 7x faster than the teacher models. Our
+comprehensive benchmarking process covers downstream tasks including ImageNet
+classification, ADE20k semantic segmentation, COCO object detection and
+LLaVa-1.5 framework.
+ Code: https://github.com/NVlabs/RADIO
+
+
+
+ comment: Version 2: Added more acknowledgements and updated table 7 with more
+ recent results. Ensured that the link in the abstract to our code is working
+ properly
+
+
+
+
+
+
+ ♻ ☆ Self-Supervised Learning for Place Representation Generalization across
+ Appearance Changes WACV 2024
+
+
+ Visual place recognition is a key to unlocking spatial navigation for
+animals, humans and robots. While state-of-the-art approaches are trained in a
+supervised manner and therefore hardly capture the information needed for
+generalizing to unusual conditions, we argue that self-supervised learning may
+help abstracting the place representation so that it can be foreseen,
+irrespective of the conditions. More precisely, in this paper, we investigate
+learning features that are robust to appearance modifications while sensitive
+to geometric transformations in a self-supervised manner. This dual-purpose
+training is made possible by combining the two self-supervision main paradigms,
+\textit{i.e.} contrastive and predictive learning. Our results on standard
+benchmarks reveal that jointly learning such appearance-robust and
+geometry-sensitive image descriptors leads to competitive visual place
+recognition results across adverse seasonal and illumination conditions,
+without requiring any human-annotated labels.
+
+
+ In the dynamic landscape of online businesses, recommender systems are
+pivotal in enhancing user experiences. While traditional approaches have relied
+on static supervised learning, the quest for adaptive, user-centric
+recommendations has led to the emergence of the formulation of contextual
+bandits. This tutorial investigates the contextual bandits as a powerful
+framework for personalized recommendations. We delve into the challenges,
+advanced algorithms and theories, collaborative strategies, and open challenges
+and future prospects within this field. Different from existing related
+tutorials, (1) we focus on the exploration perspective of contextual bandits to
+alleviate the ``Matthew Effect'' in the recommender systems, i.e., the rich get
+richer and the poor get poorer, concerning the popularity of items; (2) in
+addition to the conventional linear contextual bandits, we will also dedicated
+to neural contextual bandits which have emerged as an important branch in
+recent years, to investigate how neural networks benefit contextual bandits for
+personalized recommendation both empirically and theoretically; (3) we will
+cover the latest topic, collaborative neural contextual bandits, to incorporate
+both user heterogeneity and user correlations customized for recommender
+system; (4) we will provide and discuss the new emerging challenges and open
+questions for neural contextual bandits with applications in the personalized
+recommendation, especially for large neural models.
+
+
+
+ comment: WWW'24 Tutorial
+
+
+
+
+
+
+ ☆ A Learning oriented DLP System based on Classification Model
+
+
+ Data is the key asset for organizations and data sharing is lifeline for
+organization growth; which may lead to data loss. Data leakage is the most
+critical issue being faced by organizations. In order to mitigate the data
+leakage issues data leakage prevention systems (DLPSs) are deployed at various
+levels by the organizations. DLPSs are capable to protect all kind of data i.e.
+DAR, DIM/DIT, DIU. Statistical analysis, regular expression, data
+fingerprinting are common approaches exercised in DLP system. Out of these
+techniques; statistical analysis approach is most appropriate for proposed DLP
+model of data security. This paper defines a statistical DLP model for document
+classification. Model uses various statistical approaches like TF-IDF (Term
+Frequency- Inverse Document Frequency) a renowned term count/weighing function,
+Vectorization, Gradient boosting document classification etc. to classify the
+documents before allowing any access to it. Machine learning is used to test
+and train the model. Proposed model also introduces an extremely efficient and
+more accurate approach; IGBCA (Improvised Gradient Boosting Classification
+Algorithm); for document classification, to prevent them from possible data
+leakage. Results depicts that proposed model can classify documents with high
+accuracy and on basis of which data can be prevented from being loss.
+
+
+
+
+
+
+
+ ☆ Unexplored Frontiers: A Review of Empirical Studies of Exploratory
+ Search
+
+
+ This article reviews how empirical research of exploratory search is
+conducted. We investigated aspects of interdisciplinarity, study settings and
+evaluation methodologies from a systematically selected sample of 231
+publications from 2010-2021, including a total of 172 articles with empirical
+studies. Our results show that exploratory search is highly interdisciplinary,
+with the most frequently occurring publication venues including high impact
+venues in information science, information systems and human-computer
+interaction. However, taken in aggregate, the breadth of study settings
+investigated was limited. We found that a majority of studies (77%) focused on
+evaluating novel retrieval systems as opposed to investigating users' search
+processes. Furthermore, a disproportionate number of studies were based on
+scientific literature search (20.7%), a majority of which only considered
+searching for Computer Science articles. Study participants were generally from
+convenience samples, with 75% of studies composed exclusively of students and
+other academics. The methodologies used for evaluation were mostly
+quantitative, but lacked consistency between studies and validated
+questionnaires were rarely used. In discussion, we offer a critical analysis of
+our findings and suggest potential improvements for future exploratory search
+studies.
+
+
+
+
+
+
+
+ ☆ Empowering Few-Shot Recommender Systems with Large Language Models --
+ Enhanced Representations
+
+
+ Recommender systems utilizing explicit feedback have witnessed significant
+advancements and widespread applications over the past years. However,
+generating recommendations in few-shot scenarios remains a persistent
+challenge. Recently, large language models (LLMs) have emerged as a promising
+solution for addressing natural language processing (NLP) tasks, thereby
+offering novel insights into tackling the few-shot scenarios encountered by
+explicit feedback-based recommender systems. To bridge recommender systems and
+LLMs, we devise a prompting template that generates user and item
+representations based on explicit feedback. Subsequently, we integrate these
+LLM-processed representations into various recommendation models to evaluate
+their significance across diverse recommendation tasks. Our ablation
+experiments and case study analysis collectively demonstrate the effectiveness
+of LLMs in processing explicit feedback, highlighting that LLMs equipped with
+generative and logical reasoning capabilities can effectively serve as a
+component of recommender systems to enhance their performance in few-shot
+scenarios. Furthermore, the broad adaptability of LLMs augments the
+generalization potential of recommender models, despite certain inherent
+constraints. We anticipate that our study can inspire researchers to delve
+deeper into the multifaceted dimensions of LLMs's involvement in recommender
+systems and contribute to the advancement of the explicit feedback-based
+recommender systems field.
+
+
+ Query-focused summarization (QFS) aims to provide a summary of a single
+document/multi documents that can satisfy the information needs of a given
+query. It is useful for various real-world applications, such as abstractive
+snippet generation or more recent retrieval augmented generation (RAG). A
+prototypical QFS pipeline consists of a retriever (sparse or dense retrieval)
+and a generator (usually a large language model). However, applying large
+language models (LLM) potentially leads to hallucinations, especially when the
+evidence contradicts the prior belief of LLMs. There has been growing interest
+in developing new decoding methods to improve generation quality and reduce
+hallucination. In this work, we conduct a large-scale reproducibility on one
+recently proposed decoding method -- Context-aware Decoding (CAD). In addition
+to replicating CAD's experiments on news summarization datasets, we include
+experiments on QFS datasets, and conduct more rigorous analysis on
+computational complexity and hyperparameter sensitivity. Experiments with eight
+different language models show that performance-wise, CAD improves QFS quality
+by (1) reducing factuality errors/hallucinations while (2) mostly retaining the
+match of lexical patterns, measured by ROUGE scores, while also at a cost of
+increased inference-time FLOPs and reduced decoding speed. The code
+implementation based on Huggingface Library is made available
+https://github.com/zhichaoxu-shufe/context-aware-decoding-qfs
+
+
+
+ comment: technical report
+
+
+
+
+
+
+ ♻ ☆ Restricted Bernoulli Matrix Factorization: Balancing the trade-off
+ between prediction accuracy and coverage in classification based
+ collaborative filtering
+
+
+
+
+
+
+
+
+ Ángel González-Prieto, Abraham Gutiérrez, Fernando Ortega, Raúl Lara-Cabrera
+
+
+ Reliability measures associated with the prediction of the machine learning
+models are critical to strengthening user confidence in artificial
+intelligence. Therefore, those models that are able to provide not only
+predictions, but also reliability, enjoy greater popularity. In the field of
+recommender systems, reliability is crucial, since users tend to prefer those
+recommendations that are sure to interest them, that is, high predictions with
+high reliabilities. In this paper, we propose Restricted Bernoulli Matrix
+Factorization (ResBeMF), a new algorithm aimed at enhancing the performance of
+classification-based collaborative filtering. The proposed model has been
+compared to other existing solutions in the literature in terms of prediction
+quality (Mean Absolute Error and accuracy scores), prediction quantity
+(coverage score) and recommendation quality (Mean Average Precision score). The
+experimental results demonstrate that the proposed model provides a good
+balance in terms of the quality measures used compared to other recommendation
+models.
+
+
+
+ comment: Several changes performed, including a title change. 21 pages, 7
+ figures, 2 tables
+
+
+
+
+
+
+ ♻ ☆ A Survey on Query-based API Recommendation
+
+
+
+
+
+
+
+
+ Moshi Wei, Nima Shiri Harzevili, Alvine Boaye Belle, Junjie Wang, Lin Shi, Jinqiu Yang, Song Wang, Ming Zhen, Jiang
+
+
+ Application Programming Interfaces (APIs) are designed to help developers
+build software more effectively. Recommending the right APIs for specific tasks
+has gained increasing attention among researchers and developers in recent
+years. To comprehensively understand this research domain, we have surveyed to
+analyze API recommendation studies published in the last 10 years. Our study
+begins with an overview of the structure of API recommendation tools.
+Subsequently, we systematically analyze prior research and pose four key
+research questions. For RQ1, we examine the volume of published papers and the
+venues in which these papers appear within the API recommendation field. In
+RQ2, we categorize and summarize the prevalent data sources and collection
+methods employed in API recommendation research. In RQ3, we explore the types
+of data and common data representations utilized by API recommendation
+approaches. We also investigate the typical data extraction procedures and
+collection approaches employed by the existing approaches. RQ4 delves into the
+modeling techniques employed by API recommendation approaches, encompassing
+both statistical and deep learning models. Additionally, we compile an overview
+of the prevalent ranking strategies and evaluation metrics used for assessing
+API recommendation tools. Drawing from our survey findings, we identify current
+challenges in API recommendation research that warrant further exploration,
+along with potential avenues for future research.
+
+
+
+
+
+
+
+ ♻ ☆ VM-Rec: A Variational Mapping Approach for Cold-start User
+ Recommendation
+
+
+ The cold-start problem is a common challenge for most recommender systems.
+The practical application of most cold-start methods is hindered by the
+deficiency in auxiliary content information for users. Moreover, most methods
+necessitate simultaneous updates to the extensive parameters of recommender
+models, leading to significant training costs, particularly in large-scale
+industrial scenarios. We observe that the model can generate expressive
+embeddings for warm users with relatively more interactions. Initially, these
+users were cold-start users, and after transitioning to warm users, they
+exhibit clustering patterns in their embeddings with consistent initial
+interactions. Based on this motivation, we propose a Variational Mapping
+approach for cold-start user Recommendation (VM-Rec), mapping from few initial
+interactions to expressive embeddings for cold-start users. Specifically, we
+encode the initial interactions into a latent representation, where each
+dimension disentangledly signifies the degree of association with each warm
+user. Subsequently, we utilize this latent representation as the parameters for
+the mapping function, mapping (decoding) it into an expressive embedding, which
+can be integrated into a pre-trained recommender model directly. Our method is
+evaluated on three datasets using the same base model, demonstrating superior
+performance compared to other popular cold-start methods.
+
+
+ Keyphrase extraction is a fundamental task in natural language processing and
+information retrieval that aims to extract a set of phrases with important
+information from a source document. Identifying important keyphrase is the
+central component of the keyphrase extraction task, and its main challenge is
+how to represent information comprehensively and discriminate importance
+accurately. In this paper, to address these issues, we design a new hyperbolic
+matching model (HyperMatch) to represent phrases and documents in the same
+hyperbolic space and explicitly estimate the phrase-document relevance via the
+Poincar\'e distance as the important score of each phrase. Specifically, to
+capture the hierarchical syntactic and semantic structure information,
+HyperMatch takes advantage of the hidden representations in multiple layers of
+RoBERTa and integrates them as the word embeddings via an adaptive mixing
+layer. Meanwhile, considering the hierarchical structure hidden in the
+document, HyperMatch embeds both phrases and documents in the same hyperbolic
+space via a hyperbolic phrase encoder and a hyperbolic document encoder. This
+strategy can further enhance the estimation of phrase-document relevance due to
+the good properties of hyperbolic space. In this setting, the keyphrase
+extraction can be taken as a matching problem and effectively implemented by
+minimizing a hyperbolic margin-based triplet loss. Extensive experiments are
+conducted on six benchmarks and demonstrate that HyperMatch outperforms the
+state-of-the-art baselines.
+
+
+
+ comment: 12 pages, 3 figures, Accepted by NAACL2022
+
+
+
+
+
+
+ ♻ ☆ Sustainable Transparency in Recommender Systems: Bayesian Ranking of
+ Images for Explainability
+
+
+ Recommender Systems have become crucial in the modern world, commonly guiding
+users towards relevant content or products, and having a large influence over
+the decisions of users and citizens. However, ensuring transparency and user
+trust in these systems remains a challenge; personalized explanations have
+emerged as a solution, offering justifications for recommendations. Among the
+existing approaches for generating personalized explanations, using existing
+visual content created by users is a promising option to maximize transparency
+and user trust. State-of-the-art models that follow this approach, despite
+leveraging highly optimized architectures, employ surrogate learning tasks that
+do not efficiently model the objective of ranking images as explanations for a
+given recommendation; this leads to a suboptimal training process with high
+computational costs that may not be reduced without affecting model
+performance. This work presents BRIE, a novel model where we leverage Bayesian
+Pairwise Ranking to enhance the training process, allowing us to consistently
+outperform state-of-the-art models in six real-world datasets while reducing
+its model size by up to 64 times and its CO${_2}$ emissions by up to 75% in
+training and inference.
+
+
+
+
+
+
+
+ ♻ ☆ Importance Estimation from Multiple Perspectives for Keyphrase
+ Extraction EMNLP2021
+
+
+ Keyphrase extraction is a fundamental task in Natural Language Processing,
+which usually contains two main parts: candidate keyphrase extraction and
+keyphrase importance estimation. From the view of human understanding
+documents, we typically measure the importance of phrase according to its
+syntactic accuracy, information saliency, and concept consistency
+simultaneously. However, most existing keyphrase extraction approaches only
+focus on the part of them, which leads to biased results. In this paper, we
+propose a new approach to estimate the importance of keyphrase from multiple
+perspectives (called as \textit{KIEMP}) and further improve the performance of
+keyphrase extraction. Specifically, \textit{KIEMP} estimates the importance of
+phrase with three modules: a chunking module to measure its syntactic accuracy,
+a ranking module to check its information saliency, and a matching module to
+judge the concept (i.e., topic) consistency between phrase and the whole
+document. These three modules are seamlessly jointed together via an end-to-end
+multi-task learning model, which is helpful for three parts to enhance each
+other and balance the effects of three perspectives. Experimental results on
+six benchmark datasets show that \textit{KIEMP} outperforms the existing
+state-of-the-art keyphrase extraction approaches in most cases.
+
+
+
+ comment: 11 pages, 2 figures, Accepted by EMNLP2021
+
+
+
+
+
+
+ ♻ ☆ Embedding in Recommender Systems: A Survey
+
+
+ Recommender systems have become an essential component of many online
+platforms, providing personalized recommendations to users. A crucial aspect is
+embedding techniques that coverts the high-dimensional discrete features, such
+as user and item IDs, into low-dimensional continuous vectors and can enhance
+the recommendation performance. Applying embedding techniques captures complex
+entity relationships and has spurred substantial research. In this survey, we
+provide an overview of the recent literature on embedding techniques in
+recommender systems. This survey covers embedding methods like collaborative
+filtering, self-supervised learning, and graph-based techniques. Collaborative
+filtering generates embeddings capturing user-item preferences, excelling in
+sparse data. Self-supervised methods leverage contrastive or generative
+learning for various tasks. Graph-based techniques like node2vec exploit
+complex relationships in network-rich environments. Addressing the scalability
+challenges inherent to embedding methods, our survey delves into innovative
+directions within the field of recommendation systems. These directions aim to
+enhance performance and reduce computational complexity, paving the way for
+improved recommender systems. Among these innovative approaches, we will
+introduce Auto Machine Learning (AutoML), hash techniques, and quantization
+techniques in this survey. We discuss various architectures and techniques and
+highlight the challenges and future directions in these aspects. This survey
+aims to provide a comprehensive overview of the state-of-the-art in this
+rapidly evolving field and serve as a useful resource for researchers and
+practitioners working in the area of recommender systems.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Large Language Model for Graph Data Understanding in Online
+ Job Recommendations
+
+
+ Large Language Models (LLMs) have revolutionized natural language processing
+tasks, demonstrating their exceptional capabilities in various domains.
+However, their potential for behavior graph understanding in job
+recommendations remains largely unexplored. This paper focuses on unveiling the
+capability of large language models in understanding behavior graphs and
+leveraging this understanding to enhance recommendations in online recruitment,
+including the promotion of out-of-distribution (OOD) application. We present a
+novel framework that harnesses the rich contextual information and semantic
+representations provided by large language models to analyze behavior graphs
+and uncover underlying patterns and relationships. Specifically, we propose a
+meta-path prompt constructor that leverages LLM recommender to understand
+behavior graphs for the first time and design a corresponding path augmentation
+module to alleviate the prompt bias introduced by path-based sequence input. By
+leveraging this capability, our framework enables personalized and accurate job
+recommendations for individual users. We evaluate the effectiveness of our
+approach on a comprehensive dataset and demonstrate its ability to improve the
+relevance and quality of recommended quality. This research not only sheds
+light on the untapped potential of large language models but also provides
+valuable insights for developing advanced recommendation systems in the
+recruitment market. The findings contribute to the growing field of natural
+language processing and offer practical implications for enhancing job search
+experiences. We release the code at https://github.com/WLiK/GLRec.
+
+
+
+
+
+
+
+ ♻ ☆ Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
+
+
+ The core of cross-modal matching is to accurately measure the similarity
+between different modalities in a unified representation space. However,
+compared to textual descriptions of a certain perspective, the visual modality
+has more semantic variations. So, images are usually associated with multiple
+textual captions in databases. Although popular symmetric embedding methods
+have explored numerous modal interaction approaches, they often learn toward
+increasing the average expression probability of multiple semantic variations
+within image embeddings. Consequently, information entropy in embeddings is
+increased, resulting in redundancy and decreased accuracy. In this work, we
+propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the
+information entropy. Specifically, we obtain a set of heterogeneous visual
+sub-embeddings through dynamic orthogonal constraint loss. To encourage the
+generated candidate embeddings to capture various semantic variations, we
+construct a mixed distribution and employ a variance-aware weighting loss to
+assign different weights to the optimization process. In addition, we develop a
+Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and
+enhance the performance. We compare the performance with existing set-based
+method using four image feature encoders and two text feature encoders on three
+benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role
+of different components by ablation studies and perform a sensitivity analysis
+of the hyperparameters. The qualitative analysis of visualized bidirectional
+retrieval and attention maps further demonstrates the ability of our method to
+encode semantic variations.
+
+
+ Item representation learning (IRL) plays an essential role in recommender
+systems, especially for sequential recommendation. Traditional sequential
+recommendation models usually utilize ID embeddings to represent items, which
+are not shared across different domains and lack the transferable ability.
+Recent studies use pre-trained language models (PLM) for item text embeddings
+(text-based IRL) that are universally applicable across domains. However, the
+existing text-based IRL is unaware of the important collaborative filtering
+(CF) information. In this paper, we propose CoWPiRec, an approach of
+Collaborative Word-based Pre-trained item representation for Recommendation. To
+effectively incorporate CF information into text-based IRL, we convert the
+item-level interaction data to a word graph containing word-level
+collaborations. Subsequently, we design a novel pre-training task to align the
+word-level semantic- and CF-related item representation. Extensive experimental
+results on multiple public datasets demonstrate that compared to
+state-of-the-art transferable sequential recommenders, CoWPiRec achieves
+significantly better performances in both fine-tuning and zero-shot settings
+for cross-scenario recommendation and effectively alleviates the cold-start
+issue. The code is available at: https://github.com/ysh-1998/CoWPiRec.
+
+
+
+ comment: Accepted by ICDM 2023
+
+
+
+
+
+
+ ♻ ☆ Shall We Pretrain Autoregressive Language Models with Retrieval? A
+ Comprehensive Study EMNLP 2023
+
+
+
+
+
+
+
+
+ Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro
+
+
+ Large decoder-only language models (LMs) can be largely improved in terms of
+perplexity by retrieval (e.g., RETRO), but its impact on text generation
+quality and downstream task accuracy is unclear. Thus, it is still an open
+question: shall we pretrain large autoregressive LMs with retrieval? To answer
+it, we perform a comprehensive study on a scalable pre-trained
+retrieval-augmented LM (i.e., RETRO) compared with standard GPT and
+retrieval-augmented GPT incorporated at fine-tuning or inference stages. We
+first provide the recipe to reproduce RETRO up to 9.5B parameters while
+retrieving a text corpus with 330B tokens. Based on that, we have the following
+novel findings: i) RETRO outperforms GPT on text generation with much less
+degeneration (i.e., repetition), moderately higher factual accuracy, and
+slightly lower toxicity with a nontoxic retrieval database. ii) On the LM
+Evaluation Harness benchmark, RETRO largely outperforms GPT on
+knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,
+we introduce a simple variant of the model, RETRO++, which largely improves
+open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural
+Question) and significantly outperforms retrieval-augmented GPT in both
+fine-tuning and zero-shot evaluation settings. Our findings highlight the
+promising direction of pretraining autoregressive LMs with retrieval as future
+foundation models. We release our code and model at:
+https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
+
+
+
+ comment: EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Economic Recommender Systems -- A Systematic Review
+
+
+ Many of today's online services provide personalized recommendations to their
+users. Such recommendations are typically designed to serve certain user needs,
+e.g., to quickly find relevant content in situations of information overload.
+Correspondingly, the academic literature in the field largely focuses on the
+value of recommender systems for the end user. In this context, one underlying
+assumption is that the improved service that is achieved through the
+recommendations will in turn positively impact the organization's goals, e.g.,
+in the form of higher customer retention or loyalty. However, in reality,
+recommender systems can be used to target organizational economic goals more
+directly by incorporating monetary considerations such as price awareness and
+profitability aspects into the underlying recommendation models. In this work,
+we survey the existing literature on what we call Economic Recommender Systems
+based on a systematic review approach that helped us identify 133 relevant
+papers. We first categorize existing works along different dimensions and then
+review the most important technical approaches from the literature.
+Furthermore, we discuss common methodologies to evaluate such systems and
+finally outline the limitations of today's research and future directions.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 155
+
+
+
+
+
+ ☆ Quantum Algorithms for the Pathwise Lasso
+
+
+
+
+
+
+
+
+ João F. Doriguello, Debbie Lim, Chi Seng Pun, Patrick Rebentrost, Tushar Vaidya
+
+
+ We present a novel quantum high-dimensional linear regression algorithm with
+an $\ell_1$-penalty based on the classical LARS (Least Angle Regression)
+pathwise algorithm. Similarly to available classical numerical algorithms for
+Lasso, our quantum algorithm provides the full regularisation path as the
+penalty term varies, but quadratically faster per iteration under specific
+conditions. A quadratic speedup on the number of features/predictors $d$ is
+possible by using the simple quantum minimum-finding subroutine from D\"urr and
+Hoyer (arXiv'96) in order to obtain the joining time at each iteration. We then
+improve upon this simple quantum algorithm and obtain a quadratic speedup both
+in the number of features $d$ and the number of observations $n$ by using the
+recent approximate quantum minimum-finding subroutine from Chen and de Wolf
+(ICALP'23). As one of our main contributions, we construct a quantum unitary
+based on quantum amplitude estimation to approximately compute the joining
+times to be searched over by the approximate quantum minimum finding. Since the
+joining times are no longer exactly computed, it is no longer clear that the
+resulting approximate quantum algorithm obtains a good solution. As our second
+main contribution, we prove, via an approximate version of the KKT conditions
+and a duality gap, that the LARS algorithm (and therefore our quantum
+algorithm) is robust to errors. This means that it still outputs a path that
+minimises the Lasso cost function up to a small error if the joining times are
+only approximately computed. Finally, in the model where the observations are
+generated by an underlying linear model with an unknown coefficient vector, we
+prove bounds on the difference between the unknown coefficient vector and the
+approximate Lasso solution, which generalises known results about convergence
+rates in classical statistical learning theory analysis.
+
+
+
+ comment: 44 pages
+
+
+
+
+
+
+ ☆ Fast kernel half-space depth for data with non-convex supports
+
+
+ Data depth is a statistical function that generalizes order and quantiles to
+the multivariate setting and beyond, with applications spanning over
+descriptive and visual statistics, anomaly detection, testing, etc. The
+celebrated halfspace depth exploits data geometry via an optimization program
+to deliver properties of invariances, robustness, and non-parametricity.
+Nevertheless, it implicitly assumes convex data supports and requires
+exponential computational cost. To tackle distribution's multimodality, we
+extend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We
+show that the obtained depth is intuitive and establish its consistency with
+provable concentration bounds that allow for homogeneity testing. The proposed
+depth can be computed using manifold gradient making faster than halfspace
+depth by several orders of magnitude. The performance of our depth is
+demonstrated through numerical simulations as well as applications such as
+anomaly detection on real data and homogeneity testing.
+
+
+
+ comment: 30 pages
+
+
+
+
+
+
+ ☆ Diffusion Reward: Learning Rewards via Conditional Video Diffusion
+
+
+ Learning rewards from expert videos offers an affordable and effective
+solution to specify the intended behaviors for reinforcement learning tasks. In
+this work, we propose Diffusion Reward, a novel framework that learns rewards
+from expert videos via conditional video diffusion models for solving complex
+visual RL problems. Our key insight is that lower generative diversity is
+observed when conditioned on expert trajectories. Diffusion Reward is
+accordingly formalized by the negative of conditional entropy that encourages
+productive exploration of expert-like behaviors. We show the efficacy of our
+method over 10 robotic manipulation tasks from MetaWorld and Adroit with visual
+input and sparse reward. Moreover, Diffusion Reward could even solve unseen
+tasks successfully and effectively, largely surpassing baseline methods.
+Project page and code: https://diffusion-reward.github.io/.
+
+
+
+ comment: Project page and code: https://diffusion-reward.github.io/
+
+
+
+
+
+
+ ☆ WellFactor: Patient Profiling using Integrative Embedding of Healthcare
+ Data
+
+
+
+
+
+
+
+
+ Dongjin Choi, Andy Xiang, Ozgur Ozturk, Deep Shrestha, Barry Drake, Hamid Haidarian, Faizan Javed, Haesun Park
+
+
+ In the rapidly evolving healthcare industry, platforms now have access to not
+only traditional medical records, but also diverse data sets encompassing
+various patient interactions, such as those from healthcare web portals. To
+address this rich diversity of data, we introduce WellFactor: a method that
+derives patient profiles by integrating information from these sources. Central
+to our approach is the utilization of constrained low-rank approximation.
+WellFactor is optimized to handle the sparsity that is often inherent in
+healthcare data. Moreover, by incorporating task-specific label information,
+our method refines the embedding results, offering a more informed perspective
+on patients. One important feature of WellFactor is its ability to compute
+embeddings for new, previously unobserved patient data instantaneously,
+eliminating the need to revisit the entire data set or recomputing the
+embedding. Comprehensive evaluations on real-world healthcare data demonstrate
+WellFactor's effectiveness. It produces better results compared to other
+existing methods in classification performance, yields meaningful clustering of
+patients, and delivers consistent results in patient similarity searches and
+predictions.
+
+
+
+ comment: 2023 IEEE International Conference on Big Data (IEEE BigData 2023)
+
+
+
+
+
+
+ ☆ Learning Human-like Representations to Enable Learning Human Values AAAI 2024
+
+
+
+
+
+
+
+
+ Andrea Wynn, Ilia Sucholutsky, Thomas L. Griffiths
+
+
+ How can we build AI systems that are aligned with human values and objectives
+in order to avoid causing harm or violating societal standards for acceptable
+behavior? Making AI systems learn human-like representations of the world has
+many known benefits, including improving generalization, robustness to domain
+shifts, and few-shot learning performance, among others. We propose that this
+kind of representational alignment between machine learning (ML) models and
+humans is also a necessary condition for value alignment, where ML systems
+conform to human values and societal norms. We focus on ethics as one aspect of
+value alignment and train multiple ML agents (support vector regression and
+kernel regression) in a multi-armed bandit setting, where rewards are sampled
+from a distribution that reflects the morality of the chosen action. We then
+study the relationship between each agent's degree of representational
+alignment with humans and their performance when learning to take the most
+ethical actions.
+
+
+
+ comment: Paper accepted in Human-Centric Representation Learning workshop at
+ AAAI 2024 (https://hcrl-workshop.github.io/2024/)
+
+
+
+
+
+
+ ☆ RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation
+
+
+ Significant research effort has been devoted in recent years to developing
+personalized pricing, promotions, and product recommendation algorithms that
+can leverage rich customer data to learn and earn. Systematic benchmarking and
+evaluation of these causal learning systems remains a critical challenge, due
+to the lack of suitable datasets and simulation environments. In this work, we
+propose a multi-stage model for simulating customer shopping behavior that
+captures important sources of heterogeneity, including price sensitivity and
+past experiences. We embedded this model into a working simulation environment
+-- RetailSynth. RetailSynth was carefully calibrated on publicly available
+grocery data to create realistic synthetic shopping transactions. Multiple
+pricing policies were implemented within the simulator and analyzed for impact
+on revenue, category penetration, and customer retention. Applied researchers
+can use RetailSynth to validate causal demand models for multi-category retail
+and to incorporate realistic price sensitivity into emerging benchmarking
+suites for personalized pricing, promotions, and product recommendations.
+
+
+ Learning-based and data-driven techniques have recently become a subject of
+primary interest in the field of reconstruction and regularization of inverse
+problems. Besides the development of novel methods, yielding excellent results
+in several applications, their theoretical investigation has attracted growing
+interest, e.g., on the topics of reliability, stability, and interpretability.
+In this work, a general framework is described, allowing us to interpret many
+of these techniques in the context of statistical learning. This is not
+intended to provide a complete survey of existing methods, but rather to put
+them in a working perspective, which naturally allows their theoretical
+treatment. The main goal of this dissertation is thereby to address the
+generalization properties of learned reconstruction methods, and specifically
+to perform their sample error analysis. This task, well-developed in
+statistical learning, consists in estimating the dependence of the learned
+operators with respect to the data employed for their training. A rather
+general strategy is proposed, whose assumptions are met for a large class of
+inverse problems and learned methods, as depicted via a selection of examples.
+
+
+ Multi-relational clustering is a challenging task due to the fact that
+diverse semantic information conveyed in multi-layer graphs is difficult to
+extract and fuse. Recent methods integrate topology structure and node
+attribute information through graph filtering. However, they often use a
+low-pass filter without fully considering the correlation among multiple
+graphs. To overcome this drawback, we propose to learn a graph filter motivated
+by the theoretical analysis of Barlow Twins. We find that input with a negative
+semi-definite inner product provides a lower bound for Barlow Twins loss, which
+prevents it from reaching a better solution. We thus learn a filter that yields
+an upper bound for Barlow Twins. Afterward, we design a simple clustering
+architecture and demonstrate its state-of-the-art performance on four benchmark
+datasets.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Weighted least-squares approximation with determinantal point processes
+ and generalized volume sampling
+
+
+ We consider the problem of approximating a function from $L^2$ by an element
+of a given $m$-dimensional space $V_m$, associated with some feature map
+$\varphi$, using evaluations of the function at random points $x_1,\dots,x_n$.
+After recalling some results on optimal weighted least-squares using
+independent and identically distributed points, we consider weighted
+least-squares using projection determinantal point processes (DPP) or volume
+sampling. These distributions introduce dependence between the points that
+promotes diversity in the selected features $\varphi(x_i)$. We first provide a
+generalized version of volume-rescaled sampling yielding quasi-optimality
+results in expectation with a number of samples $n = O(m\log(m))$, that means
+that the expected $L^2$ error is bounded by a constant times the best
+approximation error in $L^2$. Also, further assuming that the function is in
+some normed vector space $H$ continuously embedded in $L^2$, we further prove
+that the approximation is almost surely bounded by the best approximation error
+measured in the $H$-norm. This includes the cases of functions from $L^\infty$
+or reproducing kernel Hilbert spaces. Finally, we present an alternative
+strategy consisting in using independent repetitions of projection DPP (or
+volume sampling), yielding similar error bounds as with i.i.d. or volume
+sampling, but in practice with a much lower number of samples. Numerical
+experiments illustrate the performance of the different strategies.
+
+
+
+
+
+
+
+ ☆ Machine learning and domain decomposition methods -- a survey
+
+
+
+
+
+
+
+
+ Axel Klawonn, Martin Lanser, Janine Weber
+
+
+ Hybrid algorithms, which combine black-box machine learning methods with
+experience from traditional numerical methods and domain expertise from diverse
+application areas, are progressively gaining importance in scientific machine
+learning and various industrial domains, especially in computational science
+and engineering. In the present survey, several promising avenues of research
+will be examined which focus on the combination of machine learning (ML) and
+domain decomposition methods (DDMs). The aim of this survey is to provide an
+overview of existing work within this field and to structure it into domain
+decomposition for machine learning and machine learning-enhanced domain
+decomposition, including: domain decomposition for classical machine learning,
+domain decomposition to accelerate the training of physics-aware neural
+networks, machine learning to enhance the convergence properties or
+computational efficiency of DDMs, and machine learning as a discretization
+method in a DDM for the solution of PDEs. In each of these fields, we summarize
+existing work and key advances within a common framework and, finally, disuss
+ongoing challenges and opportunities for future research.
+
+
+ In the dynamic landscape of online businesses, recommender systems are
+pivotal in enhancing user experiences. While traditional approaches have relied
+on static supervised learning, the quest for adaptive, user-centric
+recommendations has led to the emergence of the formulation of contextual
+bandits. This tutorial investigates the contextual bandits as a powerful
+framework for personalized recommendations. We delve into the challenges,
+advanced algorithms and theories, collaborative strategies, and open challenges
+and future prospects within this field. Different from existing related
+tutorials, (1) we focus on the exploration perspective of contextual bandits to
+alleviate the ``Matthew Effect'' in the recommender systems, i.e., the rich get
+richer and the poor get poorer, concerning the popularity of items; (2) in
+addition to the conventional linear contextual bandits, we will also dedicated
+to neural contextual bandits which have emerged as an important branch in
+recent years, to investigate how neural networks benefit contextual bandits for
+personalized recommendation both empirically and theoretically; (3) we will
+cover the latest topic, collaborative neural contextual bandits, to incorporate
+both user heterogeneity and user correlations customized for recommender
+system; (4) we will provide and discuss the new emerging challenges and open
+questions for neural contextual bandits with applications in the personalized
+recommendation, especially for large neural models.
+
+
+
+
+
+
+
+
+ Sebastian Bieringer, Gregor Kasieczka, Maximilian F. Steffen, Mathias Trabs
+
+
+ Uncertainty estimation is a key issue when considering the application of
+deep neural network methods in science and engineering. In this work, we
+introduce a novel algorithm that quantifies epistemic uncertainty via Monte
+Carlo sampling from a tempered posterior distribution. It combines the well
+established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based
+optimization using Adam and leverages a prolate proposal distribution, to
+efficiently draw from the posterior. We prove that the constructed chain admits
+the Gibbs posterior as an invariant distribution and converges to this Gibbs
+posterior in total variation distance. Numerical evaluations are postponed to a
+first revision.
+
+
+
+ comment: 12 pages
+
+
+
+
+
+
+ ☆ Leveraging Visual Supervision for Array-based Active Speaker Detection
+ and Localization
+
+
+
+
+
+
+
+
+ Davide Berghi, Philip J. B. Jackson
+
+
+ Conventional audio-visual approaches for active speaker detection (ASD)
+typically rely on visually pre-extracted face tracks and the corresponding
+single-channel audio to find the speaker in a video. Therefore, they tend to
+fail every time the face of the speaker is not visible. We demonstrate that a
+simple audio convolutional recurrent neural network (CRNN) trained with spatial
+input features extracted from multichannel audio can perform simultaneous
+horizontal active speaker detection and localization (ASDL), independently of
+the visual modality. To address the time and cost of generating ground truth
+labels to train such a system, we propose a new self-supervised training
+pipeline that embraces a ``student-teacher'' learning approach. A conventional
+pre-trained active speaker detector is adopted as a ``teacher'' network to
+provide the position of the speakers as pseudo-labels. The multichannel audio
+``student'' network is trained to generate the same results. At inference, the
+student network can generalize and locate also the occluded speakers that the
+teacher network is not able to detect visually, yielding considerable
+improvements in recall rate. Experiments on the TragicTalkers dataset show that
+an audio network trained with the proposed self-supervised learning approach
+can exceed the performance of the typical audio-visual methods and produce
+results competitive with the costly conventional supervised training. We
+demonstrate that improvements can be achieved when minimal manual supervision
+is introduced in the learning pipeline. Further gains may be sought with larger
+training sets and integrating vision with the multichannel audio system.
+
+
+ In the field of audio and speech analysis, the ability to identify emotions
+from acoustic signals is essential. Human-computer interaction (HCI) and
+behavioural analysis are only a few of the many areas where the capacity to
+distinguish emotions from speech signals has an extensive range of
+applications. Here, we are introducing BanSpEmo, a corpus of emotional speech
+that only consists of audio recordings and has been created specifically for
+the Bangla language. This corpus contains 792 audio recordings over a duration
+of more than 1 hour and 23 minutes. 22 native speakers took part in the
+recording of two sets of sentences that represent the six desired emotions. The
+data set consists of 12 Bangla sentences which are uttered in 6 emotions as
+Disgust, Happy, Sad, Surprised, Anger, and Fear. This corpus is not also gender
+balanced. Ten individuals who either have experience in related field or have
+acting experience took part in the assessment of this corpus. It has a balanced
+number of audio recordings in each emotion class. BanSpEmo can be considered as
+a useful resource to promote emotion and speech recognition research and
+related applications in the Bangla language. The dataset can be found here:
+https://data.mendeley.com/datasets/rdwn4bs5ky and might be employed for
+academic research.
+
+
+
+
+
+
+
+ ☆ Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian
+ Score Climbing
+
+
+
+
+
+
+
+
+ Hany Abdulsamad, Sahel Iqbal, Adrien Corenflos, Simo Särkkä
+
+
+ Stochastic optimal control of dynamical systems is a crucial challenge in
+sequential decision-making. Recently, control-as-inference approaches have had
+considerable success, providing a viable risk-sensitive framework to address
+the exploration-exploitation dilemma. Nonetheless, a majority of these
+techniques only invoke the inference-control duality to derive a modified risk
+objective that is then addressed within a reinforcement learning framework.
+This paper introduces a novel perspective by framing risk-sensitive stochastic
+control as Markovian score climbing under samples drawn from a conditional
+particle filter. Our approach, while purely inference-centric, provides
+asymptotically unbiased estimates for gradient-based policy optimization with
+optimal importance weighting and no explicit value function learning. To
+validate our methodology, we apply it to the task of learning neural
+non-Gaussian feedback policies, showcasing its efficacy on numerical benchmarks
+of stochastic dynamical systems.
+
+
+
+
+
+
+
+ ☆ Modular Neural Network Policies for Learning In-Flight Object Catching
+ with a Robot Hand-Arm System IROS 2023
+
+
+
+
+
+
+
+
+ Wenbin Hu, Fernando Acero, Eleftherios Triantafyllidis, Zhaocheng Liu, Zhibin Li
+
+
+ We present a modular framework designed to enable a robot hand-arm system to
+learn how to catch flying objects, a task that requires fast, reactive, and
+accurately-timed robot motions. Our framework consists of five core modules:
+(i) an object state estimator that learns object trajectory prediction, (ii) a
+catching pose quality network that learns to score and rank object poses for
+catching, (iii) a reaching control policy trained to move the robot hand to
+pre-catch poses, (iv) a grasping control policy trained to perform soft
+catching motions for safe and robust grasping, and (v) a gating network trained
+to synthesize the actions given by the reaching and grasping policy. The former
+two modules are trained via supervised learning and the latter three use deep
+reinforcement learning in a simulated environment. We conduct extensive
+evaluations of our framework in simulation for each module and the integrated
+system, to demonstrate high success rates of in-flight catching and robustness
+to perturbations and sensory noise. Whilst only simple cylindrical and
+spherical objects are used for training, the integrated system shows successful
+generalization to a variety of household objects that are not used in training.
+
+
+
+ comment: 8 pages. Accepted and presented at IEEE IROS 2023
+
+
+
+
+
+
+ ☆ Rényi Pufferfish Privacy: General Additive Noise Mechanisms and
+ Privacy Amplification by Iteration
+
+
+ Pufferfish privacy is a flexible generalization of differential privacy that
+allows to model arbitrary secrets and adversary's prior knowledge about the
+data. Unfortunately, designing general and tractable Pufferfish mechanisms that
+do not compromise utility is challenging. Furthermore, this framework does not
+provide the composition guarantees needed for a direct use in iterative machine
+learning algorithms. To mitigate these issues, we introduce a R\'enyi
+divergence-based variant of Pufferfish and show that it allows us to extend the
+applicability of the Pufferfish framework. We first generalize the Wasserstein
+mechanism to cover a wide range of noise distributions and introduce several
+ways to improve its utility. We also derive stronger guarantees against
+out-of-distribution adversaries. Finally, as an alternative to composition, we
+prove privacy amplification results for contractive noisy iterations and
+showcase the first use of Pufferfish in private convex optimization. A common
+ingredient underlying our results is the use and extension of shift reduction
+lemmas.
+
+
+
+
+
+
+
+ ☆ Metalearning with Very Few Samples Per Task
+
+
+
+
+
+
+
+
+ Maryam Aliakbarpour, Konstantina Bairaktari, Gavin Brown, Adam Smith, Jonathan Ullman
+
+
+ Metalearning and multitask learning are two frameworks for solving a group of
+related learning tasks more efficiently than we could hope to solve each of the
+individual tasks on their own. In multitask learning, we are given a fixed set
+of related learning tasks and need to output one accurate model per task,
+whereas in metalearning we are given tasks that are drawn i.i.d. from a
+metadistribution and need to output some common information that can be easily
+specialized to new, previously unseen tasks from the metadistribution.
+ In this work, we consider a binary classification setting where tasks are
+related by a shared representation, that is, every task $P$ of interest can be
+solved by a classifier of the form $f_{P} \circ h$ where $h \in H$ is a map
+from features to some representation space that is shared across tasks, and
+$f_{P} \in F$ is a task-specific classifier from the representation space to
+labels. The main question we ask in this work is how much data do we need to
+metalearn a good representation? Here, the amount of data is measured in terms
+of both the number of tasks $t$ that we need to see and the number of samples
+$n$ per task. We focus on the regime where the number of samples per task is
+extremely small. Our main result shows that, in a distribution-free setting
+where the feature vectors are in $\mathbb{R}^d$, the representation is a linear
+map from $\mathbb{R}^d \to \mathbb{R}^k$, and the task-specific classifiers are
+halfspaces in $\mathbb{R}^k$, we can metalearn a representation with error
+$\varepsilon$ using just $n = k+2$ samples per task, and $d \cdot
+(1/\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is
+remarkable because metalearning would be impossible with $k+1$ samples per
+task, and because we cannot even hope to learn an accurate task-specific
+classifier with just $k+2$ samples per task.
+
+
+
+
+
+
+
+ ☆ On Partial Optimal Transport: Revising the Infeasibility of Sinkhorn and
+ Efficient Gradient Methods AAAI 2024
+
+
+ This paper studies the Partial Optimal Transport (POT) problem between two
+unbalanced measures with at most $n$ supports and its applications in various
+AI tasks such as color transfer or domain adaptation. There is hence the need
+for fast approximations of POT with increasingly large problem sizes in arising
+applications. We first theoretically and experimentally investigate the
+infeasibility of the state-of-the-art Sinkhorn algorithm for POT due to its
+incompatible rounding procedure, which consequently degrades its qualitative
+performance in real world applications like point-cloud registration. To this
+end, we propose a novel rounding algorithm for POT, and then provide a feasible
+Sinkhorn procedure with a revised computation complexity of
+$\mathcal{\widetilde O}(n^2/\varepsilon^4)$. Our rounding algorithm also
+permits the development of two first-order methods to approximate the POT
+problem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent
+(APDAGD), finds an $\varepsilon$-approximate solution to the POT problem in
+$\mathcal{\widetilde O}(n^{2.5}/\varepsilon)$, which is better in $\varepsilon$
+than revised Sinkhorn. The second method, Dual Extrapolation, achieves the
+computation complexity of $\mathcal{\widetilde O}(n^2/\varepsilon)$, thereby
+being the best in the literature. We further demonstrate the flexibility of POT
+compared to standard OT as well as the practicality of our algorithms on real
+applications where two marginal distributions are unbalanced.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ PhysRFANet: Physics-Guided Neural Network for Real-Time Prediction of
+ Thermal Effect During Radiofrequency Ablation Treatment
+
+
+
+
+
+
+
+
+ Minwoo Shin, Minjee Seo, Seonaeng Cho, Juil Park, Joon Ho Kwon, Deukhee Lee, Kyungho Yoon
+
+
+ Radiofrequency ablation (RFA) is a widely used minimally invasive technique
+for ablating solid tumors. Achieving precise personalized treatment
+necessitates feedback information on in situ thermal effects induced by the RFA
+procedure. While computer simulation facilitates the prediction of electrical
+and thermal phenomena associated with RFA, its practical implementation in
+clinical settings is hindered by high computational demands. In this paper, we
+propose a physics-guided neural network model, named PhysRFANet, to enable
+real-time prediction of thermal effect during RFA treatment. The networks,
+designed for predicting temperature distribution and the corresponding ablation
+lesion, were trained using biophysical computational models that integrated
+electrostatics, bio-heat transfer, and cell necrosis, alongside magnetic
+resonance (MR) images of breast cancer patients. Validation of the
+computational model was performed through experiments on ex vivo bovine liver
+tissue. Our model demonstrated a 96% Dice score in predicting the lesion volume
+and an RMSE of 0.4854 for temperature distribution when tested with foreseen
+tumor images. Notably, even with unforeseen images, it achieved a 93% Dice
+score for the ablation lesion and an RMSE of 0.6783 for temperature
+distribution. All networks were capable of inferring results within 10 ms. The
+presented technique, applied to optimize the placement of the electrode for a
+specific target region, holds significant promise in enhancing the safety and
+efficacy of RFA treatments.
+
+
+ This paper presents a new supervised representation learning framework,
+namely Structured Probabilistic Coding (SPC), to learn compact and informative
+representations from input related to the target task. SPC is an encoder-only
+probabilistic coding technology with a structured regularization from the
+target label space. By extracting compact and informative representations from
+input related to the target task, SPC can enhance the generalization ability of
+pre-trained language models for better language understanding. Specifically,
+the hidden representation is encoded into a Gaussian distribution space, while
+maximizing the prior entropy of latent representations concerning label space.
+This technique can simultaneously perform information encoding and task
+prediction in one module to more fully utilize the effective information from
+input data, and use variational inference in the output space to reduce
+randomness and uncertainty. To better control the probability distribution in
+the latent space, a structured regularization is proposed to promote
+class-level uniformity in the latent space. With the regularization term, SPC
+can preserve the Gaussian distribution structure of latent code as well as
+better cover the hidden space with class uniformly. We conduct evaluations on
+12 natural language understanding tasks. The results show that our SPC can
+effectively improve the performance of pre-trained language models for various
+classification and regression tasks. Experiments demonstrate that SPC can
+enhance the generalization capability, robustness to label noise, and
+clustering quality of output representations.
+
+
+
+ comment: 11 pages, accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Joint Sensing and Task-Oriented Communications with Image and Wireless
+ Data Modalities for Dynamic Spectrum Access
+
+
+ This paper introduces a deep learning approach to dynamic spectrum access,
+leveraging the synergy of multi-modal image and spectrum data for the
+identification of potential transmitters. We consider an edge device equipped
+with a camera that is taking images of potential objects such as vehicles that
+may harbor transmitters. Recognizing the computational constraints and trust
+issues associated with on-device computation, we propose a collaborative system
+wherein the edge device communicates selectively processed information to a
+trusted receiver acting as a fusion center, where a decision is made to
+identify whether a potential transmitter is present, or not. To achieve this,
+we employ task-oriented communications, utilizing an encoder at the transmitter
+for joint source coding, channel coding, and modulation. This architecture
+efficiently transmits essential information of reduced dimension for object
+classification. Simultaneously, the transmitted signals may reflect off objects
+and return to the transmitter, allowing for the collection of target sensing
+data. Then the collected sensing data undergoes a second round of encoding at
+the transmitter, with the reduced-dimensional information communicated back to
+the fusion center through task-oriented communications. On the receiver side, a
+decoder performs the task of identifying a transmitter by fusing data received
+through joint sensing and task-oriented communications. The two encoders at the
+transmitter and the decoder at the receiver are jointly trained, enabling a
+seamless integration of image classification and wireless signal detection.
+Using AWGN and Rayleigh channel models, we demonstrate the effectiveness of the
+proposed approach, showcasing high accuracy in transmitter identification
+across diverse channel conditions while sustaining low latency in decision
+making.
+
+
+
+
+
+
+
+ ☆ On the convergence of loss and uncertainty-based active learning
+ algorithms
+
+
+
+
+
+
+
+
+ Daniel Haimovich, Dima Karamshuk, Fridolin Linder, Niek Tax, Milan Vojnovic
+
+
+ We study convergence rates of loss and uncertainty-based active learning
+algorithms under various assumptions. First, we provide a set of conditions
+under which a convergence rate guarantee holds, and use this for linear
+classifiers and linearly separable datasets to show convergence rate guarantees
+for loss-based sampling and different loss functions. Second, we provide a
+framework that allows us to derive convergence rate bounds for loss-based
+sampling by deploying known convergence rate bounds for stochastic gradient
+descent algorithms. Third, and last, we propose an active learning algorithm
+that combines sampling of points and stochastic Polyak's step size. We show a
+condition on the sampling that ensures a convergence rate guarantee for this
+algorithm for smooth convex loss functions. Our numerical results demonstrate
+efficiency of our proposed algorithm.
+
+
+
+
+
+
+
+ ☆ Fed-CO$_{2}$: Cooperation of Online and Offline Models for Severe Data
+ Heterogeneity in Federated Learning NeurIPS 2023
+
+
+
+
+
+
+
+
+ Zhongyi Cai, Ye Shi, Wei Huang, Jingya Wang
+
+
+ Federated Learning (FL) has emerged as a promising distributed learning
+paradigm that enables multiple clients to learn a global model collaboratively
+without sharing their private data. However, the effectiveness of FL is highly
+dependent on the quality of the data that is being used for training. In
+particular, data heterogeneity issues, such as label distribution skew and
+feature skew, can significantly impact the performance of FL. Previous studies
+in FL have primarily focused on addressing label distribution skew data
+heterogeneity, while only a few recent works have made initial progress in
+tackling feature skew issues. Notably, these two forms of data heterogeneity
+have been studied separately and have not been well explored within a unified
+FL framework. To address this gap, we propose Fed-CO$_{2}$, a universal FL
+framework that handles both label distribution skew and feature skew within a
+\textbf{C}ooperation mechanism between the \textbf{O}nline and \textbf{O}ffline
+models. Specifically, the online model learns general knowledge that is shared
+among all clients, while the offline model is trained locally to learn the
+specialized knowledge of each individual client. To further enhance model
+cooperation in the presence of feature shifts, we design an intra-client
+knowledge transfer mechanism that reinforces mutual learning between the online
+and offline models, and an inter-client knowledge transfer mechanism to
+increase the models' domain generalization ability. Extensive experiments show
+that our Fed-CO$_{2}$ outperforms a wide range of existing personalized
+federated learning algorithms in terms of handling label distribution skew and
+feature skew, both individually and collectively. The empirical results are
+supported by our convergence analyses in a simplified setting.
+
+
+
+ comment: Accepted by NeurIPS 2023
+
+
+
+
+
+
+ ☆ Multi-Agent Probabilistic Ensembles with Trajectory Sampling for
+ Connected Autonomous Vehicles
+
+
+ Autonomous Vehicles (AVs) have attracted significant attention in recent
+years and Reinforcement Learning (RL) has shown remarkable performance in
+improving the autonomy of vehicles. In that regard, the widely adopted
+Model-Free RL (MFRL) promises to solve decision-making tasks in connected AVs
+(CAVs), contingent on the readiness of a significant amount of data samples for
+training. Nevertheless, it might be infeasible in practice and possibly lead to
+learning instability. In contrast, Model-Based RL (MBRL) manifests itself in
+sample-efficient learning, but the asymptotic performance of MBRL might lag
+behind the state-of-the-art MFRL algorithms. Furthermore, most studies for CAVs
+are limited to the decision-making of a single AV only, thus underscoring the
+performance due to the absence of communications. In this study, we try to
+address the decision-making problem of multiple CAVs with limited
+communications and propose a decentralized Multi-Agent Probabilistic Ensembles
+with Trajectory Sampling algorithm MA-PETS. In particular, in order to better
+capture the uncertainty of the unknown environment, MA-PETS leverages
+Probabilistic Ensemble (PE) neural networks to learn from communicated samples
+among neighboring CAVs. Afterwards, MA-PETS capably develops Trajectory
+Sampling (TS)-based model-predictive control for decision-making. On this
+basis, we derive the multi-agent group regret bound affected by the number of
+agents within the communication range and mathematically validate that
+incorporating effective information exchange among agents into the multi-agent
+learning scheme contributes to reducing the group regret bound in the worst
+case. Finally, we empirically demonstrate the superiority of MA-PETS in terms
+of the sample efficiency comparable to MFBL.
+
+
+
+
+
+
+
+ ☆ EfficientPPS: Part-aware Panoptic Segmentation of Transparent Objects
+ for Robotic Manipulation
+
+
+
+
+
+
+
+
+ Benjamin Alt, Minh Dang Nguyen, Andreas Hermann, Darko Katic, Rainer Jäkel, Rüdiger Dillmann, Eric Sax
+
+
+ The use of autonomous robots for assistance tasks in hospitals has the
+potential to free up qualified staff and im-prove patient care. However, the
+ubiquity of deformable and transparent objects in hospital settings poses
+signif-icant challenges to vision-based perception systems. We present
+EfficientPPS, a neural architecture for part-aware panoptic segmentation that
+provides robots with semantically rich visual information for grasping and
+ma-nipulation tasks. We also present an unsupervised data collection and
+labelling method to reduce the need for human involvement in the training
+process. EfficientPPS is evaluated on a dataset containing real-world hospital
+objects and demonstrated to be robust and efficient in grasping transparent
+transfusion bags with a collaborative robot arm.
+
+
+
+ comment: 8 pages, 8 figures, presented at the 56th International Symposium on
+ Robotics (ISR Europe)
+
+
+
+
+
+
+ ☆ Domain-Specific Fine-Tuning of Large Language Models for Interactive
+ Robot Programming
+
+
+
+
+
+
+
+
+ Benjamin Alt, Urs Keßner, Aleksandar Taranovic, Darko Katic, Andreas Hermann, Rainer Jäkel, Gerhard Neumann
+
+
+ Industrial robots are applied in a widening range of industries, but robot
+programming mostly remains a task limited to programming experts. We propose a
+natural language-based assistant for programming of advanced, industrial
+robotic applications and investigate strategies for domain-specific fine-tuning
+of foundation models with limited data and compute.
+
+
+
+ comment: 5 pages, 1 figure, accepted to the 2024 European Robotics Forum
+
+
+
+
+
+
+ ☆ Comparative Evaluation of Anomaly Detection Methods for Fraud Detection
+ in Online Credit Card Payments
+
+
+
+
+
+
+
+
+ Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan, Fabrice Daniel
+
+
+ This study explores the application of anomaly detection (AD) methods in
+imbalanced learning tasks, focusing on fraud detection using real online credit
+card payment data. We assess the performance of several recent AD methods and
+compare their effectiveness against standard supervised learning methods.
+Offering evidence of distribution shift within our dataset, we analyze its
+impact on the tested models' performances. Our findings reveal that LightGBM
+exhibits significantly superior performance across all evaluated metrics but
+suffers more from distribution shifts than AD methods. Furthermore, our
+investigation reveals that LightGBM also captures the majority of frauds
+detected by AD methods. This observation challenges the potential benefits of
+ensemble methods to combine supervised, and AD approaches to enhance
+performance. In summary, this research provides practical insights into the
+utility of these techniques in real-world scenarios, showing LightGBM's
+superiority in fraud detection while highlighting challenges related to
+distribution shifts.
+
+
+
+ comment: Accepted at ICICT 2024
+
+
+
+
+
+
+ ☆ Capture the Flag: Uncovering Data Insights with Large Language Models NeurIPS 2023
+
+
+
+
+
+
+
+
+ Issam Laradji, Perouz Taslakian, Sai Rajeswar, Valentina Zantedeschi, Alexandre Lacoste, Nicolas Chapados, David Vazquez, Christopher Pal, Alexandre Drouin
+
+
+ The extraction of a small number of relevant insights from vast amounts of
+data is a crucial component of data-driven decision-making. However,
+accomplishing this task requires considerable technical skills, domain
+expertise, and human labor. This study explores the potential of using Large
+Language Models (LLMs) to automate the discovery of insights in data,
+leveraging recent advances in reasoning and code generation techniques. We
+propose a new evaluation methodology based on a "capture the flag" principle,
+measuring the ability of such models to recognize meaningful and pertinent
+information (flags) in a dataset. We further propose two proof-of-concept
+agents, with different inner workings, and compare their ability to capture
+such flags in a real-world sales dataset. While the work reported here is
+preliminary, our results are sufficiently interesting to mandate future
+exploration by the community.
+
+
+
+ comment: 14 pages, 1 figure, Foundation Models for Decision Making Workshop at
+ NeurIPS 2023
+
+
+
+
+
+
+ ☆ Best Arm Identification in Batched Multi-armed Bandit Problems
+
+
+ Recently multi-armed bandit problem arises in many real-life scenarios where
+arms must be sampled in batches, due to limited time the agent can wait for the
+feedback. Such applications include biological experimentation and online
+marketing. The problem is further complicated when the number of arms is large
+and the number of batches is small. We consider pure exploration in a batched
+multi-armed bandit problem. We introduce a general linear programming framework
+that can incorporate objectives of different theoretical settings in best arm
+identification. The linear program leads to a two-stage algorithm that can
+achieve good theoretical properties. We demonstrate by numerical studies that
+the algorithm also has good performance compared to certain UCB-type or
+Thompson sampling methods.
+
+
+
+
+
+
+
+
+ Arthur France-Lanord, Hadrien Vroylandt, Mathieu Salanne, Benjamin Rotenberg, A. Marco Saitta, Fabio Pietrucci
+
+
+ Identifying optimal collective variables to model transformations, using
+atomic-scale simulations, is a long-standing challenge. We propose a new method
+for the generation, optimization, and comparison of collective variables, which
+can be thought of as a data-driven generalization of the path collective
+variable concept. It consists in a kernel ridge regression of the committor
+probability, which encodes a transformation's progress. The resulting
+collective variable is one-dimensional, interpretable, and differentiable,
+making it appropriate for enhanced sampling simulations requiring biasing. We
+demonstrate the validity of the method on two different applications: a
+precipitation model, and the association of Li$^+$ and F$^-$ in water. For the
+former, we show that global descriptors such as the permutation invariant
+vector allow to reach an accuracy far from the one achieved \textit{via}
+simpler, more intuitive variables. For the latter, we show that information
+correlated with the transformation mechanism is contained in the first
+solvation shell only, and that inertial effects prevent the derivation of
+optimal collective variables from the atomic positions only.
+
+
+ Autonomous vehicles ought to predict the surrounding agents' trajectories to
+allow safe maneuvers in uncertain and complex traffic situations. As companies
+increasingly apply trajectory prediction in the real world, security becomes a
+relevant concern. In this paper, we focus on backdoors - a security threat
+acknowledged in other fields but so far overlooked for trajectory prediction.
+To this end, we describe and investigate four triggers that could affect
+trajectory prediction. We then show that these triggers (for example, a braking
+vehicle), when correlated with a desired output (for example, a curve) during
+training, cause the desired output of a state-of-the-art trajectory prediction
+model. In other words, the model has good benign performance but is vulnerable
+to backdoors. This is the case even if the trigger maneuver is performed by a
+non-casual agent behind the target vehicle. As a side-effect, our analysis
+reveals interesting limitations within trajectory prediction models. Finally,
+we evaluate a range of defenses against backdoors. While some, like simple
+offroad checks, do not enable detection for all triggers, clustering is a
+promising candidate to support manual inspection to find backdoors.
+
+
+
+ comment: 9 pages, 7 figures
+
+
+
+
+
+
+ ☆ Statistical learning theory and Occam's razor: The argument from
+ empirical risk minimization
+
+
+ This paper considers the epistemic justification for a simplicity preference
+in inductive inference that may be obtained from the machine learning framework
+of statistical learning theory. Uniting elements from both earlier arguments
+suggesting and rejecting such a justification, the paper spells out a qualified
+means-ends and model-relative justificatory argument, built on statistical
+learning theory's central mathematical learning guarantee for the method of
+empirical risk minimization.
+
+
+
+
+
+
+
+
+ Thomas Norrenbrock, Marco Rudolph, Bodo Rosenhahn
+
+
+ Explanations in Computer Vision are often desired, but most Deep Neural
+Networks can only provide saliency maps with questionable faithfulness.
+Self-Explaining Neural Networks (SENN) extract interpretable concepts with
+fidelity, diversity, and grounding to combine them linearly for
+decision-making. While they can explain what was recognized, initial
+realizations lack accuracy and general applicability. We propose the
+Quantized-Self-Explaining Neural Network Q-SENN. Q-SENN satisfies or exceeds
+the desiderata of SENN while being applicable to more complex datasets and
+maintaining most or all of the accuracy of an uninterpretable baseline model,
+out-performing previous work in all considered metrics. Q-SENN describes the
+relationship between every class and feature as either positive, negative or
+neutral instead of an arbitrary number of possible relations, enforcing more
+binary human-friendly features. Since every class is assigned just 5
+interpretable features on average, Q-SENN shows convincing local and global
+interpretability. Additionally, we propose a feature alignment method, capable
+of aligning learned features with human language-based concepts without
+additional supervision. Thus, what is learned can be more easily verbalized.
+The code is published: https://github.com/ThomasNorr/Q-SENN
+
+
+
+ comment: Accepted to AAAI 2024, SRRAI
+
+
+
+
+
+
+ ☆ Optimized classification with neural ODEs via separability
+
+
+
+
+
+
+
+
+ Antonio Álvarez-López, Rafael Orive-Illera, Enrique Zuazua
+
+
+ Classification of $N$ points becomes a simultaneous control problem when
+viewed through the lens of neural ordinary differential equations (neural
+ODEs), which represent the time-continuous limit of residual networks. For the
+narrow model, with one neuron per hidden layer, it has been shown that the task
+can be achieved using $O(N)$ neurons. In this study, we focus on estimating the
+number of neurons required for efficient cluster-based classification,
+particularly in the worst-case scenario where points are independently and
+uniformly distributed in $[0,1]^d$. Our analysis provides a novel method for
+quantifying the probability of requiring fewer than $O(N)$ neurons, emphasizing
+the asymptotic behavior as both $d$ and $N$ increase. Additionally, under the
+sole assumption that the data are in general position, we propose a new
+constructive algorithm that simultaneously classifies clusters of $d$ points
+from any initial configuration, effectively reducing the maximal complexity to
+$O(N/d)$ neurons.
+
+
+
+ comment: 26 pages, 10 figures
+
+
+
+
+
+
+ ☆ Sparse Training for Federated Learning with Regularized Error Correction
+
+
+ Federated Learning (FL) has attracted much interest due to the significant
+advantages it brings to training deep neural network (DNN) models. However,
+since communications and computation resources are limited, training DNN models
+in FL systems face challenges such as elevated computational and communication
+costs in complex tasks. Sparse training schemes gain increasing attention in
+order to scale down the dimensionality of each client (i.e., node)
+transmission. Specifically, sparsification with error correction methods is a
+promising technique, where only important updates are sent to the parameter
+server (PS) and the rest are accumulated locally. While error correction
+methods have shown to achieve a significant sparsification level of the
+client-to-PS message without harming convergence, pushing sparsity further
+remains unresolved due to the staleness effect. In this paper, we propose a
+novel algorithm, dubbed Federated Learning with Accumulated Regularized
+Embeddings (FLARE), to overcome this challenge. FLARE presents a novel sparse
+training approach via accumulated pulling of the updated models with
+regularization on the embeddings in the FL process, providing a powerful
+solution to the staleness effect, and pushing sparsity to an exceptional level.
+The performance of FLARE is validated through extensive experiments on diverse
+and complex models, achieving a remarkable sparsity level (10 times and more
+beyond the current state-of-the-art) along with significantly improved
+accuracy. Additionally, an open-source software package has been developed for
+the benefit of researchers and developers in related fields.
+
+
+
+
+
+
+
+ ☆ Few Shot Part Segmentation Reveals Compositional Logic for Industrial
+ Anomaly Detection AAAI2024
+
+
+
+
+
+
+
+
+ Soopil Kim, Sion An, Philip Chikontwe, Myeongkyun Kang, Ehsan Adeli, Kilian M. Pohl, Sanghyun Park
+
+
+ Logical anomalies (LA) refer to data violating underlying logical constraints
+e.g., the quantity, arrangement, or composition of components within an image.
+Detecting accurately such anomalies requires models to reason about various
+component types through segmentation. However, curation of pixel-level
+annotations for semantic segmentation is both time-consuming and expensive.
+Although there are some prior few-shot or unsupervised co-part segmentation
+algorithms, they often fail on images with industrial object. These images have
+components with similar textures and shapes, and a precise differentiation
+proves challenging. In this study, we introduce a novel component segmentation
+model for LA detection that leverages a few labeled samples and unlabeled
+images sharing logical constraints. To ensure consistent segmentation across
+unlabeled images, we employ a histogram matching loss in conjunction with an
+entropy loss. As segmentation predictions play a crucial role, we propose to
+enhance both local and global sample validity detection by capturing key
+aspects from visual semantics via three memory banks: class histograms,
+component composition embeddings and patch-level representations. For effective
+LA detection, we propose an adaptive scaling strategy to standardize anomaly
+scores from different memory banks in inference. Extensive experiments on the
+public benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA
+detection vs. 89.6% from competing methods.
+
+
+
+ comment: Accepted at AAAI2024
+
+
+
+
+
+
+ ☆ On Task Performance and Model Calibration with Supervised and
+ Self-Ensembled In-Context Learning
+
+
+
+
+
+
+
+
+ Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, Ivan Vulić
+
+
+ Following the standard supervised fine-tuning (SFT) paradigm, in-context
+learning (ICL) has become an efficient approach propelled by the recent
+advancements in large language models (LLMs), yielding promising performance
+across various tasks in few-shot data setups. However, both paradigms are prone
+to suffer from the critical problem of overconfidence (i.e., miscalibration),
+especially in such limited data setups. In this work, we deliver an in-depth
+analysis of the behavior across different choices of learning methods from the
+perspective of both performance and calibration, as well as their interplay.
+Through extensive controlled experiments, we find that simultaneous gains for
+both task performance and calibration are difficult to achieve, and the problem
+of miscalibration exists across all learning methods in low-resource
+scenarios.To address this challenging trade-off between performance and
+calibration, we then investigate the potential of self-ensembling techniques
+applied at different modeling stages (e.g., variations of in-context examples
+or variations in prompts or different ensembling strategies). We justify the
+feasibility of self-ensembling on SFT in addition to ICL, to make the
+predictions more calibrated and have comparable or even better performance. Our
+work sheds light on which learning paradigm to choose and how to enhance both
+task performance and calibration of LLMs.
+
+
+
+ comment: 9 pages, 4 figures, 5 tables (20 pages, 5 figures, 13 tables
+ including references and appendices)
+
+
+
+
+
+
+ ☆ A Semantic Space is Worth 256 Language Descriptions: Make Stronger
+ Segmentation Models with Descriptive Properties
+
+
+ This paper introduces ProLab, a novel approach using property-level label
+space for creating strong interpretable segmentation models. Instead of relying
+solely on category-specific annotations, ProLab uses descriptive properties
+grounded in common sense knowledge for supervising segmentation models. It is
+based on two core designs. First, we employ Large Language Models (LLMs) and
+carefully crafted prompts to generate descriptions of all involved categories
+that carry meaningful common sense knowledge and follow a structured format.
+Second, we introduce a description embedding model preserving semantic
+correlation across descriptions and then cluster them into a set of descriptive
+properties (e.g., 256) using K-Means. These properties are based on
+interpretable common sense knowledge consistent with theories of human
+recognition. We empirically show that our approach makes segmentation models
+perform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal
+Context, Cityscapes, and BDD). Our method also shows better scalability with
+extended training steps than category-level supervision. Our interpretable
+segmentation framework also emerges with the generalization ability to segment
+out-of-domain or unknown categories using only in-domain descriptive
+properties. Code is available at https://github.com/lambert-x/ProLab.
+
+
+
+ comment: Preprint. Code is available at https://github.com/lambert-x/ProLab
+
+
+
+
+
+
+ ☆ Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed
+ Diffusion Models
+
+
+
+
+
+
+
+
+ Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, Karsten Kreis
+
+
+ Text-guided diffusion models have revolutionized image and video generation
+and have also been successfully used for optimization-based 3D object
+synthesis. Here, we instead focus on the underexplored text-to-4D setting and
+synthesize dynamic, animated 3D objects using score distillation methods with
+an additional temporal dimension. Compared to previous work, we pursue a novel
+compositional generation-based approach, and combine text-to-image,
+text-to-video, and 3D-aware multiview diffusion models to provide feedback
+during 4D object optimization, thereby simultaneously enforcing temporal
+consistency, high-quality visual appearance and realistic geometry. Our method,
+called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with
+deformation fields as 4D representation. Crucial to AYG is a novel method to
+regularize the distribution of the moving 3D Gaussians and thereby stabilize
+the optimization and induce motion. We also propose a motion amplification
+mechanism as well as a new autoregressive synthesis scheme to generate and
+combine multiple 4D sequences for longer generation. These techniques allow us
+to synthesize vivid dynamic scenes, outperform previous work qualitatively and
+quantitatively and achieve state-of-the-art text-to-4D performance. Due to the
+Gaussian 4D representation, different 4D animations can be seamlessly combined,
+as we demonstrate. AYG opens up promising avenues for animation, simulation and
+digital content creation as well as synthetic data generation.
+
+
+ Fault-tolerant deep learning accelerator is the basis for highly reliable
+deep learning processing and critical to deploy deep learning in
+safety-critical applications such as avionics and robotics. Since deep learning
+is known to be computing- and memory-intensive, traditional fault-tolerant
+approaches based on redundant computing will incur substantial overhead
+including power consumption and chip area. To this end, we propose to
+characterize deep learning vulnerability difference across both neurons and
+bits of each neuron, and leverage the vulnerability difference to enable
+selective protection of the deep learning processing components from the
+perspective of architecture layer and circuit layer respectively. At the same
+time, we observe the correlation between model quantization and bit protection
+overhead of the underlying processing elements of deep learning accelerators,
+and propose to reduce the bit protection overhead by adding additional
+quantization constrain without compromising the model accuracy. Finally, we
+employ Bayesian optimization strategy to co-optimize the correlated cross-layer
+design parameters at algorithm layer, architecture layer, and circuit layer to
+minimize the hardware resource consumption while fulfilling multiple user
+constraints including reliability, accuracy, and performance of the deep
+learning processing at the same time.
+
+
+
+ comment: 16 pages, it has been presented at CCF-DAC 2023 while CCF-DAC does
+ not own the copyright
+
+ Recent advancements in offline reinforcement learning (RL) have underscored
+the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm
+that learns the action distribution based on target returns for each state in a
+supervised manner. However, prevailing RCSL methods largely focus on
+deterministic trajectory modeling, disregarding stochastic state transitions
+and the diversity of future trajectory distributions. A fundamental challenge
+arises from the inconsistency between the sampled returns within individual
+trajectories and the expected returns across multiple trajectories.
+Fortunately, value-based methods offer a solution by leveraging a value
+function to approximate the expected returns, thereby addressing the
+inconsistency effectively. Building upon these insights, we propose a novel
+approach, termed the Critic-Guided Decision Transformer (CGDT), which combines
+the predictability of long-term returns from value-based methods with the
+trajectory modeling capability of the Decision Transformer. By incorporating a
+learned value function, known as the critic, CGDT ensures a direct alignment
+between the specified target returns and the expected returns of actions. This
+integration bridges the gap between the deterministic nature of RCSL and the
+probabilistic characteristics of value-based methods. Empirical evaluations on
+stochastic environments and D4RL benchmark datasets demonstrate the superiority
+of CGDT over traditional RCSL methods. These results highlight the potential of
+CGDT to advance the state of the art in offline RL and extend the applicability
+of RCSL to a wide range of RL tasks.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ☆ A Learning oriented DLP System based on Classification Model
+
+
+ Data is the key asset for organizations and data sharing is lifeline for
+organization growth; which may lead to data loss. Data leakage is the most
+critical issue being faced by organizations. In order to mitigate the data
+leakage issues data leakage prevention systems (DLPSs) are deployed at various
+levels by the organizations. DLPSs are capable to protect all kind of data i.e.
+DAR, DIM/DIT, DIU. Statistical analysis, regular expression, data
+fingerprinting are common approaches exercised in DLP system. Out of these
+techniques; statistical analysis approach is most appropriate for proposed DLP
+model of data security. This paper defines a statistical DLP model for document
+classification. Model uses various statistical approaches like TF-IDF (Term
+Frequency- Inverse Document Frequency) a renowned term count/weighing function,
+Vectorization, Gradient boosting document classification etc. to classify the
+documents before allowing any access to it. Machine learning is used to test
+and train the model. Proposed model also introduces an extremely efficient and
+more accurate approach; IGBCA (Improvised Gradient Boosting Classification
+Algorithm); for document classification, to prevent them from possible data
+leakage. Results depicts that proposed model can classify documents with high
+accuracy and on basis of which data can be prevented from being loss.
+
+
+
+
+
+
+
+ ☆ A Forecasting-Based DLP Approach for Data Security
+
+
+ Sensitive data leakage is the major growing problem being faced by
+enterprises in this technical era. Data leakage causes severe threats for
+organization of data safety which badly affects the reputation of
+organizations. Data leakage is the flow of sensitive data/information from any
+data holder to an unauthorized destination. Data leak prevention (DLP) is set
+of techniques that try to alleviate the threats which may hinder data security.
+DLP unveils guilty user responsible for data leakage and ensures that user
+without appropriate permission cannot access sensitive data and also provides
+protection to sensitive data if sensitive data is shared accidentally. In this
+paper, data leakage prevention (DLP) model is used to restrict/grant data
+access permission to user, based on the forecast of their access to data. This
+study provides a DLP solution using data statistical analysis to forecast the
+data access possibilities of any user in future based on the access to data in
+the past. The proposed approach makes use of renowned simple piecewise linear
+function for learning/training to model. The results show that the proposed DLP
+approach with high level of precision can correctly classify between users even
+in cases of extreme data access.
+
+
+
+
+
+
+
+
+ Kamil Deja, Bartosz Cywiński, Jan Rybarczyk, Tomasz Trzciński
+
+
+ In this work, we introduce Adapt & Align, a method for continual learning of
+neural networks by aligning latent representations in generative models. Neural
+Networks suffer from abrupt loss in performance when retrained with additional
+training data from different distributions. At the same time, training with
+additional data without access to the previous examples rarely improves the
+model's performance. In this work, we propose a new method that mitigates those
+problems by employing generative models and splitting the process of their
+update into two parts. In the first one, we train a local generative model
+using only data from a new task. In the second phase, we consolidate latent
+representations from the local model with a global one that encodes knowledge
+of all past experiences. We introduce our approach with Variational
+Auteoncoders and Generative Adversarial Networks. Moreover, we show how we can
+use those generative models as a general method for continual knowledge
+consolidation that can be used in downstream tasks such as classification.
+
+
+
+
+
+
+
+ ☆ Parallel Trust-Region Approaches in Neural Network Training: Beyond
+ Traditional Methods
+
+
+
+
+
+
+
+
+ Ken Trotti, Samuel A. Cruz Alegría, Alena Kopaničáková, Rolf Krause
+
+
+ We propose to train neural networks (NNs) using a novel variant of the
+``Additively Preconditioned Trust-region Strategy'' (APTS). The proposed method
+is based on a parallelizable additive domain decomposition approach applied to
+the neural network's parameters. Built upon the TR framework, the APTS method
+ensures global convergence towards a minimizer. Moreover, it eliminates the
+need for computationally expensive hyper-parameter tuning, as the TR algorithm
+automatically determines the step size in each iteration. We demonstrate the
+capabilities, strengths, and limitations of the proposed APTS training method
+by performing a series of numerical experiments. The presented numerical study
+includes a comparison with widely used training methods such as SGD, Adam,
+LBFGS, and the standard TR method.
+
+
+
+
+
+
+
+ ☆ Text2Analysis: A Benchmark of Table Question Answering with Advanced
+ Data Analysis and Unclear Queries AAAI'2024
+
+
+
+
+
+
+
+
+ Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen, Shi Han, Zejian Yuan, Dongmei Zhang
+
+
+ Tabular data analysis is crucial in various fields, and large language models
+show promise in this area. However, current research mostly focuses on
+rudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like
+forecasting and chart generation. To address this gap, we developed the
+Text2Analysis benchmark, incorporating advanced analysis tasks that go beyond
+the SQL-compatible operations and require more in-depth analysis. We also
+develop five innovative and effective annotation methods, harnessing the
+capabilities of large language models to enhance data quality and quantity.
+Additionally, we include unclear queries that resemble real-world user
+questions to test how well models can understand and tackle such challenges.
+Finally, we collect 2249 query-result pairs with 347 tables. We evaluate five
+state-of-the-art models using three different metrics and the results show that
+our benchmark presents introduces considerable challenge in the field of
+tabular data analysis, paving the way for more advanced research opportunities.
+
+
+
+ comment: Accepted by AAAI'2024
+
+
+
+
+
+
+ ☆ Distributed Quantum Neural Networks via Partitioned Features Encoding
+
+
+ Quantum neural networks are expected to be a promising application in
+near-term quantum computation, but face challenges such as vanishing gradients
+during optimization and limited expressibility by a limited number of qubits
+and shallow circuits. To mitigate these challenges, distributed quantum neural
+networks have been proposed to make a prediction by approximating a large
+circuit with multiple small circuits. However, the approximation of a large
+circuit requires an exponential number of small circuit evaluations. Here, we
+instead propose to distribute partitioned features over multiple small quantum
+neural networks and use the ensemble of their expectation values to generate
+predictions. To verify our distributed approach, we demonstrate multi-class
+classifications of handwritten digit datasets. Especially for the MNIST
+dataset, we succeeded in ten class classifications of the dataset with
+exceeding 96% accuracy. Our proposed method not only achieved highly accurate
+predictions for a large dataset but also reduced the hardware requirements for
+each quantum neural network compared to a single quantum neural network. Our
+results highlight distributed quantum neural networks as a promising direction
+for practical quantum machine learning algorithms compatible with near-term
+quantum devices. We hope that our approach is useful for exploring quantum
+machine learning applications.
+
+
+
+ comment: 9 pages, 2 figures, 2 tables
+
+
+
+
+
+
+ ☆ ProvFL: Client-Driven Interpretability of Global Model Predictions in
+ Federated Learning
+
+
+
+
+
+
+
+
+ Waris Gill, Ali Anwar, Muhammad Ali Gulzar
+
+
+ Federated Learning (FL) trains a collaborative machine learning model by
+aggregating multiple privately trained clients' models over several training
+rounds. Such a long, continuous action of model aggregations poses significant
+challenges in reasoning about the origin and composition of such a global
+model. Regardless of the quality of the global model or if it has a fault,
+understanding the model's origin is equally important for debugging,
+interpretability, and explainability in federated learning. FL application
+developers often question: (1) what clients contributed towards a global model
+and (2) if a global model predicts a label, which clients are responsible for
+it?
+ We introduce, neuron provenance, a fine-grained lineage capturing mechanism
+that tracks the flow of information between the individual participating
+clients in FL and the final global model. We operationalize this concept in
+ProvFL that functions on two key principles. First, recognizing that monitoring
+every neuron of every client's model statically is ineffective and noisy due to
+the uninterpretable nature of individual neurons, ProvFL dynamically isolates
+influential and sensitive neurons in the global model, significantly reducing
+the search space. Second, as multiple clients' models are fused in each round
+to form a global model, tracking each client's contribution becomes
+challenging. ProvFL leverages the invertible nature of fusion algorithms to
+precisely isolate each client's contribution derived from selected neurons.
+When asked to localize the clients responsible for the given behavior (i.e.,
+prediction) of the global model, ProvFL successfully localizes them with an
+average provenance accuracy of 97%. Additionally, ProvFL outperforms the
+state-of-the-art FL fault localization approach by an average margin of 50%.
+
+
+
+ comment: 22 pages. For access to the source code used in this study, please
+ contact the authors directly
+
+
+
+
+
+
+ ☆ MFABA: A More Faithful and Accelerated Boundary-based Attribution Method
+ for Deep Neural Networks AAAI
+
+
+ To better understand the output of deep neural networks (DNN), attribution
+based methods have been an important approach for model interpretability, which
+assign a score for each input dimension to indicate its importance towards the
+model outcome. Notably, the attribution methods use the axioms of sensitivity
+and implementation invariance to ensure the validity and reliability of
+attribution results. Yet, the existing attribution methods present challenges
+for effective interpretation and efficient computation. In this work, we
+introduce MFABA, an attribution algorithm that adheres to axioms, as a novel
+method for interpreting DNN. Additionally, we provide the theoretical proof and
+in-depth analysis for MFABA algorithm, and conduct a large scale experiment.
+The results demonstrate its superiority by achieving over 101.5142 times faster
+speed than the state-of-the-art attribution algorithms. The effectiveness of
+MFABA is thoroughly evaluated through the statistical analysis in comparison to
+other methods, and the full implementation package is open-source at:
+https://github.com/LMBTough/MFABA
+
+
+
+ comment: Accepted by The 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI-24)
+
+
+
+
+
+
+ ☆ Where and How to Attack? A Causality-Inspired Recipe for Generating
+ Counterfactual Adversarial Examples AAAI-2024
+
+
+ Deep neural networks (DNNs) have been demonstrated to be vulnerable to
+well-crafted \emph{adversarial examples}, which are generated through either
+well-conceived $\mathcal{L}_p$-norm restricted or unrestricted attacks.
+Nevertheless, the majority of those approaches assume that adversaries can
+modify any features as they wish, and neglect the causal generating process of
+the data, which is unreasonable and unpractical. For instance, a modification
+in income would inevitably impact features like the debt-to-income ratio within
+a banking system. By considering the underappreciated causal generating
+process, first, we pinpoint the source of the vulnerability of DNNs via the
+lens of causality, then give theoretical results to answer \emph{where to
+attack}. Second, considering the consequences of the attack interventions on
+the current state of the examples to generate more realistic adversarial
+examples, we propose CADE, a framework that can generate
+\textbf{C}ounterfactual \textbf{AD}versarial \textbf{E}xamples to answer
+\emph{how to attack}. The empirical results demonstrate CADE's effectiveness,
+as evidenced by its competitive performance across diverse attack scenarios,
+including white-box, transfer-based, and random intervention attacks.
+
+
+
+ comment: Accepted by AAAI-2024
+
+
+
+
+
+
+ ☆ Navigating the Structured What-If Spaces: Counterfactual Generation via
+ Structured Diffusion
+
+
+ Generating counterfactual explanations is one of the most effective
+approaches for uncovering the inner workings of black-box neural network models
+and building user trust. While remarkable strides have been made in generative
+modeling using diffusion models in domains like vision, their utility in
+generating counterfactual explanations in structured modalities remains
+unexplored. In this paper, we introduce Structured Counterfactual Diffuser or
+SCD, the first plug-and-play framework leveraging diffusion for generating
+counterfactual explanations in structured data. SCD learns the underlying data
+distribution via a diffusion model which is then guided at test time to
+generate counterfactuals for any arbitrary black-box model, input, and desired
+prediction. Our experiments show that our counterfactuals not only exhibit high
+plausibility compared to the existing state-of-the-art but also show
+significantly better proximity and diversity.
+
+
+
+ comment: 13 pages
+
+
+
+
+
+
+ ☆ Structure-Aware Path Inference for Neural Finite State Transducers NeurIPS 2023
+
+
+ Neural finite-state transducers (NFSTs) form an expressive family of
+neurosymbolic sequence transduction models. An NFST models each string pair as
+having been generated by a latent path in a finite-state transducer. As they
+are deep generative models, both training and inference of NFSTs require
+inference networks that approximate posterior distributions over such latent
+variables. In this paper, we focus on the resulting challenge of imputing the
+latent alignment path that explains a given pair of input and output strings
+(e.g., during training). We train three autoregressive approximate models for
+amortized inference of the path, which can then be used as proposal
+distributions for importance sampling. All three models perform lookahead. Our
+most sophisticated (and novel) model leverages the FST structure to consider
+the graph of future paths; unfortunately, we find that it loses out to the
+simpler approaches -- except on an artificial task that we concocted to confuse
+the simpler approaches.
+
+
+
+ comment: In Proceedings of ICBINB Workshop at NeurIPS 2023
+
+
+
+
+
+
+
+ Zheshun Wu, Zenglin Xu, Dun Zeng, Junfan Li, Jie Liu
+
+
+ With the proliferation of intelligent mobile devices in wireless
+device-to-device (D2D) networks, decentralized federated learning (DFL) has
+attracted significant interest. Compared to centralized federated learning
+(CFL), DFL mitigates the risk of central server failures due to communication
+bottlenecks. However, DFL faces several challenges, such as the severe
+heterogeneity of data distributions in diverse environments, and the
+transmission outages and package errors caused by the adoption of the User
+Datagram Protocol (UDP) in D2D networks. These challenges often degrade the
+convergence of training DFL models. To address these challenges, we conduct a
+thorough theoretical convergence analysis for DFL and derive a convergence
+bound. By defining a novel quantity named unreliable links-aware neighborhood
+discrepancy in this convergence bound, we formulate a tractable optimization
+objective, and develop a novel Topology Learning method considering the
+Representation Discrepancy and Unreliable Links in DFL, named ToLRDUL.
+Intensive experiments under both feature skew and label skew settings have
+validated the effectiveness of our proposed method, demonstrating improved
+convergence speed and test accuracy, consistent with our theoretical findings.
+
+
+
+
+
+
+
+ ☆ Peer-to-Peer Learning + Consensus with Non-IID Data
+
+
+
+
+
+
+
+
+ Srinivasa Pranav, José M. F. Moura
+
+
+ Peer-to-peer deep learning algorithms are enabling distributed edge devices
+to collaboratively train deep neural networks without exchanging raw training
+data or relying on a central server. Peer-to-Peer Learning (P2PL) and other
+algorithms based on Distributed Local-Update Stochastic/mini-batch Gradient
+Descent (local DSGD) rely on interleaving epochs of training with distributed
+consensus steps. This process leads to model parameter drift/divergence amongst
+participating devices in both IID and non-IID settings. We observe that model
+drift results in significant oscillations in test performance evaluated after
+local training and consensus phases. We then identify factors that amplify
+performance oscillations and demonstrate that our novel approach, P2PL with
+Affinity, dampens test performance oscillations in non-IID settings without
+incurring any additional communication cost.
+
+
+
+ comment: Asilomar Conference on Signals, Systems, and Computers 2023
+ Camera-Ready Version
+
+
+
+
+
+
+ ☆ Anchoring Path for Inductive Relation Prediction in Knowledge Graphs
+
+
+
+
+
+
+
+
+ Zhixiang Su, Di Wang, Chunyan Miao, Lizhen Cui
+
+
+ Aiming to accurately predict missing edges representing relations between
+entities, which are pervasive in real-world Knowledge Graphs (KGs), relation
+prediction plays a critical role in enhancing the comprehensiveness and utility
+of KGs. Recent research focuses on path-based methods due to their inductive
+and explainable properties. However, these methods face a great challenge when
+lots of reasoning paths do not form Closed Paths (CPs) in the KG. To address
+this challenge, we propose Anchoring Path Sentence Transformer (APST) by
+introducing Anchoring Paths (APs) to alleviate the reliance of CPs.
+Specifically, we develop a search-based description retrieval method to enrich
+entity descriptions and an assessment mechanism to evaluate the rationality of
+APs. APST takes both APs and CPs as the inputs of a unified Sentence
+Transformer architecture, enabling comprehensive predictions and high-quality
+explanations. We evaluate APST on three public datasets and achieve
+state-of-the-art (SOTA) performance in 30 of 36 transductive, inductive, and
+few-shot experimental settings.
+
+
+
+
+
+
+
+
+ Harsha Vardhan Tetali, Joel B. Harley, Benjamin D. Haeffele
+
+
+ With the recent success of representation learning methods, which includes
+deep learning as a special case, there has been considerable interest in
+developing techniques that incorporate known physical constraints into the
+learned representation. As one example, in many applications that involve a
+signal propagating through physical media (e.g., optics, acoustics, fluid
+dynamics, etc), it is known that the dynamics of the signal must satisfy
+constraints imposed by the wave equation. Here we propose a matrix
+factorization technique that decomposes such signals into a sum of components,
+where each component is regularized to ensure that it {nearly} satisfies wave
+equation constraints. Although our proposed formulation is non-convex, we prove
+that our model can be efficiently solved to global optimality. Through this
+line of work we establish theoretical connections between wave-informed
+learning and filtering theory in signal processing. We further demonstrate the
+application of this work on modal analysis problems commonly arising in
+structural diagnostics and prognostics.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2107.09144
+
+ Recently, the paradigm of pre-training and fine-tuning graph neural networks
+has been intensively studied and applied in a wide range of graph mining tasks.
+Its success is generally attributed to the structural consistency between
+pre-training and downstream datasets, which, however, does not hold in many
+real-world scenarios. Existing works have shown that the structural divergence
+between pre-training and downstream graphs significantly limits the
+transferability when using the vanilla fine-tuning strategy. This divergence
+leads to model overfitting on pre-training graphs and causes difficulties in
+capturing the structural properties of the downstream graphs. In this paper, we
+identify the fundamental cause of structural divergence as the discrepancy of
+generative patterns between the pre-training and downstream graphs.
+Furthermore, we propose G-Tuning to preserve the generative patterns of
+downstream graphs. Given a downstream graph G, the core idea is to tune the
+pre-trained GNN so that it can reconstruct the generative patterns of G, the
+graphon W. However, the exact reconstruction of a graphon is known to be
+computationally expensive. To overcome this challenge, we provide a theoretical
+analysis that establishes the existence of a set of alternative graphons called
+graphon bases for any given graphon. By utilizing a linear combination of these
+graphon bases, we can efficiently approximate W. This theoretical finding forms
+the basis of our proposed model, as it enables effective learning of the
+graphon bases and their associated coefficients. Compared with existing
+algorithms, G-Tuning demonstrates an average improvement of 0.5% and 2.6% on
+in-domain and out-of-domain transfer learning experiments, respectively.
+
+
+ Network binarization exhibits great potential for deployment on
+resource-constrained devices due to its low computational cost. Despite the
+critical importance, the security of binarized neural networks (BNNs) is rarely
+investigated. In this paper, we present ARBiBench, a comprehensive benchmark to
+evaluate the robustness of BNNs against adversarial perturbations on CIFAR-10
+and ImageNet. We first evaluate the robustness of seven influential BNNs on
+various white-box and black-box attacks. The results reveal that 1) The
+adversarial robustness of BNNs exhibits a completely opposite performance on
+the two datasets under white-box attacks. 2) BNNs consistently exhibit better
+adversarial robustness under black-box attacks. 3) Different BNNs exhibit
+certain similarities in their robustness performance. Then, we conduct
+experiments to analyze the adversarial robustness of BNNs based on these
+insights. Our research contributes to inspiring future research on enhancing
+the robustness of BNNs and advancing their application in real-world scenarios.
+
+
+ This paper investigates the impact of using gradient norm reward signals in
+the context of Automatic Curriculum Learning (ACL) for deep reinforcement
+learning (DRL). We introduce a framework where the teacher model, utilizing the
+gradient norm information of a student model, dynamically adapts the learning
+curriculum. This approach is based on the hypothesis that gradient norms can
+provide a nuanced and effective measure of learning progress. Our experimental
+setup involves several reinforcement learning environments (PointMaze, AntMaze,
+and AdroitHandRelocate), to assess the efficacy of our method. We analyze how
+gradient norm rewards influence the teacher's ability to craft challenging yet
+achievable learning sequences, ultimately enhancing the student's performance.
+Our results show that this approach not only accelerates the learning process
+but also leads to improved generalization and adaptability in complex tasks.
+The findings underscore the potential of gradient norm signals in creating more
+efficient and robust ACL systems, opening new avenues for research in
+curriculum learning and reinforcement learning.
+
+
+
+ comment: 11 pages, 15 figures
+
+
+
+
+
+
+ ☆ The Truth is in There: Improving Reasoning in Language Models with
+ Layer-Selective Rank Reduction
+
+
+
+
+
+
+
+
+ Pratyusha Sharma, Jordan T. Ash, Dipendra Misra
+
+
+ Transformer-based Large Language Models (LLMs) have become a fixture in
+modern machine learning. Correspondingly, significant resources are allocated
+towards research that aims to further advance this technology, typically
+resulting in models of increasing size that are trained on increasing amounts
+of data. This work, however, demonstrates the surprising result that it is
+often possible to significantly improve the performance of LLMs by selectively
+removing higher-order components of their weight matrices. This simple
+intervention, which we call LAyer-SElective Rank reduction (LASER), can be done
+on a model after training has completed, and requires no additional parameters
+or data. We show extensive experiments demonstrating the generality of this
+finding across language models and datasets, and provide in-depth analyses
+offering insights into both when LASER is effective and the mechanism by which
+it operates.
+
+
+ The capacity to generalize to future unseen data stands as one of the utmost
+crucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM)
+aims to enhance the generalizability by minimizing worst-case loss using
+one-step gradient ascent as an approximation. However, as training progresses,
+the non-linearity of the loss landscape increases, rendering one-step gradient
+ascent less effective. On the other hand, multi-step gradient ascent will incur
+higher training cost. In this paper, we introduce a normalized Hessian trace to
+accurately measure the curvature of loss landscape on {\em both} training and
+test sets. In particular, to counter excessive non-linearity of loss landscape,
+we propose Curvature Regularized SAM (CR-SAM), integrating the normalized
+Hessian trace as a SAM regularizer. Additionally, we present an efficient way
+to compute the trace via finite differences with parallelism. Our theoretical
+analysis based on PAC-Bayes bounds establishes the regularizer's efficacy in
+reducing generalization error. Empirical evaluation on CIFAR and ImageNet
+datasets shows that CR-SAM consistently enhances classification performance for
+ResNet and Vision Transformer (ViT) models across various datasets. Our code is
+available at https://github.com/TrustAIoT/CR-SAM.
+
+
+ Despite the remarkable accomplishments of graph neural networks (GNNs), they
+typically rely on task-specific labels, posing potential challenges in terms of
+their acquisition. Existing work have been made to address this issue through
+the lens of unsupervised domain adaptation, wherein labeled source graphs are
+utilized to enhance the learning process for target data. However, the
+simultaneous exploration of graph topology and reduction of domain disparities
+remains a substantial hurdle. In this paper, we introduce the Dual Adversarial
+Graph Representation Learning (DAGRL), which explore the graph topology from
+dual branches and mitigate domain discrepancies via dual adversarial learning.
+Our method encompasses a dual-pronged structure, consisting of a graph
+convolutional network branch and a graph kernel branch, which enables us to
+capture graph semantics from both implicit and explicit perspectives. Moreover,
+our approach incorporates adaptive perturbations into the dual branches, which
+align the source and target distribution to address domain discrepancies.
+Extensive experiments on a wild range graph classification datasets demonstrate
+the effectiveness of our proposed method.
+
+
+
+
+
+
+
+ ☆ HW-V2W-Map: Hardware Vulnerability to Weakness Mapping Framework for
+ Root Cause Analysis with GPT-assisted Mitigation Suggestion
+
+
+ The escalating complexity of modern computing frameworks has resulted in a
+surge in the cybersecurity vulnerabilities reported to the National
+Vulnerability Database (NVD) by practitioners. Despite the fact that the
+stature of NVD is one of the most significant databases for the latest insights
+into vulnerabilities, extracting meaningful trends from such a large amount of
+unstructured data is still challenging without the application of suitable
+technological methodologies. Previous efforts have mostly concentrated on
+software vulnerabilities; however, a holistic strategy incorporates approaches
+for mitigating vulnerabilities, score prediction, and a knowledge-generating
+system that may extract relevant insights from the Common Weakness Enumeration
+(CWE) and Common Vulnerability Exchange (CVE) databases is notably absent. As
+the number of hardware attacks on Internet of Things (IoT) devices continues to
+rapidly increase, we present the Hardware Vulnerability to Weakness Mapping
+(HW-V2W-Map) Framework, which is a Machine Learning (ML) framework focusing on
+hardware vulnerabilities and IoT security. The architecture that we have
+proposed incorporates an Ontology-driven Storytelling framework, which
+automates the process of updating the ontology in order to recognize patterns
+and evolution of vulnerabilities over time and provides approaches for
+mitigating the vulnerabilities. The repercussions of vulnerabilities can be
+mitigated as a result of this, and conversely, future exposures can be
+predicted and prevented. Furthermore, our proposed framework utilized
+Generative Pre-trained Transformer (GPT) Large Language Models (LLMs) to
+provide mitigation suggestions.
+
+
+ Various methods have been proposed to secure access to sensitive information
+over time, such as the many cryptographic methods in use to facilitate secure
+communications on the internet. But other methods like steganography have been
+overlooked which may be more suitable in cases where the act of transmission of
+sensitive information itself should remain a secret. Multiple techniques that
+are commonly discussed for such scenarios suffer from low capacity and high
+distortion in the output signal. This research introduces a novel
+steganographic approach for concealing a confidential portable document format
+(PDF) document within a host image by employing the Hybrid Firefly algorithm
+(HFA) proposed to select the pixel arrangement. This algorithm combines two
+widely used optimization algorithms to improve their performance. The suggested
+methodology utilizes the HFA algorithm to conduct a search for optimal pixel
+placements in the spatial domain. The purpose of this search is to accomplish
+two main goals: increasing the host image's capacity and reducing distortion.
+Moreover, the proposed approach intends to reduce the time required for the
+embedding procedure. The findings indicate a decrease in image distortion and
+an accelerated rate of convergence in the search process. The resultant
+embeddings exhibit robustness against steganalytic assaults, hence rendering
+the identification of the embedded data a formidable undertaking.
+
+
+
+
+
+
+
+ ☆ Symmetry-enforcing neural networks with applications to constitutive
+ modeling
+
+
+ The use of machine learning techniques to homogenize the effective behavior
+of arbitrary microstructures has been shown to be not only efficient but also
+accurate. In a recent work, we demonstrated how to combine state-of-the-art
+micromechanical modeling and advanced machine learning techniques to homogenize
+complex microstructures exhibiting non-linear and history dependent behaviors.
+The resulting homogenized model, termed smart constitutive law (SCL), enables
+the adoption of microstructurally informed constitutive laws into finite
+element solvers at a fraction of the computational cost required by traditional
+concurrent multiscale approaches. In this work, the capabilities of SCLs are
+expanded via the introduction of a novel methodology that enforces material
+symmetries at the neuron level, applicable across various neural network
+architectures. This approach utilizes tensor-based features in neural networks,
+facilitating the concise and accurate representation of symmetry-preserving
+operations, and is general enough to be extend to problems beyond constitutive
+modeling. Details on the construction of these tensor-based neural networks and
+their application in learning constitutive laws are presented for both elastic
+and inelastic materials. The superiority of this approach over traditional
+neural networks is demonstrated in scenarios with limited data and strong
+symmetries, through comprehensive testing on various materials, including
+isotropic neo-Hookean materials and tensegrity lattice metamaterials. This work
+is concluded by a discussion on the potential of this methodology to discover
+symmetry bases in materials and by an outline of future research directions.
+
+
+
+
+
+
+
+ ☆ Multimodal Federated Learning with Missing Modality via Prototype Mask
+ and Contrast
+
+
+ In real-world scenarios, multimodal federated learning often faces the
+practical challenge of intricate modality missing, which poses constraints on
+building federated frameworks and significantly degrades model inference
+accuracy. Existing solutions for addressing missing modalities generally
+involve developing modality-specific encoders on clients and training modality
+fusion modules on servers. However, these methods are primarily constrained to
+specific scenarios with either unimodal clients or complete multimodal clients,
+struggling to generalize effectively in the intricate modality missing
+scenarios. In this paper, we introduce a prototype library into the
+FedAvg-based Federated Learning framework, thereby empowering the framework
+with the capability to alleviate the global model performance degradation
+resulting from modality missing during both training and testing. The proposed
+method utilizes prototypes as masks representing missing modalities to
+formulate a task-calibrated training loss and a model-agnostic uni-modality
+inference strategy. In addition, a proximal term based on prototypes is
+constructed to enhance local training. Experimental results demonstrate the
+state-of-the-art performance of our approach. Compared to the baselines, our
+method improved inference accuracy by 3.7\% with 50\% modality missing during
+training and by 23.8\% during uni-modality inference. Code is available at
+https://github.com/BaoGuangYin/PmcmFL.
+
+
+
+ comment: 17 pages
+
+
+
+
+
+
+ ☆ DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias
+ Correction) AAAI
+
+
+ The Adam optimizer is a popular choice in contemporary deep learning, due to
+its strong empirical performance. However we observe that in privacy sensitive
+scenarios, the traditional use of Differential Privacy (DP) with the Adam
+optimizer leads to sub-optimal performance on several tasks. We find that this
+performance degradation is due to a DP bias in Adam's second moment estimator,
+introduced by the addition of independent noise in the gradient computation to
+enforce DP guarantees. This DP bias leads to a different scaling for low
+variance parameter updates, that is inconsistent with the behavior of
+non-private Adam. We propose DP-AdamBC, an optimization algorithm which removes
+the bias in the second moment estimation and retrieves the expected behaviour
+of Adam. Empirically, DP-AdamBC significantly improves the optimization
+performance of DP-Adam by up to 3.5% in final accuracy in image, text, and
+graph node classification tasks.
+
+
+
+ comment: Published as a conference paper at the 38th Annual AAAI Conference on
+ Artificial Intelligence, Vancouver, 2024
+
+
+
+
+
+
+ ☆ Behaviour Modelling of Social Animals via Causal Structure Discovery and
+ Graph Neural Networks AAMAS 2024
+
+
+
+
+
+
+
+
+ Gaël Gendron, Yang Chen, Mitchell Rogers, Yiping Liu, Mihailo Azhar, Shahrokh Heidari, David Arturo Soriano Valdez, Kobe Knowles, Padriac O'Leary, Simon Eyre, Michael Witbrock, Gillian Dobbie, Jiamou Liu, Patrice Delmas
+
+
+ Better understanding the natural world is a crucial task with a wide range of
+applications. In environments with close proximity between humans and animals,
+such as zoos, it is essential to better understand the causes behind animal
+behaviour and what interventions are responsible for changes in their
+behaviours. This can help to predict unusual behaviours, mitigate detrimental
+effects and increase the well-being of animals. There has been work on
+modelling the dynamics behind swarms of birds and insects but the complex
+social behaviours of mammalian groups remain less explored. In this work, we
+propose a method to build behavioural models using causal structure discovery
+and graph neural networks for time series. We apply this method to a mob of
+meerkats in a zoo environment and study its ability to predict future actions
+and model the behaviour distribution at an individual-level and at a group
+level. We show that our method can match and outperform standard deep learning
+architectures and generate more realistic data, while using fewer parameters
+and providing increased interpretability.
+
+
+
+ comment: 9 pages, 7 figures, accepted as an extended abstract and poster at
+ AAMAS 2024
+
+
+
+
+
+
+ ☆ Maximum entropy GFlowNets with soft Q-learning
+
+
+ Generative Flow Networks (GFNs) have emerged as a powerful tool for sampling
+discrete objects from unnormalized distributions, offering a scalable
+alternative to Markov Chain Monte Carlo (MCMC) methods. While GFNs draw
+inspiration from maximum entropy reinforcement learning (RL), the connection
+between the two has largely been unclear and seemingly applicable only in
+specific cases. This paper addresses the connection by constructing an
+appropriate reward function, thereby establishing an exact relationship between
+GFNs and maximum entropy RL. This construction allows us to introduce maximum
+entropy GFNs, which, in contrast to GFNs with uniform backward policy, achieve
+the maximum entropy attainable by GFNs without constraints on the state space.
+
+
+
+
+
+
+
+ ☆ Invariant Anomaly Detection under Distribution Shifts: A Causal
+ Perspective
+
+
+
+
+
+
+
+
+ João B. S. Carvalho, Mengtao Zhang, Robin Geyer, Carlos Cotrini, Joachim M. Buhmann
+
+
+ Anomaly detection (AD) is the machine learning task of identifying highly
+discrepant abnormal samples by solely relying on the consistency of the normal
+training samples. Under the constraints of a distribution shift, the assumption
+that training samples and test samples are drawn from the same distribution
+breaks down. In this work, by leveraging tools from causal inference we attempt
+to increase the resilience of anomaly detection models to different kinds of
+distribution shifts. We begin by elucidating a simple yet necessary statistical
+property that ensures invariant representations, which is critical for robust
+AD under both domain and covariate shifts. From this property, we derive a
+regularization term which, when minimized, leads to partial distribution
+invariance across environments. Through extensive experimental evaluation on
+both synthetic and real-world tasks, covering a range of six different AD
+methods, we demonstrated significant improvements in out-of-distribution
+performance. Under both covariate and domain shift, models regularized with our
+proposed term showed marked increased robustness. Code is available at:
+https://github.com/JoaoCarv/invariant-anomaly-detection.
+
+
+
+
+
+
+
+ ☆ Data Needs and Challenges of Quantum Dot Devices Automation: Workshop
+ Report
+
+
+
+
+
+
+
+
+ Justyna P. Zwolak, Jacob M. Taylor, Reed Andrews, Jared Benson, Garnett Bryant, Donovan Buterakos, Anasua Chatterjee, Sankar Das Sarma, Mark A. Eriksson, Eliška Greplová, Michael J. Gullans, Fabian Hader, Tyler J. Kovach, Pranav S. Mundada, Mick Ramsey, Torbjoern Rasmussen, Brandon Severin, Anthony Sigillito, Brennan Undseth, Brian Weber
+
+
+ Gate-defined quantum dots are a promising candidate system to realize
+scalable, coupled qubit systems and serve as a fundamental building block for
+quantum computers. However, present-day quantum dot devices suffer from
+imperfections that must be accounted for, which hinders the characterization,
+tuning, and operation process. Moreover, with an increasing number of quantum
+dot qubits, the relevant parameter space grows sufficiently to make heuristic
+control infeasible. Thus, it is imperative that reliable and scalable
+autonomous tuning approaches are developed. In this report, we outline current
+challenges in automating quantum dot device tuning and operation with a
+particular focus on datasets, benchmarking, and standardization. We also
+present ideas put forward by the quantum dot community on how to overcome them.
+
+
+
+ comment: White paper/overview based on a workshop held at the National
+ Institute of Standards and Technology, Gaithersburg, MD. 13 pages
+
+ Quantum federated learning (QFL) can facilitate collaborative learning across
+multiple clients using quantum machine learning (QML) models, while preserving
+data privacy. Although recent advances in QFL span different tasks like
+classification while leveraging several data types, no prior work has focused
+on developing a QFL framework that utilizes temporal data to approximate
+functions useful to analyze the performance of distributed quantum sensing
+networks. In this paper, a novel QFL framework that is the first to integrate
+quantum long short-term memory (QLSTM) models with temporal data is proposed.
+The proposed federated QLSTM (FedQLSTM) framework is exploited for performing
+the task of function approximation. In this regard, three key use cases are
+presented: Bessel function approximation, sinusoidal delayed quantum feedback
+control function approximation, and Struve function approximation. Simulation
+results confirm that, for all considered use cases, the proposed FedQLSTM
+framework achieves a faster convergence rate under one local training epoch,
+minimizing the overall computations, and saving 25-33% of the number of
+communication rounds needed until convergence compared to an FL framework with
+classical LSTM models.
+
+
+
+ comment: 20 pages, 9 figures
+
+
+
+
+
+
+ ☆ Geo2SigMap: High-Fidelity RF Signal Mapping Using Geographic Databases
+
+
+ Radio frequency (RF) signal mapping, which is the process of analyzing and
+predicting the RF signal strength and distribution across specific areas, is
+crucial for cellular network planning and deployment. Traditional approaches to
+RF signal mapping rely on statistical models constructed based on measurement
+data, which offer low complexity but often lack accuracy, or ray tracing tools,
+which provide enhanced precision for the target area but suffer from increased
+computational complexity. Recently, machine learning (ML) has emerged as a
+data-driven method for modeling RF signal propagation, which leverages models
+trained on synthetic datasets to perform RF signal mapping in "unseen" areas.
+ In this paper, we present Geo2SigMap, an ML-based framework for efficient and
+high-fidelity RF signal mapping using geographic databases. First, we develop
+an automated framework that seamlessly integrates three open-source tools:
+OpenStreetMap (geographic databases), Blender (computer graphics), and Sionna
+(ray tracing), enabling the efficient generation of large-scale 3D building
+maps and ray tracing models. Second, we propose a cascaded U-Net model, which
+is pre-trained on synthetic datasets and employed to generate detailed RF
+signal maps, leveraging environmental information and sparse measurement data.
+Finally, we evaluate the performance of Geo2SigMap via a real-world measurement
+campaign, where three types of user equipment (UE) collect over 45,000 data
+points related to cellular information from six LTE cells operating in the
+citizens broadband radio service (CBRS) band. Our results show that Geo2SigMap
+achieves an average root-mean-square-error (RMSE) of 6.04 dB for predicting the
+reference signal received power (RSRP) at the UE, representing an average RMSE
+improvement of 3.59 dB compared to existing methods.
+
+
+
+
+
+
+
+ ☆ Exploiting Novel GPT-4 APIs
+
+
+
+
+
+
+
+
+ Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
+
+
+ Language model attacks typically assume one of two extreme threat models:
+full white-box access to model weights, or black-box access limited to a text
+generation API. However, real-world APIs are often more flexible than just text
+generation: these APIs expose ``gray-box'' access leading to new threat
+vectors. To explore this, we red-team three new functionalities exposed in the
+GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that
+fine-tuning a model on as few as 15 harmful examples or 100 benign examples can
+remove core safeguards from GPT-4, enabling a range of harmful outputs.
+Furthermore, we find that GPT-4 Assistants readily divulge the function call
+schema and can be made to execute arbitrary function calls. Finally, we find
+that knowledge retrieval can be hijacked by injecting instructions into
+retrieval documents. These vulnerabilities highlight that any additions to the
+functionality exposed by an API can create new vulnerabilities.
+
+
+
+ comment: 10 pages, 1 figure, 4 tables
+
+
+
+
+
+
+ ☆ Fairness in Submodular Maximization over a Matroid Constraint
+
+
+
+
+
+
+
+
+ Marwa El Halabi, Jakub Tarnawski, Ashkan Norouzi-Fard, Thuy-Duong Vuong
+
+
+ Submodular maximization over a matroid constraint is a fundamental problem
+with various applications in machine learning. Some of these applications
+involve decision-making over datapoints with sensitive attributes such as
+gender or race. In such settings, it is crucial to guarantee that the selected
+solution is fairly distributed with respect to this attribute. Recently,
+fairness has been investigated in submodular maximization under a cardinality
+constraint in both the streaming and offline settings, however the more general
+problem with matroid constraint has only been considered in the streaming
+setting and only for monotone objectives. This work fills this gap. We propose
+various algorithms and impossibility results offering different trade-offs
+between quality, fairness, and generality.
+
+
+ Preference-based Reinforcement Learning (PbRL) is an active area of research,
+and has made significant strides in single-agent actor and in observer
+human-in-the-loop scenarios. However, its application within the co-operative
+multi-agent RL frameworks, where humans actively participate and express
+preferences for agent behavior, remains largely uncharted. We consider a
+two-agent (Human-AI) cooperative setup where both the agents are rewarded
+according to human's reward function for the team. However, the agent does not
+have access to it, and instead, utilizes preference-based queries to elicit its
+objectives and human's preferences for the robot in the human-robot team. We
+introduce the notion of Human-Flexibility, i.e. whether the human partner is
+amenable to multiple team strategies, with a special case being Specified
+Orchestration where the human has a single team policy in mind (most
+constrained case). We propose a suite of domains to study PbRL for Human-AI
+cooperative setup which explicitly require forced cooperation. Adapting
+state-of-the-art single-agent PbRL algorithms to our two-agent setting, we
+conduct a comprehensive benchmarking study across our domain suite. Our
+findings highlight the challenges associated with high degree of
+Human-Flexibility and the limited access to the human's envisioned policy in
+PbRL for Human-AI cooperation. Notably, we observe that PbRL algorithms exhibit
+effective performance exclusively in the case of Specified Orchestration which
+can be seen as an upper bound PbRL performance for future research.
+
+
+
+
+
+
+
+ ☆ Probing Biological and Artificial Neural Networks with Task-dependent
+ Neural Manifolds
+
+
+
+
+
+
+
+
+ Michael Kuoch, Chi-Ning Chou, Nikhil Parthasarathy, Joel Dapello, James J. DiCarlo, Haim Sompolinsky, SueYeon Chung
+
+
+ Recently, growth in our understanding of the computations performed in both
+biological and artificial neural networks has largely been driven by either
+low-level mechanistic studies or global normative approaches. However, concrete
+methodologies for bridging the gap between these levels of abstraction remain
+elusive. In this work, we investigate the internal mechanisms of neural
+networks through the lens of neural population geometry, aiming to provide
+understanding at an intermediate level of abstraction, as a way to bridge that
+gap. Utilizing manifold capacity theory (MCT) from statistical physics and
+manifold alignment analysis (MAA) from high-dimensional statistics, we probe
+the underlying organization of task-dependent manifolds in deep neural networks
+and macaque neural recordings. Specifically, we quantitatively characterize how
+different learning objectives lead to differences in the organizational
+strategies of these models and demonstrate how these geometric analyses are
+connected to the decodability of task-relevant information. These analyses
+present a strong direction for bridging mechanistic and normative theories in
+neural networks through neural population geometry, potentially opening up many
+future research avenues in both machine learning and neuroscience.
+
+
+
+ comment: To appear in the proceedings of the Conference on Parsimony and
+ Learning (CPAL) 2024
+
+
+
+
+
+
+ ☆ Fine-grained Forecasting Models Via Gaussian Process Blurring Effect
+
+
+ Time series forecasting is a challenging task due to the existence of complex
+and dynamic temporal dependencies. This can lead to incorrect predictions by
+even the best forecasting models. Using more training data is one way to
+improve the accuracy, but this source is often limited. In contrast, we are
+building on successful denoising approaches for image generation by advocating
+for an end-to-end forecasting and denoising paradigm.
+ We propose an end-to-end forecast-blur-denoise forecasting framework by
+encouraging a division of labors between the forecasting and the denoising
+models. The initial forecasting model is directed to focus on accurately
+predicting the coarse-grained behavior, while the denoiser model focuses on
+capturing the fine-grained behavior that is locally blurred by integrating a
+Gaussian Process model. All three parts are interacting for the best end-to-end
+performance. Our extensive experiments demonstrate that our proposed approach
+is able to improve the forecasting accuracy of several state-of-the-art
+forecasting models as well as several other denoising approaches.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ Characterizing and Classifying Developer Forum Posts with their
+ Intentions
+
+
+
+
+
+
+
+
+ Xingfang Wu, Eric Laufer, Heng Li, Foutse Khomh, Santhosh Srinivasan, Jayden Luo
+
+
+ With the rapid growth of the developer community, the amount of posts on
+online technical forums has been growing rapidly, which poses difficulties for
+users to filter useful posts and find important information. Tags provide a
+concise feature dimension for users to locate their interested posts and for
+search engines to index the most relevant posts according to the queries.
+However, most tags are only focused on the technical perspective (e.g., program
+language, platform, tool). In most cases, forum posts in online developer
+communities reveal the author's intentions to solve a problem, ask for advice,
+share information, etc. The modeling of the intentions of posts can provide an
+extra dimension to the current tag taxonomy. By referencing previous studies
+and learning from industrial perspectives, we create a refined taxonomy for the
+intentions of technical forum posts. Through manual labeling and analysis on a
+sampled post dataset extracted from online forums, we understand the relevance
+between the constitution of posts (code, error messages) and their intentions.
+Furthermore, inspired by our manual study, we design a pre-trained
+transformer-based model to automatically predict post intentions. The best
+variant of our intention prediction framework, which achieves a Micro F1-score
+of 0.589, Top 1-3 accuracy of 62.6% to 87.8%, and an average AUC of 0.787,
+outperforms the state-of-the-art baseline approach. Our characterization and
+automated classification of forum posts regarding their intentions may help
+forum maintainers or third-party tool developers improve the organization and
+retrieval of posts on technical forums. We have released our annotated dataset
+and codes in our supplementary material package.
+
+
+
+ comment: 39 pages
+
+
+
+
+
+
+ ☆ Deep Neural Networks and Finite Elements of Any Order on Arbitrary
+ Dimensions
+
+
+ In this study, we establish that deep neural networks employing ReLU and
+ReLU$^2$ activation functions are capable of representing Lagrange finite
+element functions of any order on simplicial meshes across arbitrary
+dimensions. We introduce a novel global formulation of the basis functions for
+Lagrange elements, grounded in a geometric decomposition of these elements and
+leveraging two essential properties of high-dimensional simplicial meshes and
+barycentric coordinate functions. This representation theory facilitates a
+natural approximation result for such deep neural networks. Our findings
+present the first demonstration of how deep neural networks can systematically
+generate general continuous piecewise polynomial functions.
+
+
+
+ comment: 23 pages, 2 figures
+
+
+
+
+
+
+ ☆ Elevating Defenses: Bridging Adversarial Training and Watermarking for
+ Model Resilience AAAI 2024
+
+
+ Machine learning models are being used in an increasing number of critical
+applications; thus, securing their integrity and ownership is critical. Recent
+studies observed that adversarial training and watermarking have a conflicting
+interaction. This work introduces a novel framework to integrate adversarial
+training with watermarking techniques to fortify against evasion attacks and
+provide confident model verification in case of intellectual property theft. We
+use adversarial training together with adversarial watermarks to train a robust
+watermarked model. The key intuition is to use a higher perturbation budget to
+generate adversarial watermarks compared to the budget used for adversarial
+training, thus avoiding conflict. We use the MNIST and Fashion-MNIST datasets
+to evaluate our proposed technique on various model stealing attacks. The
+results obtained consistently outperform the existing baseline in terms of
+robustness performance and further prove the resilience of this defense against
+pruning and fine-tuning removal attacks.
+
+
+
+
+
+
+
+
+ Osama A. Hanna, Merve Karakas, Lin F. Yang, Christina Fragouli
+
+
+ Multi-Armed Bandit (MAB) systems are witnessing an upswing in applications
+within multi-agent distributed environments, leading to the advancement of
+collaborative MAB algorithms. In such settings, communication between agents
+executing actions and the primary learner making decisions can hinder the
+learning process. A prevalent challenge in distributed learning is action
+erasure, often induced by communication delays and/or channel noise. This
+results in agents possibly not receiving the intended action from the learner,
+subsequently leading to misguided feedback. In this paper, we introduce novel
+algorithms that enable learners to interact concurrently with distributed
+agents across heterogeneous action erasure channels with different action
+erasure probabilities. We illustrate that, in contrast to existing bandit
+algorithms, which experience linear regret, our algorithms assure sub-linear
+regret guarantees. Our proposed solutions are founded on a meticulously crafted
+repetition protocol and scheduling of learning across heterogeneous channels.
+To our knowledge, these are the first algorithms capable of effectively
+learning through heterogeneous action erasure channels. We substantiate the
+superior performance of our algorithm through numerical experiments,
+emphasizing their practical significance in addressing issues related to
+communication constraints and delays in multi-agent environments.
+
+
+ We study the problem of contextual feature selection, where the goal is to
+learn a predictive function while identifying subsets of informative features
+conditioned on specific contexts. Towards this goal, we generalize the recently
+proposed stochastic gates (STG) Yamada et al. [2020] by modeling the
+probabilistic gates as conditional Bernoulli variables whose parameters are
+predicted based on the contextual variables. Our new scheme, termed
+conditional-STG (c-STG), comprises two networks: a hypernetwork that
+establishes the mapping between contextual variables and probabilistic feature
+selection parameters and a prediction network that maps the selected feature to
+the response variable. Training the two networks simultaneously ensures the
+comprehensive incorporation of context and feature selection within a unified
+model. We provide a theoretical analysis to examine several properties of the
+proposed framework. Importantly, our model leads to improved flexibility and
+adaptability of feature selection and, therefore, can better capture the
+nuances and variations in the data. We apply c-STG to simulated and real-world
+datasets, including healthcare, housing, and neuroscience, and demonstrate that
+it effectively selects contextually meaningful features, thereby enhancing
+predictive performance and interpretability.
+
+
+
+
+
+
+
+ ☆ GenoCraft: A Comprehensive, User-Friendly Web-Based Platform for
+ High-Throughput Omics Data Analysis and Visualization
+
+
+
+
+
+
+
+
+ Yingzhou Lu, Minjie Shen, Yue Zhao, Chenhao Li, Fan Meng, Xiao Wang, David Herrington, Yue Wang, Tim Fu, Capucine Van Rechem
+
+
+ The surge in high-throughput omics data has reshaped the landscape of
+biological research, underlining the need for powerful, user-friendly data
+analysis and interpretation tools. This paper presents GenoCraft, a web-based
+comprehensive software solution designed to handle the entire pipeline of omics
+data processing. GenoCraft offers a unified platform featuring advanced
+bioinformatics tools, covering all aspects of omics data analysis. It
+encompasses a range of functionalities, such as normalization, quality control,
+differential analysis, network analysis, pathway analysis, and diverse
+visualization techniques. This software makes state-of-the-art omics data
+analysis more accessible to a wider range of users. With GenoCraft, researchers
+and data scientists have access to an array of cutting-edge bioinformatics
+tools under a user-friendly interface, making it a valuable resource for
+managing and analyzing large-scale omics data. The API with an interactive web
+interface is publicly available at https://genocraft.stanford. edu/. We also
+release all the codes in https://github.com/futianfan/GenoCraft.
+
+
+
+
+
+
+
+ ☆ Deep Reinforcement Learning Based Placement for Integrated Access
+ Backhauling in UAV-Assisted Wireless Networks
+
+
+ The advent of fifth generation (5G) networks has opened new avenues for
+enhancing connectivity, particularly in challenging environments like remote
+areas or disaster-struck regions. Unmanned aerial vehicles (UAVs) have been
+identified as a versatile tool in this context, particularly for improving
+network performance through the Integrated access and backhaul (IAB) feature of
+5G. However, existing approaches to UAV-assisted network enhancement face
+limitations in dynamically adapting to varying user locations and network
+demands. This paper introduces a novel approach leveraging deep reinforcement
+learning (DRL) to optimize UAV placement in real-time, dynamically adjusting to
+changing network conditions and user requirements. Our method focuses on the
+intricate balance between fronthaul and backhaul links, a critical aspect often
+overlooked in current solutions. The unique contribution of this work lies in
+its ability to autonomously position UAVs in a way that not only ensures robust
+connectivity to ground users but also maintains seamless integration with
+central network infrastructure. Through various simulated scenarios, we
+demonstrate how our approach effectively addresses these challenges, enhancing
+coverage and network performance in critical areas. This research fills a
+significant gap in UAV-assisted 5G networks, providing a scalable and adaptive
+solution for future mobile networks.
+
+
+
+
+
+
+
+ ☆ AI-Lorenz: A physics-data-driven framework for black-box and gray-box
+ identification of chaotic systems with symbolic regression
+
+
+
+
+
+
+
+
+ Mario De Florio, Ioannis G. Kevrekidis, George Em Karniadakis
+
+
+ Discovering mathematical models that characterize the observed behavior of
+dynamical systems remains a major challenge, especially for systems in a
+chaotic regime. The challenge is even greater when the physics underlying such
+systems is not yet understood, and scientific inquiry must solely rely on
+empirical data. Driven by the need to fill this gap, we develop a framework
+that learns mathematical expressions modeling complex dynamical behaviors by
+identifying differential equations from noisy and sparse observable data. We
+train a small neural network to learn the dynamics of a system, its rate of
+change in time, and missing model terms, which are used as input for a symbolic
+regression algorithm to autonomously distill the explicit mathematical terms.
+This, in turn, enables us to predict the future evolution of the dynamical
+behavior. The performance of this framework is validated by recovering the
+right-hand sides and unknown terms of certain complex, chaotic systems such as
+the well-known Lorenz system, a six-dimensional hyperchaotic system, and the
+non-autonomous Sprott chaotic system, and comparing them with their known
+analytical expressions.
+
+
+
+ comment: 28 pages, 15 figures, 9 tables
+
+
+
+
+
+
+ ♻ ☆ Convex Clustering through MM: An Efficient Algorithm to Perform
+ Hierarchical Clustering
+
+
+
+
+
+
+
+
+ Daniel J. W. Touw, Patrick J. F. Groenen, Yoshikazu Terada
+
+
+ Convex clustering is a modern method with both hierarchical and $k$-means
+clustering characteristics. Although convex clustering can capture complex
+clustering structures hidden in data, the existing convex clustering algorithms
+are not scalable to large data sets with sample sizes greater than several
+thousands. Moreover, it is known that convex clustering sometimes fails to
+produce a complete hierarchical clustering structure. This issue arises if
+clusters split up or the minimum number of possible clusters is larger than the
+desired number of clusters. In this paper, we propose convex clustering through
+majorization-minimization (CCMM) -- an iterative algorithm that uses cluster
+fusions and a highly efficient updating scheme derived using diagonal
+majorization. Additionally, we explore different strategies to ensure that the
+hierarchical clustering structure terminates in a single cluster. With a
+current desktop computer, CCMM efficiently solves convex clustering problems
+featuring over one million objects in seven-dimensional space, achieving a
+solution time of 51 seconds on average.
+
+
+
+ comment: 27 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ Cascade Speculative Drafting for Even Faster LLM Inference
+
+
+ Speculative decoding enhances the efficiency of large language models (LLMs)
+by leveraging a draft model to draft for a larger target model to review.
+However, drafting in speculative decoding involves slow autoregressive
+generation and generating tokens of different importance with the same time
+allocation. These two inefficiencies lead to its suboptimal performance. To
+address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a
+novel approach that employs two types of cascades. The Vertical Cascade
+eliminates autoregressive generation from neural models. The Horizontal Cascade
+constitutes efficient time allocation in drafting with its optimality supported
+by our theoretical analysis. Combining both cascades, our CS. Drafting
+algorithm has achieved up to 72 percent additional speedup over speculative
+decoding in our experiments while keeping the same output distribution.
+
+
+ In this work we design graph neural network architectures that can be used to
+obtain optimal approximation algorithms for a large class of combinatorial
+optimization problems using powerful algorithmic tools from semidefinite
+programming (SDP). Concretely, we prove that polynomial-sized message passing
+algorithms can represent the most powerful polynomial time algorithms for Max
+Constraint Satisfaction Problems assuming the Unique Games Conjecture. We
+leverage this result to construct efficient graph neural network architectures,
+OptGNN, that obtain high-quality approximate solutions on landmark
+combinatorial optimization problems such as Max Cut and maximum independent
+set. Our approach achieves strong empirical results across a wide range of
+real-world and synthetic datasets against both neural baselines and classical
+algorithms. Finally, we take advantage of OptGNN's ability to capture convex
+relaxations to design an algorithm for producing dual certificates of
+optimality (bounds on the optimal solution) from the learned embeddings of
+OptGNN.
+
+
+
+ comment: Updated references, fixed more typos and wording issues
+
+ Open-vocabulary image segmentation aims to partition an image into semantic
+regions according to arbitrary text descriptions. However, complex visual
+scenes can be naturally decomposed into simpler parts and abstracted at
+multiple levels of granularity, introducing inherent segmentation ambiguity.
+Unlike existing methods that typically sidestep this ambiguity and treat it as
+an external factor, our approach actively incorporates a hierarchical
+representation encompassing different semantic-levels into the learning
+process. We propose a decoupled text-image fusion mechanism and representation
+learning modules for both "things" and "stuff". Additionally, we systematically
+examine the differences that exist in the textual and visual features between
+these types of categories. Our resulting model, named HIPIE, tackles
+HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a
+unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,
+Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the
+state-of-the-art results at various levels of image comprehension, including
+semantic-level (e.g., semantic segmentation), instance-level (e.g.,
+panoptic/referring segmentation and object detection), as well as part-level
+(e.g., part/subpart segmentation) tasks. Our code is released at
+https://github.com/berkeley-hipie/HIPIE.
+
+
+
+
+
+
+
+ ♻ ☆ Optimistic Policy Gradient in Multi-Player Markov Games with a Single
+ Controller: Convergence Beyond the Minty Property AAAI 2024
+
+
+
+
+
+
+
+
+ Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm
+
+
+ Policy gradient methods enjoy strong practical performance in numerous tasks
+in reinforcement learning. Their theoretical understanding in multiagent
+settings, however, remains limited, especially beyond two-player competitive
+and potential Markov games. In this paper, we develop a new framework to
+characterize optimistic policy gradient methods in multi-player Markov games
+with a single controller. Specifically, under the further assumption that the
+game exhibits an equilibrium collapse, in that the marginals of coarse
+correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to
+stationary $\epsilon$-NE in $O(1/\epsilon^2)$ iterations, where $O(\cdot)$
+suppresses polynomial factors in the natural parameters of the game. Such an
+equilibrium collapse is well-known to manifest itself in two-player zero-sum
+Markov games, but also occurs even in a class of multi-player Markov games with
+separable interactions, as established by recent work. As a result, we bypass
+known complexity barriers for computing stationary NE when either of our
+assumptions fails. Our approach relies on a natural generalization of the
+classical Minty property that we introduce, which we anticipate to have further
+applications beyond Markov games.
+
+
+ Generative Models (GMs) have attracted considerable attention due to their
+tremendous success in various domains, such as computer vision where they are
+capable to generate impressive realistic-looking images. Likelihood-based GMs
+are attractive due to the possibility to generate new data by a single model
+evaluation. However, they typically achieve lower sample quality compared to
+state-of-the-art score-based diffusion models (DMs). This paper provides a
+significant step in the direction of addressing this limitation. The idea is to
+borrow one of the strengths of score-based DMs, which is the ability to perform
+accurate density estimation in low-density regions and to address manifold
+overfitting by means of data mollification. We connect data mollification
+through the addition of Gaussian noise to Gaussian homotopy, which is a
+well-known technique to improve optimization. Data mollification can be
+implemented by adding one line of code in the optimization loop, and we
+demonstrate that this provides a boost in generation quality of
+likelihood-based GMs, without computational overheads. We report results on
+image data sets with popular likelihood-based GMs, including variants of
+variational autoencoders and normalizing flows, showing large improvements in
+FID score.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Unifying GANs and Score-Based Diffusion as Generative Particle Models
+
+
+
+
+
+
+
+
+ Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy
+
+
+ Particle-based deep generative models, such as gradient flows and score-based
+diffusion models, have recently gained traction thanks to their striking
+performance. Their principle of displacing particle distributions using
+differential equations is conventionally seen as opposed to the previously
+widespread generative adversarial networks (GANs), which involve training a
+pushforward generator network. In this paper we challenge this interpretation,
+and propose a novel framework that unifies particle and adversarial generative
+models by framing generator training as a generalization of particle models.
+This suggests that a generator is an optional addition to any such generative
+model. Consequently, integrating a generator into a score-based diffusion model
+and training a GAN without a generator naturally emerge from our framework. We
+empirically test the viability of these original models as proofs of concepts
+of potential applications of our framework.
+
+
+ We provide a systematic investigation of using physics-informed neural
+networks to compute Lyapunov functions. We encode Lyapunov conditions as a
+partial differential equation (PDE) and use this for training neural network
+Lyapunov functions. We analyze the analytical properties of the solutions to
+the Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov
+equation in training neural Lyapunov functions can lead to approximate regions
+of attraction close to the true domain of attraction. We also examine
+approximation errors and the convergence of neural approximations to the unique
+solution of Zubov's equation. We then provide sufficient conditions for the
+learned neural Lyapunov functions that can be readily verified by
+satisfiability modulo theories (SMT) solvers, enabling formal verification of
+both local stability analysis and region-of-attraction estimates in the large.
+Through a number of nonlinear examples, ranging from low to high dimensions, we
+demonstrate that the proposed framework can outperform traditional
+sums-of-squares (SOS) Lyapunov functions obtained using semidefinite
+programming (SDP).
+
+
+
+ comment: The current version has been submitted for publication; corrected
+ some minor typos from v2
+
+
+
+
+
+
+ ♻ ☆ ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical
+ Prior Probability Maps for Thoracic Disease Classification
+
+
+
+
+
+
+
+
+ Md. Iqbal Hossain, Mohammad Zunaed, Md. Kawsar Ahmed, S. M. Jawwad Hossain, Anwarul Hasan, Taufiq Hasan
+
+
+ Objective: Computer-aided disease diagnosis and prognosis based on medical
+images is a rapidly emerging field. Many Convolutional Neural Network (CNN)
+architectures have been developed by researchers for disease classification and
+localization from chest X-ray images. It is known that different thoracic
+disease lesions are more likely to occur in specific anatomical regions
+compared to others. This article aims to incorporate this disease and
+region-dependent prior probability distribution within a deep learning
+framework. Methods: We present the ThoraX-PriorNet, a novel attention-based CNN
+model for thoracic disease classification. We first estimate a
+disease-dependent spatial probability, i.e., an anatomical prior, that
+indicates the probability of occurrence of a disease in a specific region in a
+chest X-ray image. Next, we develop a novel attention-based classification
+model that combines information from the estimated anatomical prior and
+automatically extracted chest region of interest (ROI) masks to provide
+attention to the feature maps generated from a deep convolution network. Unlike
+previous works that utilize various self-attention mechanisms, the proposed
+method leverages the extracted chest ROI masks along with the probabilistic
+anatomical prior information, which selects the region of interest for
+different diseases to provide attention. Results: The proposed method shows
+superior performance in disease classification on the NIH ChestX-ray14 dataset
+compared to existing state-of-the-art methods while reaching an area under the
+ROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior
+attention method shows competitive performance compared to state-of-the-art
+methods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04
+with an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5,
+0.6, and 0.7, respectively.
+
+
+
+ comment: Accepted to IEEE ACCESS
+
+
+
+
+
+
+ ♻ ☆ ChessGPT: Bridging Policy Learning and Language Modeling NeurIPS 2023
+
+
+
+
+
+
+
+
+ Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, Jun Wang
+
+
+ When solving decision-making tasks, humans typically depend on information
+from two key sources: (1) Historical policy data, which provides interaction
+replay from the environment, and (2) Analytical insights in natural language
+form, exposing the invaluable thought process or strategic considerations.
+Despite this, the majority of preceding research focuses on only one source:
+they either use historical replay exclusively to directly learn policy or value
+functions, or engaged in language model training utilizing mere language
+corpus. In this paper, we argue that a powerful autonomous agent should cover
+both sources. Thus, we propose ChessGPT, a GPT model bridging policy learning
+and language modeling by integrating data from these two sources in Chess
+games. Specifically, we build a large-scale game and language dataset related
+to chess. Leveraging the dataset, we showcase two model examples ChessCLIP and
+ChessGPT, integrating policy learning and language modeling. Finally, we
+propose a full evaluation framework for evaluating language model's chess
+ability. Experimental results validate our model and dataset's effectiveness.
+We open source our code, model, and dataset at
+https://github.com/waterhorse1/ChessGPT.
+
+
+
+ comment: Published as a conference article in NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Prot2Text: Multimodal Protein's Function Generation with GNNs and
+ Transformers
+
+
+ The complex nature of big biological systems pushed some scientists to
+classify its understanding under the inconceivable missions. Different leveled
+challenges complicated this task, one of is the prediction of a protein's
+function. In recent years, significant progress has been made in this field
+through the development of various machine learning approaches. However, most
+existing methods formulate the task as a multi-classification problem, i.e
+assigning predefined labels to proteins. In this work, we propose a novel
+approach, \textbf{Prot2Text}, which predicts a protein function's in a free
+text style, moving beyond the conventional binary or categorical
+classifications. By combining Graph Neural Networks(GNNs) and Large Language
+Models(LLMs), in an encoder-decoder framework, our model effectively integrates
+diverse data types including proteins' sequences, structures, and textual
+annotations. This multimodal approach allows for a holistic representation of
+proteins' functions, enabling the generation of detailed and accurate
+descriptions. To evaluate our model, we extracted a multimodal protein dataset
+from SwissProt, and demonstrate empirically the effectiveness of Prot2Text.
+These results highlight the transformative impact of multimodal models,
+specifically the fusion of GNNs and LLMs, empowering researchers with powerful
+tools for more accurate prediction of proteins' functions. The code, the models
+and a demo will be publicly released.
+
+
+
+
+
+
+
+ ♻ ☆ Invariant Learning via Probability of Sufficient and Necessary Causes
+
+
+
+
+
+
+
+
+ Mengyue Yang, Zhen Fang, Yonggang Zhang, Yali Du, Furui Liu, Jean-Francois Ton, Jianhong Wang, Jun Wang
+
+
+ Out-of-distribution (OOD) generalization is indispensable for learning models
+in the wild, where testing distribution typically unknown and different from
+the training. Recent methods derived from causality have shown great potential
+in achieving OOD generalization. However, existing methods mainly focus on the
+invariance property of causes, while largely overlooking the property of
+\textit{sufficiency} and \textit{necessity} conditions. Namely, a necessary but
+insufficient cause (feature) is invariant to distribution shift, yet it may not
+have required accuracy. By contrast, a sufficient yet unnecessary cause
+(feature) tends to fit specific data well but may have a risk of adapting to a
+new domain. To capture the information of sufficient and necessary causes, we
+employ a classical concept, the probability of sufficiency and necessary causes
+(PNS), which indicates the probability of whether one is the necessary and
+sufficient cause. To associate PNS with OOD generalization, we propose PNS risk
+and formulate an algorithm to learn representation with a high PNS value. We
+theoretically analyze and prove the generalizability of the PNS risk.
+Experiments on both synthetic and real-world benchmarks demonstrate the
+effectiveness of the proposed method. The details of the implementation can be
+found at the GitHub repository: https://github.com/ymy4323460/CaSN.
+
+
+
+
+
+
+
+ ♻ ☆ Fair GANs through model rebalancing for extremely imbalanced class
+ distributions
+
+
+ Deep generative models require large amounts of training data. This often
+poses a problem as the collection of datasets can be expensive and difficult,
+in particular datasets that are representative of the appropriate underlying
+distribution (e.g. demographic). This introduces biases in datasets which are
+further propagated in the models. We present an approach to construct an
+unbiased generative adversarial network (GAN) from an existing biased GAN by
+rebalancing the model distribution. We do so by generating balanced data from
+an existing imbalanced deep generative model using an evolutionary algorithm
+and then using this data to train a balanced generative model. Additionally, we
+propose a bias mitigation loss function that minimizes the deviation of the
+learned class distribution from being equiprobable. We show results for the
+StyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset
+for racial fairness and see that the proposed approach improves on the fairness
+metric by almost 5 times, whilst maintaining image quality. We further validate
+our approach by applying it to an imbalanced CIFAR10 dataset where we show that
+we can obtain comparable fairness and image quality as when training on a
+balanced CIFAR10 dataset which is also twice as large. Lastly, we argue that
+the traditionally used image quality metrics such as Frechet inception distance
+(FID) are unsuitable for scenarios where the class distributions are imbalanced
+and a balanced reference set is not available.
+
+
+
+
+
+
+
+ ♻ ☆ Limitations of Face Image Generation AAAI
+
+
+
+
+
+
+
+
+ Harrison Rosenberg, Shimaa Ahmed, Guruprasad V Ramesh, Ramya Korlakai Vinayak, Kassem Fawaz
+
+
+ Text-to-image diffusion models have achieved widespread popularity due to
+their unprecedented image generation capability. In particular, their ability
+to synthesize and modify human faces has spurred research into using generated
+face images in both training data augmentation and model performance
+assessments. In this paper, we study the efficacy and shortcomings of
+generative models in the context of face generation. Utilizing a combination of
+qualitative and quantitative measures, including embedding-based metrics and
+user studies, we present a framework to audit the characteristics of generated
+faces conditioned on a set of social attributes. We applied our framework on
+faces generated through state-of-the-art text-to-image diffusion models. We
+identify several limitations of face image generation that include faithfulness
+to the text prompt, demographic disparities, and distributional shifts.
+Furthermore, we present an analytical model that provides insights into how
+training data selection contributes to the performance of generative models.
+
+
+
+ comment: Accepted to The 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI 2024)
+
+
+
+
+
+
+ ♻ ☆ Strategyproof Decision-Making in Panel Data Settings and Beyond
+
+
+ We consider the problem of decision-making using panel data, in which a
+decision-maker gets noisy, repeated measurements of multiple units (or agents).
+We consider a setup where there is a pre-intervention period, when the
+principal observes the outcomes of each unit, after which the principal uses
+these observations to assign a treatment to each unit. Unlike this classical
+setting, we permit the units generating the panel data to be strategic, i.e.
+units may modify their pre-intervention outcomes in order to receive a more
+desirable intervention. The principal's goal is to design a strategyproof
+intervention policy, i.e. a policy that assigns units to their
+utility-maximizing interventions despite their potential strategizing. We first
+identify a necessary and sufficient condition under which a strategyproof
+intervention policy exists, and provide a strategyproof mechanism with a simple
+closed form when one does exist. Along the way, we prove impossibility results
+for strategic multiclass classification, which may be of independent interest.
+When there are two interventions, we establish that there always exists a
+strategyproof mechanism, and provide an algorithm for learning such a
+mechanism. For three or more interventions, we provide an algorithm for
+learning a strategyproof mechanism if there exists a sufficiently large gap in
+the principal's rewards between different interventions. Finally, we
+empirically evaluate our model using real-world panel data collected from
+product sales over 18 months. We find that our methods compare favorably to
+baselines which do not take strategic interactions into consideration, even in
+the presence of model misspecification.
+
+
+
+ comment: In the fiftieth ACM SIGMETRICS International Conference on
+ Measurement and Modeling of Computer Systems (SIGMETRICS 2024)
+
+
+
+
+
+
+
+ Vibhas K. Vats, Sripad Joshi, David J. Crandall, Md. Alimoor Reza, Soon-heung Jung
+
+
+ Traditional multi-view stereo (MVS) methods rely heavily on photometric and
+geometric consistency constraints, but newer machine learning-based MVS methods
+check geometric consistency across multiple source views only as a
+post-processing step. In this paper, we present a novel approach that
+explicitly encourages geometric consistency of reference view depth maps across
+multiple source views at different scales during learning (see Fig. 1). We find
+that adding this geometric consistency loss significantly accelerates learning
+by explicitly penalizing geometrically inconsistent pixels, reducing the
+training iteration requirements to nearly half that of other MVS methods. Our
+extensive experiments show that our approach achieves a new state-of-the-art on
+the DTU and BlendedMVS datasets, and competitive results on the Tanks and
+Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt
+to enforce multi-view, multi-scale geometric consistency during learning.
+
+
+
+ comment: Accepted in WACV 2024 Link:
+ https://openaccess.thecvf.com/content/WACV2024/html/Vats_GC-MVSNet_Multi-View_Multi-Scale_Geometrically-Consistent_Multi-View_Stereo_WACV_2024_paper.html
+
+
+
+
+
+
+ ♻ ☆ Reduced Policy Optimization for Continuous Control with Hard Constraints NeurIPS2023
+
+
+
+
+
+
+
+
+ Shutong Ding, Jingya Wang, Yali Du, Ye Shi
+
+
+ Recent advances in constrained reinforcement learning (RL) have endowed
+reinforcement learning with certain safety guarantees. However, deploying
+existing constrained RL algorithms in continuous control tasks with general
+hard constraints remains challenging, particularly in those situations with
+non-convex hard constraints. Inspired by the generalized reduced gradient (GRG)
+algorithm, a classical constrained optimization technique, we propose a reduced
+policy optimization (RPO) algorithm that combines RL with GRG to address
+general hard constraints. RPO partitions actions into basic actions and
+nonbasic actions following the GRG method and outputs the basic actions via a
+policy network. Subsequently, RPO calculates the nonbasic actions by solving
+equations based on equality constraints using the obtained basic actions. The
+policy network is then updated by implicitly differentiating nonbasic actions
+with respect to basic actions. Additionally, we introduce an action projection
+procedure based on the reduced gradient and apply a modified Lagrangian
+relaxation technique to ensure inequality constraints are satisfied. To the
+best of our knowledge, RPO is the first attempt that introduces GRG to RL as a
+way of efficiently handling both equality and inequality hard constraints. It
+is worth noting that there is currently a lack of RL environments with complex
+hard constraints, which motivates us to develop three new benchmarks: two
+robotics manipulation tasks and a smart grid operation control task. With these
+benchmarks, RPO achieves better performance than previous constrained RL
+algorithms in terms of both cumulative reward and constraint violation. We
+believe RPO, along with the new benchmarks, will open up new opportunities for
+applying RL to real-world problems with complex constraints.
+
+
+
+ comment: Accepted by NeurIPS2023
+
+
+
+
+
+
+ ♻ ☆ Two Sides of The Same Coin: Bridging Deep Equilibrium Models and Neural
+ ODEs via Homotopy Continuation NeurIPS2023
+
+
+
+
+
+
+
+
+ Shutong Ding, Tianyu Cui, Jingya Wang, Ye Shi
+
+
+ Deep Equilibrium Models (DEQs) and Neural Ordinary Differential Equations
+(Neural ODEs) are two branches of implicit models that have achieved remarkable
+success owing to their superior performance and low memory consumption. While
+both are implicit models, DEQs and Neural ODEs are derived from different
+mathematical formulations. Inspired by homotopy continuation, we establish a
+connection between these two models and illustrate that they are actually two
+sides of the same coin. Homotopy continuation is a classical method of solving
+nonlinear equations based on a corresponding ODE. Given this connection, we
+proposed a new implicit model called HomoODE that inherits the property of high
+accuracy from DEQs and the property of stability from Neural ODEs. Unlike DEQs,
+which explicitly solve an equilibrium-point-finding problem via Newton's
+methods in the forward pass, HomoODE solves the equilibrium-point-finding
+problem implicitly using a modified Neural ODE via homotopy continuation.
+Further, we developed an acceleration method for HomoODE with a shared
+learnable initial point. It is worth noting that our model also provides a
+better understanding of why Augmented Neural ODEs work as long as the augmented
+part is regarded as the equilibrium point to find. Comprehensive experiments
+with several image classification tasks demonstrate that HomoODE surpasses
+existing implicit models in terms of both accuracy and memory consumption.
+
+
+
+ comment: Accepted by NeurIPS2023
+
+
+
+
+
+
+ ♻ ☆ Short Boolean Formulas as Explanations in Practice
+
+
+
+
+
+
+
+
+ Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Masood Feyzbakhsh Rankooh, Miikka Vilander
+
+
+ We investigate explainability via short Boolean formulas in the data model
+based on unary relations. As an explanation of length k, we take a Boolean
+formula of length k that minimizes the error with respect to the target
+attribute to be explained. We first provide novel quantitative bounds for the
+expected error in this scenario. We then also demonstrate how the setting works
+in practice by studying three concrete data sets. In each case, we calculate
+explanation formulas of different lengths using an encoding in Answer Set
+Programming. The most accurate formulas we obtain achieve errors similar to
+other methods on the same data sets. However, due to overfitting, these
+formulas are not necessarily ideal explanations, so we use cross validation to
+identify a suitable length for explanations. By limiting to shorter formulas,
+we obtain explanations that avoid overfitting but are still reasonably accurate
+and also, importantly, human interpretable.
+
+
+
+ comment: Long version of a paper published in JELIA 2023. Changes to version
+ 1: typos fixed, clarifications added
+
+
+
+
+
+
+ ♻ ☆ Foundation Models in Smart Agriculture: Basics, Opportunities, and
+ Challenges
+
+
+ The past decade has witnessed the rapid development of ML and DL
+methodologies in agricultural systems, showcased by great successes in variety
+of agricultural applications. However, these conventional ML/DL models have
+certain limitations: They heavily rely on large, costly-to-acquire labeled
+datasets for training, require specialized expertise for development and
+maintenance, and are mostly tailored for specific tasks, thus lacking
+generalizability. Recently, foundation models have demonstrated remarkable
+successes in language and vision tasks across various domains. These models are
+trained on a vast amount of data from multiple domains and modalities. Once
+trained, they can accomplish versatile tasks with just minor fine-tuning and
+minimal task-specific labeled data. Despite their proven effectiveness and huge
+potential, there has been little exploration of applying FMs to agriculture
+fields. Therefore, this study aims to explore the potential of FMs in the field
+of smart agriculture. In particular, we present conceptual tools and technical
+background to facilitate the understanding of the problem space and uncover new
+research directions in this field. To this end, we first review recent FMs in
+the general computer science domain and categorize them into four categories:
+language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs.
+Subsequently, we outline the process of developing agriculture FMs and discuss
+their potential applications in smart agriculture. We also discuss the unique
+challenges associated with developing AFMs, including model training,
+validation, and deployment. Through this study, we contribute to the
+advancement of AI in agriculture by introducing AFMs as a promising paradigm
+that can significantly mitigate the reliance on extensive labeled datasets and
+enhance the efficiency, effectiveness, and generalization of agricultural AI
+systems.
+
+
+
+ comment: 16 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ A General Recipe for the Analysis of Randomized Multi-Armed Bandit
+ Algorithms
+
+
+ In this paper we propose a general methodology to derive regret bounds for
+randomized multi-armed bandit algorithms. It consists in checking a set of
+sufficient conditions on the sampling probability of each arm and on the family
+of distributions to prove a logarithmic regret. As a direct application we
+revisit two famous bandit algorithms, Minimum Empirical Divergence (MED) and
+Thompson Sampling (TS), under various models for the distributions including
+single parameter exponential families, Gaussian distributions, bounded
+distributions, or distributions satisfying some conditions on their moments. In
+particular, we prove that MED is asymptotically optimal for all these models,
+but also provide a simple regret analysis of some TS algorithms for which the
+optimality is already known. We then further illustrate the interest of our
+approach, by analyzing a new Non-Parametric TS algorithm (h-NPTS), adapted to
+some families of unbounded reward distributions with a bounded h-moment. This
+model can for instance capture some non-parametric families of distributions
+whose variance is upper bounded by a known constant.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Learning for Survival Analysis: A Review
+
+
+
+
+
+
+
+
+ Simon Wiegrebe, Philipp Kopper, Raphael Sonabend, Bernd Bischl, Andreas Bender
+
+
+ The influx of deep learning (DL) techniques into the field of survival
+analysis in recent years has led to substantial methodological progress; for
+instance, learning from unstructured or high-dimensional data such as images,
+text or omics data. In this work, we conduct a comprehensive systematic review
+of DL-based methods for time-to-event analysis, characterizing them according
+to both survival- and DL-related attributes. In summary, the reviewed methods
+often address only a small subset of tasks relevant to time-to-event data -
+e.g., single-risk right-censored data - and neglect to incorporate more complex
+settings. Our findings are summarized in an editable, open-source, interactive
+table: https://survival-org.github.io/DL4Survival. As this research area is
+advancing rapidly, we encourage community contribution in order to keep this
+database up to date.
+
+
+ A significant amount of research is focused on developing and evaluating
+large language models for a variety of code synthesis tasks. These include
+synthesizing code from natural language instructions, synthesizing tests from
+code, and synthesizing explanations of code. In contrast, the behavior of
+instructional code editing with LLMs is understudied. These are tasks in which
+the model is instructed to update a block of code provided in a prompt. The
+editing instruction may ask for a feature to added or removed, describe a bug
+and ask for a fix, ask for a different kind of solution, or many other common
+code editing tasks.
+ We introduce a carefully crafted benchmark of code editing tasks and use it
+evaluate several cutting edge LLMs. Our evaluation exposes a significant gap
+between the capabilities of state-of-the-art open and closed models. For
+example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing
+code.
+ We also introduce a new, carefully curated, permissively licensed training
+set of code edits coupled with natural language instructions. Using this
+training set, we show that we can fine-tune open Code LLMs to significantly
+improve their code editing capabilities.
+
+
+
+
+
+
+
+ ♻ ☆ The Multiverse of Dynamic Mode Decomposition Algorithms
+
+
+ Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique
+used to decompose complex, nonlinear systems into a set of modes, revealing
+underlying patterns and dynamics through spectral analysis. This review
+presents a comprehensive and pedagogical examination of DMD, emphasizing the
+role of Koopman operators in transforming complex nonlinear dynamics into a
+linear framework. A distinctive feature of this review is its focus on the
+relationship between DMD and the spectral properties of Koopman operators, with
+particular emphasis on the theory and practice of DMD algorithms for spectral
+computations. We explore the diverse "multiverse" of DMD methods, categorized
+into three main areas: linear regression-based methods, Galerkin
+approximations, and structure-preserving techniques. Each category is studied
+for its unique contributions and challenges, providing a detailed overview of
+significant algorithms and their applications as outlined in Table 1. We
+include a MATLAB package with examples and applications to enhance the
+practical understanding of these methods. This review serves as both a
+practical guide and a theoretical reference for various DMD methods, accessible
+to both experts and newcomers, and enabling readers to delve into their areas
+of interest in the expansive field of DMD.
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+
+
+
+
+
+ ♻ ☆ Can gamification reduce the burden of self-reporting in mHealth
+ applications? A feasibility study using machine learning from smartwatch data
+ to estimate cognitive load
+
+
+
+
+
+
+
+
+ Michal K. Grzeszczyk, Paulina Adamczyk, Sylwia Marek, Ryszard Pręcikowski, Maciej Kuś, M. Patrycja Lelujko, Rosmary Blanco, Tomasz Trzciński, Arkadiusz Sitek, Maciej Malawski, Aneta Lisowska
+
+
+ The effectiveness of digital treatments can be measured by requiring patients
+to self-report their state through applications, however, it can be
+overwhelming and causes disengagement. We conduct a study to explore the impact
+of gamification on self-reporting. Our approach involves the creation of a
+system to assess cognitive load (CL) through the analysis of
+photoplethysmography (PPG) signals. The data from 11 participants is utilized
+to train a machine learning model to detect CL. Subsequently, we create two
+versions of surveys: a gamified and a traditional one. We estimate the CL
+experienced by other participants (13) while completing surveys. We find that
+CL detector performance can be enhanced via pre-training on stress detection
+tasks. For 10 out of 13 participants, a personalized CL detector can achieve an
+F1 score above 0.7. We find no difference between the gamified and non-gamified
+surveys in terms of CL but participants prefer the gamified version.
+
+
+
+
+
+
+
+
+ Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
+
+
+ In this study, we aim to extend the capabilities of diffusion-based
+text-to-image (T2I) generation models by incorporating diverse modalities
+beyond textual description, such as sketch, box, color palette, and style
+embedding, within a single model. We thus design a multimodal T2I diffusion
+model, coined as DiffBlender, by separating the channels of conditions into
+three types, i.e., image forms, spatial tokens, and non-spatial tokens. The
+unique architecture of DiffBlender facilitates adding new input modalities,
+pioneering a scalable framework for conditional image generation. Notably, we
+achieve this without altering the parameters of the existing generative model,
+Stable Diffusion, only with updating partial components. Our study establishes
+new benchmarks in multimodal generation through quantitative and qualitative
+comparisons with existing conditional generation methods. We demonstrate that
+DiffBlender faithfully blends all the provided information and showcase its
+various applications in the detailed image synthesis.
+
+
+ Distribution shifts are common in real-world datasets and can affect the
+performance and reliability of deep learning models. In this paper, we study
+two types of distribution shifts: diversity shifts, which occur when test
+samples exhibit patterns unseen during training, and correlation shifts, which
+occur when test data present a different correlation between seen invariant and
+spurious features. We propose an integrated protocol to analyze both types of
+shifts using datasets where they co-exist in a controllable manner. Finally, we
+apply our approach to a real-world classification problem of skin cancer
+analysis, using out-of-distribution datasets and specialized bias annotations.
+Our protocol reveals three findings: 1) Models learn and propagate correlation
+shifts even with low-bias training; this poses a risk of accumulating and
+combining unaccountable weak biases; 2) Models learn robust features in high-
+and low-bias scenarios but use spurious ones if test samples have them; this
+suggests that spurious correlations do not impair the learning of robust
+features; 3) Diversity shift can reduce the reliance on spurious correlations;
+this is counter intuitive since we expect biased models to depend more on
+biases when invariant features are missing. Our work has implications for
+distribution shift research and practice, providing new insights into how
+models learn and rely on spurious correlations under different types of shifts.
+
+
+
+ comment: Paper under consideration at Pattern Recognition Letters
+
+
+
+
+
+
+ ♻ ☆ Are you talking to ['xem'] or ['x', 'em']? On Tokenization and
+ Addressing Misgendering in LLMs with Pronoun Tokenization Parity
+
+
+ A large body of NLP research has documented the ways gender biases manifest
+and amplify within large language models (LLMs), though this research has
+predominantly operated within a gender binary-centric context. A growing body
+of work has identified the harmful limitations of this gender-exclusive
+framing; many LLMs cannot correctly and consistently refer to persons outside
+the gender binary, especially if they use neopronouns. While data scarcity has
+been identified as a possible culprit, the precise mechanisms through which it
+influences LLM misgendering remain underexplored. Our work addresses this gap
+by studying data scarcity's role in subword tokenization and, consequently, the
+formation of LLM word representations. We uncover how the Byte-Pair Encoding
+(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun
+misgendering through out-of-vocabulary behavior. We introduce pronoun
+tokenization parity (PTP), a novel approach to reduce LLM neopronoun
+misgendering by preserving a token's functional structure. We evaluate PTP's
+efficacy using pronoun consistency-based metrics and a novel syntax-based
+metric. Through several controlled experiments, finetuning LLMs with PTP
+improves neopronoun consistency from 14.5% to 58.4%, highlighting the
+significant role tokenization plays in LLM pronoun consistency.
+
+
+
+ comment: Accepted to 2023 Neurips Queer in AI workshop
+
+
+
+
+
+
+ ♻ ☆ Sustainable Transparency in Recommender Systems: Bayesian Ranking of
+ Images for Explainability
+
+
+ Recommender Systems have become crucial in the modern world, commonly guiding
+users towards relevant content or products, and having a large influence over
+the decisions of users and citizens. However, ensuring transparency and user
+trust in these systems remains a challenge; personalized explanations have
+emerged as a solution, offering justifications for recommendations. Among the
+existing approaches for generating personalized explanations, using existing
+visual content created by users is a promising option to maximize transparency
+and user trust. State-of-the-art models that follow this approach, despite
+leveraging highly optimized architectures, employ surrogate learning tasks that
+do not efficiently model the objective of ranking images as explanations for a
+given recommendation; this leads to a suboptimal training process with high
+computational costs that may not be reduced without affecting model
+performance. This work presents BRIE, a novel model where we leverage Bayesian
+Pairwise Ranking to enhance the training process, allowing us to consistently
+outperform state-of-the-art models in six real-world datasets while reducing
+its model size by up to 64 times and its CO${_2}$ emissions by up to 75% in
+training and inference.
+
+
+ Recently, instruction-following audio-language models have received broad
+attention for audio interaction with humans. However, the absence of
+pre-trained audio models capable of handling diverse audio types and tasks has
+hindered progress in this field. Consequently, most existing works have only
+been able to support a limited range of interaction capabilities. In this
+paper, we develop the Qwen-Audio model and address this limitation by scaling
+up audio-language pre-training to cover over 30 tasks and various audio types,
+such as human speech, natural sounds, music, and songs, to facilitate universal
+audio understanding abilities. However, directly co-training all tasks and
+datasets can lead to interference issues, as the textual labels associated with
+different datasets exhibit considerable variations due to differences in task
+focus, language, granularity of annotation, and text structure. To overcome the
+one-to-many interference, we carefully design a multi-task training framework
+by conditioning on a sequence of hierarchical tags to the decoder for
+encouraging knowledge sharing and avoiding interference through shared and
+specified tags respectively. Remarkably, Qwen-Audio achieves impressive
+performance across diverse benchmark tasks without requiring any task-specific
+fine-tuning, surpassing its counterparts. Building upon the capabilities of
+Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from
+various audios and text inputs, enabling multi-turn dialogues and supporting
+various audio-central scenarios.
+
+
+
+ comment: The code, checkpoints and demo are released at
+ https://github.com/QwenLM/Qwen-Audio
+
+
+
+
+
+
+ ♻ ☆ Ultra-fast high-dynamic range imaging of Cygnus A with the R2D2 deep
+ neural network series
+
+
+
+
+
+
+
+
+ Aghabiglou A, Chu C S, Jackson A, Dabbech A, Wiaux Y
+
+
+ We present a novel AI approach for high-resolution high-dynamic range
+synthesis imaging by radio interferometry (RI) in astronomy. R2D2, standing for
+``{R}esidual-to-{R}esidual {D}NN series for high-{D}ynamic range imaging'', is
+a model-based data-driven approach relying on hybrid deep neural networks
+(DNNs) and data-consistency updates. Its reconstruction is built as a series of
+residual images estimated as the outputs of DNNs, each taking the residual
+dirty image of the previous iteration as an input. The approach can be
+interpreted as a learned version of a matching pursuit approach, whereby model
+components are iteratively identified from residual dirty images, and of which
+CLEAN is a well-known example. We propose two variants of the R2D2 model, built
+upon two distinctive DNN architectures: a standard U-Net, and a novel unrolled
+architecture. We demonstrate their use for monochromatic intensity imaging on
+highly-sensitive observations of the radio galaxy Cygnus A at S band, from the
+Very Large Array (VLA). R2D2 is validated against CLEAN and the recent RI
+algorithms AIRI and uSARA, which respectively inject a learned implicit
+regularization and an advanced handcrafted sparsity-based regularization into
+the RI data. With only few terms in its series, the R2D2 model is able to
+deliver high-precision imaging, superseding the resolution of CLEAN, and
+matching the precision of AIRI and uSARA. In terms of computational efficiency,
+R2D2 runs at a fraction of the cost of AIRI and uSARA, and is also faster than
+CLEAN, opening the door to near real-time precision imaging in RI.
+
+
+ Graph neural networks (GNNs) have demonstrated promising performance across
+various chemistry-related tasks. However, conventional graphs only model the
+pairwise connectivity in molecules, failing to adequately represent
+higher-order connections like multi-center bonds and conjugated structures. To
+tackle this challenge, we introduce molecular hypergraphs and propose Molecular
+Hypergraph Neural Networks (MHNN) to predict the optoelectronic properties of
+organic semiconductors, where hyperedges represent conjugated structures. A
+general algorithm is designed for irregular high-order connections, which can
+efficiently operate on molecular hypergraphs with hyperedges of various orders.
+The results show that MHNN outperforms all baseline models on most tasks of
+OPV, OCELOTv1 and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D
+geometric information, surpassing the baseline model that utilizes atom
+positions. Moreover, MHNN achieves better performance than pretrained GNNs
+under limited training data, underscoring its excellent data efficiency. This
+work provides a new strategy for more general molecular representations and
+property prediction tasks related to high-order connections.
+
+
+
+
+
+
+
+ ♻ ☆ Context Matters: Data-Efficient Augmentation of Large Language Models
+ for Scientific Applications
+
+
+ In this paper, we explore the challenges inherent to Large Language Models
+(LLMs) like GPT-4, particularly their propensity for hallucinations, logic
+mistakes, and incorrect conclusions when tasked with answering complex
+questions. The capacity of LLMs to present erroneous answers in a coherent and
+semantically rigorous manner further complicates the detection of factual
+inaccuracies. This issue is especially pronounced in fields that require
+specialized expertise. Our work delves into these challenges, aiming to enhance
+the understanding and mitigation of such errors, thereby contributing to the
+improvement of LLM accuracy and reliability in scientific and other specialized
+domains. Our findings reveal a non-linear relationship between the context's
+relevancy and the answers' measured quality. In addition, we demonstrate that
+with the correct calibration, it is possible to automate the grading procedure
+-- a finding suggesting that, at least to some degree, the LLMs can be used to
+self-examine the quality of their own performance. Finally, we describe an
+experimental platform that can be seen as a proof-of-concept of the techniques
+described in this work.
+
+
+
+ comment: 11 pages, 6 figures, 4 tables, 3 pages of supplementary material
+
+
+
+
+
+
+ ♻ ☆ A note on the connectedness property of union-free generic sets of
+ partial orders
+
+
+ This short note describes and proves a connectedness property which was
+introduced in Blocher et al. [2023] in the context of data depth functions for
+partial orders. The connectedness property gives a structural insight into
+union-free generic sets. These sets, presented in Blocher et al. [2023], are
+defined by using a closure operator on the set of all partial orders which
+naturally appears within the theory of formal concept analysis. In the language
+of formal concept analysis, the property of connectedness can be vividly
+proven. However, since within Blocher et al. [2023] we did not discuss formal
+concept analysis, we outsourced the proof to this note.
+
+
+
+
+
+
+
+ ♻ ☆ Comparison of two data fusion approaches for land use classification
+
+
+
+
+
+
+
+
+ Martin Cubaud, Arnaud Le Bris, Laurence Jolivet, Ana-Maria Olteanu-Raimond
+
+
+ Accurate land use maps, describing the territory from an anthropic
+utilisation point of view, are useful tools for land management and planning.
+To produce them, the use of optical images alone remains limited. It is
+therefore necessary to make use of several heterogeneous sources, each carrying
+complementary or contradictory information due to their imperfections or their
+different specifications. This study compares two different approaches i.e. a
+pre-classification and a post-classification fusion approach for combining
+several sources of spatial data in the context of land use classification. The
+approaches are applied on authoritative land use data located in the Gers
+department in the southwest of France. Pre-classification fusion, while not
+explicitly modeling imperfections, has the best final results, reaching an
+overall accuracy of 97% and a macro-mean F1 score of 88%.
+
+
+
+
+
+
+
+ ♻ ☆ Finding Order in Chaos: A Novel Data Augmentation Method for Time Series
+ in Contrastive Learning NeurIPS
+
+
+ The success of contrastive learning is well known to be dependent on data
+augmentation. Although the degree of data augmentations has been well
+controlled by utilizing pre-defined techniques in some domains like vision,
+time-series data augmentation is less explored and remains a challenging
+problem due to the complexity of the data generation mechanism, such as the
+intricate mechanism involved in the cardiovascular system. Moreover, there is
+no widely recognized and general time-series augmentation method that can be
+applied across different tasks. In this paper, we propose a novel data
+augmentation method for quasi-periodic time-series tasks that aims to connect
+intra-class samples together, and thereby find order in the latent space. Our
+method builds upon the well-known mixup technique by incorporating a novel
+approach that accounts for the periodic nature of non-stationary time-series.
+Also, by controlling the degree of chaos created by data augmentation, our
+method leads to improved feature representations and performance on downstream
+tasks. We evaluate our proposed method on three time-series tasks, including
+heart rate estimation, human activity recognition, and cardiovascular disease
+detection. Extensive experiments against state-of-the-art methods show that the
+proposed approach outperforms prior works on optimal data generation and known
+data augmentation techniques in the three tasks, reflecting the effectiveness
+of the presented method. Source code:
+https://github.com/eth-siplab/Finding_Order_in_Chaos
+
+
+
+ comment: Published at the Conference on Neural Information Processing Systems
+ (NeurIPS) 2023
+
+
+
+
+
+
+ ♻ ☆ Improving Gradient-Trend Identification: Fast-Adaptive Moment Estimation
+ with Finance-Inspired Triple Exponential Moving Average
+
+
+ The performance improvement of deep networks significantly depends on their
+optimizers. With existing optimizers, precise and efficient recognition of the
+gradients trend remains a challenge. Existing optimizers predominantly adopt
+techniques based on the first-order exponential moving average (EMA), which
+results in noticeable delays that impede the real-time tracking of gradients
+trend and consequently yield sub-optimal performance. To overcome this
+limitation, we introduce a novel optimizer called fast-adaptive moment
+estimation (FAME). Inspired by the triple exponential moving average (TEMA)
+used in the financial domain, FAME leverages the potency of higher-order TEMA
+to improve the precision of identifying gradient trends. TEMA plays a central
+role in the learning process as it actively influences optimization dynamics;
+this role differs from its conventional passive role as a technical indicator
+in financial contexts. Because of the introduction of TEMA into the
+optimization process, FAME can identify gradient trends with higher accuracy
+and fewer lag issues, thereby offering smoother and more consistent responses
+to gradient fluctuations compared to conventional first-order EMA. To study the
+effectiveness of our novel FAME optimizer, we conducted comprehensive
+experiments encompassing six diverse computer-vision benchmarks and tasks,
+spanning detection, classification, and semantic comprehension. We integrated
+FAME into 15 learning architectures and compared its performance with those of
+six popular optimizers. Results clearly showed that FAME is more robust and
+accurate and provides superior performance stability by minimizing noise (i.e.,
+trend fluctuations). Notably, FAME achieves higher accuracy levels in
+remarkably fewer training epochs than its counterparts, clearly indicating its
+significance for optimizing deep networks in computer-vision tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Improving Generalization in Game Agents with Data Augmentation in
+ Imitation Learning
+
+
+ Imitation learning is an effective approach for training game-playing agents
+and, consequently, for efficient game production. However, generalization - the
+ability to perform well in related but unseen scenarios - is an essential
+requirement that remains an unsolved challenge for game AI. Generalization is
+difficult for imitation learning agents because it requires the algorithm to
+take meaningful actions outside of the training distribution. In this paper we
+propose a solution to this challenge. Inspired by the success of data
+augmentation in supervised learning, we augment the training data so the
+distribution of states and actions in the dataset better represents the real
+state-action distribution. This study evaluates methods for combining and
+applying data augmentations to observations, to improve generalization of
+imitation learning agents. It also provides a performance benchmark of these
+augmentations across several 3D environments. These results demonstrate that
+data augmentation is a promising framework for improving generalization in
+imitation learning agents.
+
+
+
+ comment: 8 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged
+ Locomotion
+
+
+
+
+
+
+
+
+ Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang
+
+
+ Robust locomotion control depends on accurate state estimations. However, the
+sensors of most legged robots can only provide partial and noisy observations,
+making the estimation particularly challenging, especially for external states
+like terrain frictions and elevation maps. Inspired by the classical Internal
+Model Control principle, we consider these external states as disturbances and
+introduce Hybrid Internal Model (HIM) to estimate them according to the
+response of the robot. The response, which we refer to as the hybrid internal
+embedding, contains the robot's explicit velocity and implicit stability
+representation, corresponding to two primary goals for locomotion tasks:
+explicitly tracking velocity and implicitly maintaining stability. We use
+contrastive learning to optimize the embedding to be close to the robot's
+successor state, in which the response is naturally embedded. HIM has several
+appealing benefits: It only needs the robot's proprioceptions, i.e., those from
+joint encoders and IMU as observations. It innovatively maintains consistent
+observations between simulation reference and reality that avoids information
+loss in mimicking learning. It exploits batch-level information that is more
+robust to noises and keeps better sample efficiency. It only requires 1 hour of
+training on an RTX 4090 to enable a quadruped robot to traverse any terrain
+under any disturbances. A wealth of real-world experiments demonstrates its
+agility, even in high-difficulty tasks and cases never occurred during the
+training process, revealing remarkable open-world generalizability.
+
+
+
+ comment: Use 1 hour to train a quadruped robot capable of traversing any
+ terrain under any disturbances in the open world, Project Page:
+ https://github.com/OpenRobotLab/HIMLoco
+
+
+
+
+
+
+ ♻ ☆ Unleashing the Power of Graph Data Augmentation on Covariate
+ Distribution Shift
+
+
+ The issue of distribution shifts is emerging as a critical concern in graph
+representation learning. From the perspective of invariant learning and stable
+learning, a recently well-established paradigm for out-of-distribution
+generalization, stable features of the graph are assumed to causally determine
+labels, while environmental features tend to be unstable and can lead to the
+two primary types of distribution shifts. The correlation shift is often caused
+by the spurious correlation between environmental features and labels that
+differs between the training and test data; the covariate shift often stems
+from the presence of new environmental features in test data. However, most
+strategies, such as invariant learning or graph augmentation, typically
+struggle with limited training environments or perturbed stable features, thus
+exposing limitations in handling the problem of covariate shift. To address
+this challenge, we propose a simple-yet-effective data augmentation strategy,
+Adversarial Invariant Augmentation (AIA), to handle the covariate shift on
+graphs. Specifically, given the training data, AIA aims to extrapolate and
+generate new environments, while concurrently preserving the original stable
+features during the augmentation process. Such a design equips the graph
+classification model with an enhanced capability to identify stable features in
+new environments, thereby effectively tackling the covariate shift in data.
+Extensive experiments with in-depth empirical analysis demonstrate the
+superiority of our approach. The implementation codes are publicly available at
+https://github.com/yongduosui/AIA.
+
+
+
+
+
+
+
+ ♻ ☆ Federated Learning While Providing Model as a Service: Joint Training
+ and Inference Optimization
+
+
+ While providing machine learning model as a service to process users'
+inference requests, online applications can periodically upgrade the model
+utilizing newly collected data. Federated learning (FL) is beneficial for
+enabling the training of models across distributed clients while keeping the
+data locally. However, existing work has overlooked the coexistence of model
+training and inference under clients' limited resources. This paper focuses on
+the joint optimization of model training and inference to maximize inference
+performance at clients. Such an optimization faces several challenges. The
+first challenge is to characterize the clients' inference performance when
+clients may partially participate in FL. To resolve this challenge, we
+introduce a new notion of age of model (AoM) to quantify client-side model
+freshness, based on which we use FL's global model convergence error as an
+approximate measure of inference performance. The second challenge is the tight
+coupling among clients' decisions, including participation probability in FL,
+model download probability, and service rates. Toward the challenges, we
+propose an online problem approximation to reduce the problem complexity and
+optimize the resources to balance the needs of model training and inference.
+Experimental results demonstrate that the proposed algorithm improves the
+average inference accuracy by up to 12%.
+
+
+
+ comment: Accepted by IEEE International Conference on Computer Communications
+ (INFOCOM) 2024
+
+
+
+
+
+
+ ♻ ☆ BloombergGPT: A Large Language Model for Finance
+
+
+
+
+
+
+
+
+ Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
+
+
+ The use of NLP in the realm of financial technology is broad and complex,
+with applications ranging from sentiment analysis and named entity recognition
+to question answering. Large Language Models (LLMs) have been shown to be
+effective on a variety of tasks; however, no LLM specialized for the financial
+domain has been reported in literature. In this work, we present BloombergGPT,
+a 50 billion parameter language model that is trained on a wide range of
+financial data. We construct a 363 billion token dataset based on Bloomberg's
+extensive data sources, perhaps the largest domain-specific dataset yet,
+augmented with 345 billion tokens from general purpose datasets. We validate
+BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite
+of internal benchmarks that most accurately reflect our intended usage. Our
+mixed dataset training leads to a model that outperforms existing models on
+financial tasks by significant margins without sacrificing performance on
+general LLM benchmarks. Additionally, we explain our modeling choices, training
+process, and evaluation methodology. We release Training Chronicles (Appendix
+C) detailing our experience in training BloombergGPT.
+
+
+
+ comment: Updated to include Training Chronicles (Appendix C)
+
+
+
+
+
+
+
+ Shangchao Su, Mingzhao Yang, Bin Li, Xiangyang Xue
+
+
+ Federated learning (FL) enables multiple clients to collaboratively train a
+global model without disclosing their data. Previous researches often require
+training the complete model parameters. However, the emergence of powerful
+pre-trained models makes it possible to achieve higher performance with fewer
+learnable parameters in FL. In this paper, we propose a federated adaptive
+prompt tuning algorithm, FedAPT, for multi-domain collaborative image
+classification with powerful foundation models, like CLIP. Compared with direct
+federated prompt tuning, our core idea is to adaptively unlock specific domain
+knowledge for each test sample in order to provide them with personalized
+prompts. To implement this idea, we design an adaptive prompt tuning module,
+which consists of a meta prompt, an adaptive network, and some keys. The server
+randomly generates a set of keys and assigns a unique key to each client. Then
+all clients cooperatively train the global adaptive network and meta prompt
+with the local datasets and the frozen keys. Ultimately, the global aggregation
+model can assign a personalized prompt to CLIP based on the domain features of
+each test sample. We perform extensive experiments on two multi-domain image
+classification datasets across two different settings -- supervised and
+unsupervised. The results show that FedAPT can achieve better performance with
+less than 10\% of the number of parameters of the fully trained model, and the
+global model can perform well in diverse client domains simultaneously.
+
+
+ Through this paper, we introduce a novel driver cognitive load assessment
+dataset, CL-Drive, which contains Electroencephalogram (EEG) signals along with
+other physiological signals such as Electrocardiography (ECG) and Electrodermal
+Activity (EDA) as well as eye tracking data. The data was collected from 21
+subjects while driving in an immersive vehicle simulator, in various driving
+conditions, to induce different levels of cognitive load in the subjects. The
+tasks consisted of 9 complexity levels for 3 minutes each. Each driver reported
+their subjective cognitive load every 10 seconds throughout the experiment. The
+dataset contains the subjective cognitive load recorded as ground truth. In
+this paper, we also provide benchmark classification results for different
+machine learning and deep learning models for both binary and ternary label
+distributions. We followed 2 evaluation criteria namely 10-fold and
+leave-one-subject-out (LOSO). We have trained our models on both hand-crafted
+features as well as on raw data.
+
+
+
+ comment: 16 pages, 9 figures, 11 tables. This work has been accepted to the
+ IEEE Transactions on Intelligent Transportation Systems. \c{opyright} 2023
+ IEEE. Personal use of this material is permitted. Permission from IEEE must
+ be obtained for all other uses
+
+
+
+
+
+
+ ♻ ☆ Can Transformers Learn Sequential Function Classes In Context?
+
+
+
+
+
+
+
+
+ Ryan Campbell, Emma Guo, Evan Hu, Reya Vir, Ethan Hsiao
+
+
+ In-context learning (ICL) has revolutionized the capabilities of transformer
+models in NLP. In our project, we extend the understanding of the mechanisms
+underpinning ICL by exploring whether transformers can learn from sequential,
+non-textual function class data distributions. We introduce a novel sliding
+window sequential function class and employ toy-sized transformers with a GPT-2
+architecture to conduct our experiments. Our analysis indicates that these
+models can indeed leverage ICL when trained on non-textual sequential function
+classes. Additionally, our experiments with randomized y-label sequences
+highlights that transformers retain some ICL capabilities even when the label
+associations are obfuscated. We provide evidence that transformers can reason
+with and understand sequentiality encoded within function classes, as reflected
+by the effective learning of our proposed tasks. Our results also show that the
+performance deteriorated with increasing randomness in the labels, though not
+to the extent one might expect, implying a potential robustness of learned
+sequentiality against label noise. Future research may want to look into how
+previous explanations of transformers, such as induction heads and task
+vectors, relate to sequentiality in ICL in these toy examples. Our
+investigation lays the groundwork for further research into how transformers
+process and perceive sequential data.
+
+
+
+ comment: 8 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ Reversible and irreversible bracket-based dynamics for deep graph neural
+ networks
+
+
+ Recent works have shown that physics-inspired architectures allow the
+training of deep graph neural networks (GNNs) without oversmoothing. The role
+of these physics is unclear, however, with successful examples of both
+reversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena
+producing comparable results despite diametrically opposed mechanisms, and
+further complications arising due to empirical departures from mathematical
+theory. This work presents a series of novel GNN architectures based upon
+structure-preserving bracket-based dynamical systems, which are provably
+guaranteed to either conserve energy or generate positive dissipation with
+increasing depth. It is shown that the theoretically principled framework
+employed here allows for inherently explainable constructions, which
+contextualize departures from theory in current architectures and better
+elucidate the roles of reversibility and irreversibility in network
+performance.
+
+
+
+
+
+
+
+ ♻ ☆ Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate
+ Reward Hacking
+
+
+
+
+
+
+
+
+ Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
+
+
+ Reward models play a key role in aligning language model applications towards
+human preferences. However, this setup creates an incentive for the language
+model to exploit errors in the reward model to achieve high estimated reward, a
+phenomenon often termed \emph{reward hacking}. A natural mitigation is to train
+an ensemble of reward models, aggregating over model outputs to obtain a more
+robust reward estimate. We explore the application of reward ensembles to
+alignment at both training time (through reinforcement learning) and inference
+time (through reranking). First, we show that reward models are
+\emph{underspecified}: reward models that perform similarly in-distribution can
+yield very different rewards when used in alignment, due to distribution shift.
+Second, underspecification results in overoptimization, where alignment to one
+reward model does not improve reward as measured by another reward model
+trained on the same data. Third, overoptimization is mitigated by the use of
+reward ensembles, and ensembles that vary by their \emph{pretraining} seeds
+lead to better generalization than ensembles that differ only by their
+\emph{fine-tuning} seeds, with both outperforming individual reward models.
+However, even pretrain reward ensembles do not eliminate reward hacking: we
+show several qualitative reward hacking phenomena that are not mitigated by
+ensembling because all reward models in the ensemble exhibit similar error
+patterns.
+
+
+
+
+
+
+
+
+ Weigang Lu, Ziyu Guan, Wei Zhao, Yaming Yang, Long Jin
+
+
+ Graph Neural Networks (GNNs) have become mainstream methods for solving the
+semi-supervised node classification problem. However, due to the uneven
+location distribution of labeled nodes in the graph, labeled nodes are only
+accessible to a small portion of unlabeled nodes, leading to the
+\emph{under-reaching} issue. In this study, we firstly reveal under-reaching by
+conducting an empirical investigation on various well-known graphs. Then, we
+demonstrate that under-reaching results in unsatisfactory distribution
+alignment between labeled and unlabeled nodes through systematic experimental
+analysis, significantly degrading GNNs' performance. To tackle under-reaching
+for GNNs, we propose an architecture-agnostic method dubbed NodeMixup. The
+fundamental idea is to (1) increase the reachability of labeled nodes by
+labeled-unlabeled pairs mixup, (2) leverage graph structures via fusing the
+neighbor connections of intra-class node pairs to improve performance gains of
+mixup, and (3) use neighbor label distribution similarity incorporating node
+degrees to determine sampling weights for node mixup. Extensive experiments
+demonstrate the efficacy of NodeMixup in assisting GNNs in handling
+under-reaching. The source code is available at
+\url{https://github.com/WeigangLu/NodeMixup}.
+
+
+
+ comment: Accepted by AAAI-24
+
+
+
+
+
+
+ ♻ ☆ Towards Better Serialization of Tabular Data for Few-shot Classification
+ with Large Language Models
+
+
+ We present a study on the integration of Large Language Models (LLMs) in
+tabular data classification, emphasizing an efficient framework. Building upon
+existing work done in TabLLM (arXiv:2210.10723), we introduce three novel
+serialization techniques, including the standout LaTeX serialization method.
+This method significantly boosts the performance of LLMs in processing
+domain-specific datasets, Our method stands out for its memory efficiency and
+ability to fully utilize complex data structures. Through extensive
+experimentation, including various serialization approaches like feature
+combination and importance, we demonstrate our work's superiority in accuracy
+and efficiency over traditional models.
+
+
+
+ comment: 4 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ Stochastic Bayesian Optimization with Unknown Continuous Context
+ Distribution via Kernel Density Estimation AAAI 2024
+
+
+
+
+
+
+
+
+ Xiaobin Huang, Lei Song, Ke Xue, Chao Qian
+
+
+ Bayesian optimization (BO) is a sample-efficient method and has been widely
+used for optimizing expensive black-box functions. Recently, there has been a
+considerable interest in BO literature in optimizing functions that are
+affected by context variable in the environment, which is uncontrollable by
+decision makers. In this paper, we focus on the optimization of functions'
+expectations over continuous context variable, subject to an unknown
+distribution. To address this problem, we propose two algorithms that employ
+kernel density estimation to learn the probability density function (PDF) of
+continuous context variable online. The first algorithm is simpler, which
+directly optimizes the expectation under the estimated PDF. Considering that
+the estimated PDF may have high estimation error when the true distribution is
+complicated, we further propose the second algorithm that optimizes the
+distributionally robust objective. Theoretical results demonstrate that both
+algorithms have sub-linear Bayesian cumulative regret on the expectation
+objective. Furthermore, we conduct numerical experiments to empirically
+demonstrate the effectiveness of our algorithms.
+
+
+
+
+
+
+
+
+ Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
+
+
+ Existing algorithms for reinforcement learning from human feedback (RLHF) can
+incentivize responses at odds with preferences because they are based on models
+that assume independence of irrelevant alternatives (IIA). The perverse
+incentives induced by IIA give rise to egregious behavior when innovating on
+query formats or learning algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic
+ Embedding
+
+
+
+
+
+
+
+
+ Tongzheng Ren, Zhaolin Ren, Haitong Ma, Na Li, Bo Dai
+
+
+ This paper presents an approach, Spectral Dynamics Embedding Control (SDEC),
+to optimal control for nonlinear stochastic systems. This method leverages an
+infinite-dimensional feature to linearly represent the state-action value
+function and exploits finite-dimensional truncation approximation for practical
+implementation. To characterize the effectiveness of these finite dimensional
+approximations, we provide an in-depth theoretical analysis to characterize the
+approximation error induced by the finite-dimension truncation and statistical
+error induced by finite-sample approximation in both policy evaluation and
+policy optimization. Our analysis includes two prominent kernel approximation
+methods: truncations onto random features and Nystrom features. We also
+empirically test the algorithm and compare the performance with Koopman-based,
+iLQR, and energy-based methods on a few benchmark problems.
+
+
+
+ comment: Compared to v1, added analysis of Nystrom features, more streamlined
+ proofs, and more extensive numerical studies; compared to v2, corrected a
+ small error in ordering of author list
+
+
+
+
+
+
+ ♻ ☆ Transformers à Grande Vitesse
+
+
+
+
+
+
+
+
+ Farid Arthaud, Guillaume Lecoeur, Alban Pierre
+
+
+ Robust travel time predictions are of prime importance in managing any
+transportation infrastructure, and particularly in rail networks where they
+have major impacts both on traffic regulation and passenger satisfaction. We
+aim at predicting the travel time of trains on rail sections at the scale of an
+entire rail network in real-time, by estimating trains' delays relative to a
+theoretical circulation plan.
+ Predicting the evolution of a given train's delay is a uniquely hard problem,
+distinct from mainstream road traffic forecasting problems, since it involves
+several hard-to-model phenomena: train spacing, station congestion and
+heterogeneous rolling stock among others. We first offer empirical evidence of
+the previously unexplored phenomenon of delay propagation at the scale of a
+railway network, leading to delays being amplified by interactions between
+trains and the network's physical limitations.
+ We then contribute a novel technique using the transformer architecture and
+pre-trained embeddings to make real-time massively parallel predictions for
+train delays at the scale of the whole rail network (over 3000 trains at peak
+hours, making predictions at an average horizon of 70 minutes). Our approach
+yields very positive results on real-world data when compared to currently-used
+and experimental prediction techniques.
+
+
+
+ comment: 10 pages including 1 page of appendices, 5 figures. Presented at
+ IAROR RailBelgrade 2023 and published in Journal of Rail Transport P&M
+
+ In this paper, we study the collaborative learning model, which concerns the
+tradeoff between parallelism and communication overhead in multi-agent
+multi-armed bandits. For regret minimization in multi-armed bandits, we present
+the first set of tradeoffs between the number of rounds of communication among
+the agents and the regret of the collaborative learning process.
+
+
+
+
+
+
+
+
+ Mingtian Zhang, Alex Hawkins-Hooker, Brooks Paige, David Barber
+
+
+ Energy-Based Models (EBMs) offer a versatile framework for modeling complex
+data distributions. However, training and sampling from EBMs continue to pose
+significant challenges. The widely-used Denoising Score Matching (DSM) method
+for scalable EBM training suffers from inconsistency issues, causing the energy
+model to learn a `noisy' data distribution. In this work, we propose an
+efficient sampling framework: (pseudo)-Gibbs sampling with moment matching,
+which enables effective sampling from the underlying clean model when given a
+`noisy' model that has been well-trained via DSM. We explore the benefits of
+our approach compared to related methods and demonstrate how to scale the
+method to high-dimensional datasets.
+
+
+
+
+
+
+
+ ♻ ☆ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable
+ Generalizable 3D Reconstruction
+
+
+
+
+
+
+
+
+ David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
+
+
+ We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D
+radiance fields parameterized by 3D Gaussian primitives from pairs of images.
+Our model features real-time and memory-efficient rendering for scalable
+training as well as fast 3D reconstruction at inference time. To overcome local
+minima inherent to sparse and locally supported representations, we predict a
+dense probability distribution over 3D and sample Gaussian means from that
+probability distribution. We make this sampling operation differentiable via a
+reparameterization trick, allowing us to back-propagate gradients through the
+Gaussian splatting representation. We benchmark our method on wide-baseline
+novel view synthesis on the real-world RealEstate10k and ACID datasets, where
+we outperform state-of-the-art light field transformers and accelerate
+rendering by 2.5 orders of magnitude while reconstructing an interpretable and
+editable 3D radiance field.
+
+
+
+
+
+
+
+ ♻ ☆ Shall We Pretrain Autoregressive Language Models with Retrieval? A
+ Comprehensive Study EMNLP 2023
+
+
+
+
+
+
+
+
+ Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro
+
+
+ Large decoder-only language models (LMs) can be largely improved in terms of
+perplexity by retrieval (e.g., RETRO), but its impact on text generation
+quality and downstream task accuracy is unclear. Thus, it is still an open
+question: shall we pretrain large autoregressive LMs with retrieval? To answer
+it, we perform a comprehensive study on a scalable pre-trained
+retrieval-augmented LM (i.e., RETRO) compared with standard GPT and
+retrieval-augmented GPT incorporated at fine-tuning or inference stages. We
+first provide the recipe to reproduce RETRO up to 9.5B parameters while
+retrieving a text corpus with 330B tokens. Based on that, we have the following
+novel findings: i) RETRO outperforms GPT on text generation with much less
+degeneration (i.e., repetition), moderately higher factual accuracy, and
+slightly lower toxicity with a nontoxic retrieval database. ii) On the LM
+Evaluation Harness benchmark, RETRO largely outperforms GPT on
+knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore,
+we introduce a simple variant of the model, RETRO++, which largely improves
+open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural
+Question) and significantly outperforms retrieval-augmented GPT in both
+fine-tuning and zero-shot evaluation settings. Our findings highlight the
+promising direction of pretraining autoregressive LMs with retrieval as future
+foundation models. We release our code and model at:
+https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
+
+
+ Understanding object recognition patterns in mice is crucial for advancing
+behavioral neuroscience and has significant implications for human health,
+particularly in the realm of Alzheimer's research. This study is centered on
+the development, application, and evaluation of a state-of-the-art
+computational pipeline designed to analyze such behaviors, specifically
+focusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition
+(SLR) tasks. The pipeline integrates three advanced computational models:
+Any-Maze for initial data collection, DeepLabCut for detailed pose estimation,
+and Convolutional Neural Networks (CNNs) for nuanced behavioral classification.
+Employed across four distinct mouse groups, this pipeline demonstrated high
+levels of accuracy and robustness. Despite certain challenges like video
+quality limitations and the need for manual calculations, the results affirm
+the pipeline's efficacy and potential for scalability. The study serves as a
+proof of concept for a multidimensional computational approach to behavioral
+neuroscience, emphasizing the pipeline's versatility and readiness for future,
+more complex analyses.
+
+
+
+ comment: Aspects of the paper contain errors, and data in the pipeline must be
+ vetted one more time. More testing is necessary
+
+ We introduce OpenVoice, a versatile voice cloning approach that requires only
+a short audio clip from the reference speaker to replicate their voice and
+generate speech in multiple languages. OpenVoice represents a significant
+advancement in addressing the following open challenges in the field: 1)
+Flexible Voice Style Control. OpenVoice enables granular control over voice
+styles, including emotion, accent, rhythm, pauses, and intonation, in addition
+to replicating the tone color of the reference speaker. The voice styles are
+not directly copied from and constrained by the style of the reference speaker.
+Previous approaches lacked the ability to flexibly manipulate voice styles
+after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves
+zero-shot cross-lingual voice cloning for languages not included in the
+massive-speaker training set. Unlike previous approaches, which typically
+require extensive massive-speaker multi-lingual (MSML) dataset for all
+languages, OpenVoice can clone voices into a new language without any
+massive-speaker training data for that language. OpenVoice is also
+computationally efficient, costing tens of times less than commercially
+available APIs that offer even inferior performance. To foster further research
+in the field, we have made the source code and trained model publicly
+accessible. We also provide qualitative results in our demo website. Prior to
+its public release, our internal version of OpenVoice was used tens of millions
+of times by users worldwide between May and October 2023, serving as the
+backend of MyShell.
+
+
+ Black-box variational inference is widely used in situations where there is
+no proof that its stochastic optimization succeeds. We suggest this is due to a
+theoretical gap in existing stochastic optimization proofs: namely the
+challenge of gradient estimators with unusual noise bounds, and a composite
+non-smooth objective. For dense Gaussian variational families, we observe that
+existing gradient estimators based on reparameterization satisfy a quadratic
+noise bound and give novel convergence guarantees for proximal and projected
+stochastic gradient descent using this bound. This provides rigorous guarantees
+that methods similar to those used in practice converge on realistic inference
+problems.
+
+
+
+ comment: Accepted at NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Decentralized and Privacy-Preserving Learning of Approximate Stackelberg
+ Solutions in Energy Trading Games with Demand Response Aggregators
+
+
+
+
+
+
+
+
+ Styliani I. Kampezidou, Justin Romberg, Kyriakos G. Vamvoudakis, Dimitri N. Mavris
+
+
+ In this work, a novel Stackelberg game theoretic framework is proposed for
+trading energy bidirectionally between the demand-response (DR) aggregator and
+the prosumers. This formulation allows for flexible energy arbitrage and
+additional monetary rewards while ensuring that the prosumers' desired daily
+energy demand is met. Then, a scalable (linear with the number of prosumers),
+decentralized, privacy-preserving algorithm is proposed to find approximate
+equilibria with online sampling and learning of the prosumers' cumulative best
+response, which finds applications beyond this energy game. Moreover, cost
+bounds are provided on the quality of the approximate equilibrium solution.
+Finally, real data from the California day-ahead market and the UC Davis campus
+building energy demands are utilized to demonstrate the efficacy of the
+proposed framework and algorithm.
+
+
+
+ comment: This work has been submitted to the IEEE for possible publication.
+ Copyright may be transferred without notice, after which this version may no
+ longer be accessible
+
+
+
+
+
+
+ ♻ ☆ Two Independent Teachers are Better Role Model
+
+
+
+
+
+
+
+
+ Afifa Khaled, Ahmed A. Mubarak, Kun He
+
+
+ Recent deep learning models have attracted substantial attention in infant
+brain analysis. These models have performed state-of-the-art performance, such
+as semi-supervised techniques (e.g., Temporal Ensembling, mean teacher).
+However, these models depend on an encoder-decoder structure with stacked local
+operators to gather long-range information, and the local operators limit the
+efficiency and effectiveness. Besides, the $MRI$ data contain different tissue
+properties ($TPs$) such as $T1$ and $T2$. One major limitation of these models
+is that they use both data as inputs to the segment process, i.e., the models
+are trained on the dataset once, and it requires much computational and memory
+requirements during inference. In this work, we address the above limitations
+by designing a new deep-learning model, called 3D-DenseUNet, which works as
+adaptable global aggregation blocks in down-sampling to solve the issue of
+spatial information loss. The self-attention module connects the down-sampling
+blocks to up-sampling blocks, and integrates the feature maps in three
+dimensions of spatial and channel, effectively improving the representation
+potential and discriminating ability of the model. Additionally, we propose a
+new method called Two Independent Teachers ($2IT$), that summarizes the model
+weights instead of label predictions. Each teacher model is trained on
+different types of brain data, $T1$ and $T2$, respectively. Then, a fuse model
+is added to improve test accuracy and enable training with fewer parameters and
+labels compared to the Temporal Ensembling method without modifying the network
+architecture. Empirical results demonstrate the effectiveness of the proposed
+method. The code is available at
+https://github.com/AfifaKhaled/Two-Independent-Teachers-are-Better-Role-Model.
+
+
+
+ comment: This manuscript contains 14 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ Better Trees: An empirical study on hyperparameter tuning of
+ classification decision tree induction algorithms
+
+
+
+
+
+
+
+
+ Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André Carlos Ponce de Leon Ferreira de Carvalho
+
+
+ Machine learning algorithms often contain many hyperparameters (HPs) whose
+values affect the predictive performance of the induced models in intricate
+ways. Due to the high number of possibilities for these HP configurations and
+their complex interactions, it is common to use optimization techniques to find
+settings that lead to high predictive performance. However, insights into
+efficiently exploring this vast space of configurations and dealing with the
+trade-off between predictive and runtime performance remain challenging.
+Furthermore, there are cases where the default HPs fit the suitable
+configuration. Additionally, for many reasons, including model validation and
+attendance to new legislation, there is an increasing interest in interpretable
+models, such as those created by the Decision Tree (DT) induction algorithms.
+This paper provides a comprehensive approach for investigating the effects of
+hyperparameter tuning for the two DT induction algorithms most often used, CART
+and C4.5. DT induction algorithms present high predictive performance and
+interpretable classification models, though many HPs need to be adjusted.
+Experiments were carried out with different tuning strategies to induce models
+and to evaluate HPs' relevance using 94 classification datasets from OpenML.
+The experimental results point out that different HP profiles for the tuning of
+each algorithm provide statistically significant improvements in most of the
+datasets for CART, but only in one-third for C4.5. Although different
+algorithms may present different tuning scenarios, the tuning techniques
+generally required few evaluations to find accurate solutions. Furthermore, the
+best technique for all the algorithms was the IRACE. Finally, we found out that
+tuning a specific small subset of HPs is a good alternative for achieving
+optimal predictive performance.
+
+
+
+ comment: 60 pages, 16 figures
+
+
+
+
+
+
+ ♻ ☆ nbi: the Astronomer's Package for Neural Posterior Estimation NeurIPS 2023
+
+
+
+
+
+
+
+
+ Keming Zhang, Joshua S. Bloom, Stéfan van der Walt, Nina Hernitschek
+
+
+ Despite the promise of Neural Posterior Estimation (NPE) methods in
+astronomy, the adaptation of NPE into the routine inference workflow has been
+slow. We identify three critical issues: the need for custom featurizer
+networks tailored to the observed data, the inference inexactness, and the
+under-specification of physical forward models. To address the first two
+issues, we introduce a new framework and open-source software nbi (Neural
+Bayesian Inference), which supports both amortized and sequential NPE. First,
+nbi provides built-in "featurizer" networks with demonstrated efficacy on
+sequential data, such as light curve and spectra, thus obviating the need for
+this customization on the user end. Second, we introduce a modified algorithm
+SNPE-IS, which facilities asymptotically exact inference by using the surrogate
+posterior under NPE only as a proposal distribution for importance sampling.
+These features allow nbi to be applied off-the-shelf to astronomical inference
+problems involving light curves and spectra. We discuss how nbi may serve as an
+effective alternative to existing methods such as Nested Sampling. Our package
+is at https://github.com/kmzzhang/nbi.
+
+
+
+ comment: Update references. Accepted to NeurIPS 2023 Workshop on Deep Learning
+ and Inverse Problems. Initially appeared at ICML 2023 Workshop on Machine
+ Learning for Astrophysics. Code at https://github.com/kmzzhang/nbi
+
+
+
+
+
+
+ ♻ ☆ Coordinating Distributed Example Orders for Provably Accelerated
+ Training NeurIPS 2023
+
+
+
+
+
+
+
+
+ A. Feder Cooper, Wentao Guo, Khiem Pham, Tiancheng Yuan, Charlie F. Ruan, Yucheng Lu, Christopher De Sa
+
+
+ Recent research on online Gradient Balancing (GraB) has revealed that there
+exist permutation-based example orderings for SGD that are guaranteed to
+outperform random reshuffling (RR). Whereas RR arbitrarily permutes training
+examples, GraB leverages stale gradients from prior epochs to order examples --
+achieving a provably faster convergence rate than RR. However, GraB is limited
+by design: while it demonstrates an impressive ability to scale-up training on
+centralized data, it does not naturally extend to modern distributed ML
+workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which
+uses insights from prior work on kernel thinning to translate the benefits of
+provably faster permutation-based example ordering to distributed settings.
+With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate
+over centralized GraB and outperforms distributed RR on a variety of benchmark
+tasks.
+
+
+
+
+
+
+
+
+ Nicholas D. Sidiropoulos, Paris Karakasis, Aritra Konar
+
+
+ We consider the problem of finding the smallest or largest entry of a tensor
+of order N that is specified via its rank decomposition. Stated in a different
+way, we are given N sets of R-dimensional vectors and we wish to select one
+vector from each set such that the sum of the Hadamard product of the selected
+vectors is minimized or maximized. We show that this fundamental tensor problem
+is NP-hard for any tensor rank higher than one, and polynomial-time solvable in
+the rank-one case. We also propose a continuous relaxation and prove that it is
+tight for any rank. For low-enough ranks, the proposed continuous reformulation
+is amenable to low-complexity gradient-based optimization, and we propose a
+suite of gradient-based optimization algorithms drawing from projected gradient
+descent, Frank-Wolfe, or explicit parametrization of the relaxed constraints.
+We also show that our core results remain valid no matter what kind of polyadic
+tensor model is used to represent the tensor of interest, including Tucker,
+HOSVD/MLSVD, tensor train, or tensor ring. Next, we consider the class of
+problems that can be posed as special instances of the problem of interest. We
+show that this class includes the partition problem (and thus all NP-complete
+problems via polynomial-time transformation), integer least squares, integer
+linear programming, integer quadratic programming, sign retrieval (a special
+kind of mixed integer programming / restricted version of phase retrieval), and
+maximum likelihood decoding of parity check codes. We demonstrate promising
+experimental results on a number of hard problems, including state-of-art
+performance in decoding low density parity check codes and general parity check
+codes.
+
+
+
+ comment: 14 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Neural Implicit Manifold Learning for Topology-Aware Density Estimation
+
+
+
+
+
+
+
+
+ Brendan Leigh Ross, Gabriel Loaiza-Ganem, Anthony L. Caterini, Jesse C. Cresswell
+
+
+ Natural data observed in $\mathbb{R}^n$ is often constrained to an
+$m$-dimensional manifold $\mathcal{M}$, where $m < n$. This work focuses on the
+task of building theoretically principled generative models for such data.
+Current generative models learn $\mathcal{M}$ by mapping an $m$-dimensional
+latent variable through a neural network $f_\theta: \mathbb{R}^m \to
+\mathbb{R}^n$. These procedures, which we call pushforward models, incur a
+straightforward limitation: manifolds cannot in general be represented with a
+single parameterization, meaning that attempts to do so will incur either
+computational instability or the inability to learn probability densities
+within the manifold. To remedy this problem, we propose to model $\mathcal{M}$
+as a neural implicit manifold: the set of zeros of a neural network. We then
+learn the probability density within $\mathcal{M}$ with a constrained
+energy-based model, which employs a constrained variant of Langevin dynamics to
+train and sample from the learned manifold. In experiments on synthetic and
+natural data, we show that our model can learn manifold-supported distributions
+with complex topologies more accurately than pushforward models.
+
+
+
+ comment: Accepted to TMLR in 2023. Code:
+ https://github.com/layer6ai-labs/implicit-manifolds
+
+ Multimodal emotion recognition (MMER) is an active research field that aims
+to accurately recognize human emotions by fusing multiple perceptual
+modalities. However, inherent heterogeneity across modalities introduces
+distribution gaps and information redundancy, posing significant challenges for
+MMER. In this paper, we propose a novel fine-grained disentangled
+representation learning (FDRL) framework to address these challenges.
+Specifically, we design modality-shared and modality-private encoders to
+project each modality into modality-shared and modality-private subspaces,
+respectively. In the shared subspace, we introduce a fine-grained alignment
+component to learn modality-shared representations, thus capturing modal
+consistency. Subsequently, we tailor a fine-grained disparity component to
+constrain the private subspaces, thereby learning modality-private
+representations and enhancing their diversity. Lastly, we introduce a
+fine-grained predictor component to ensure that the labels of the output
+representations from the encoders remain unchanged. Experimental results on the
+IEMOCAP dataset show that FDRL outperforms the state-of-the-art methods,
+achieving 78.34% and 79.44% on WAR and UAR, respectively.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
+
+
+
+
+
+
+
+
+ Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
+
+
+ Existing open-vocabulary image segmentation methods require a fine-tuning
+step on mask annotations and/or image-text datasets. Mask labels are
+labor-intensive, which limits the number of categories in segmentation
+datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is
+severely reduced after fine-tuning. However, without fine-tuning, VLMs trained
+under weak image-text supervision tend to make suboptimal mask predictions when
+there are text queries referring to non-existing concepts in the image. To
+alleviate these issues, we introduce a novel recurrent framework that
+progressively filters out irrelevant texts and enhances mask quality without
+training efforts. The recurrent unit is a two-stage segmenter built upon a VLM
+with frozen weights. Thus, our model retains the VLM's broad vocabulary space
+and strengthens its segmentation capability. Experimental results show that our
+method outperforms not only the training-free counterparts, but also those
+fine-tuned with millions of additional data samples, and sets new
+state-of-the-art records for both zero-shot semantic and referring image
+segmentation tasks. Specifically, we improve the current record by 28.8, 16.0,
+and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
+
+
+
+
+
+ ☆ dIR -- Discrete Information Retrieval: Conversational Search over
+ Unstructured (and Structured) Data with Large Language Models
+
+
+
+
+
+
+
+
+ Pablo M. Rodriguez Bertorello, Jean Rodmond Junior Laguerre
+
+
+ Data is stored in both structured and unstructured form. Querying both, to
+power natural language conversations, is a challenge. This paper introduces
+dIR, Discrete Information Retrieval, providing a unified interface to query
+both free text and structured knowledge. Specifically, a Large Language Model
+(LLM) transforms text into expressive representation. After the text is
+extracted into columnar form, it can then be queried via a text-to-SQL Semantic
+Parser, with an LLM converting natural language into SQL. Where desired, such
+conversation may be effected by a multi-step reasoning conversational agent. We
+validate our approach via a proprietary question/answer data set, concluding
+that dIR makes a whole new class of queries on free text possible when compared
+to traditionally fine-tuned dense-embedding-model-based Information Retrieval
+(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR
+can succeed where no other method stands a chance.
+
+
+
+ comment: 8 pages, 5 figures, Association for Computational Linguistics
+
+ We present a framework for robots to learn novel visual concepts and tasks
+via in-situ linguistic interactions with human users. Previous approaches have
+either used large pre-trained visual models to infer novel objects zero-shot,
+or added novel concepts along with their attributes and representations to a
+concept hierarchy. We extend the approaches that focus on learning visual
+concept hierarchies by enabling them to learn novel concepts and solve unseen
+robotics tasks with them. To enable a visual concept learner to solve robotics
+tasks one-shot, we developed two distinct techniques. Firstly, we propose a
+novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which
+augments information of a novel concept to its parent nodes within a concept
+hierarchy. This information propagation allows all concepts in a hierarchy to
+update as novel concepts are taught in a continual learning setting. Secondly,
+we represent a visual task as a scene graph with language annotations, allowing
+us to create novel permutations of a demonstrated task zero-shot in-situ. We
+present two sets of results. Firstly, we compare Hi-Viscont with the baseline
+model (FALCON) on visual question answering(VQA) in three domains. While being
+comparable to the baseline model on leaf level concepts, Hi-Viscont achieves an
+improvement of over 9% on non-leaf concepts on average. We compare our model's
+performance against the baseline FALCON model. Our framework achieves 33%
+improvements in success rate metric, and 19% improvements in the object level
+accuracy compared to the baseline model. With both of these results we
+demonstrate the ability of our model to learn tasks and concepts in a continual
+learning setting on the robot.
+
+
+
+ comment: In Proceedings of The 38th Annual AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+ ☆ DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
+ Weight Factorization
+
+
+ With the tremendous success of large transformer models in natural language
+understanding, down-sizing them for cost-effective deployments has become
+critical. Recent studies have explored the low-rank weight factorization
+techniques which are efficient to train, and apply out-of-the-box to any
+transformer architecture. Unfortunately, the low-rank assumption tends to be
+over-restrictive and hinders the expressiveness of the compressed model. This
+paper proposes, DSFormer, a simple alternative factorization scheme which
+expresses a target weight matrix as the product of a small dense and a
+semi-structured sparse matrix. The resulting approximation is more faithful to
+the weight distribution in transformers and therefore achieves a stronger
+efficiency-accuracy trade-off. Another concern with existing factorizers is
+their dependence on a task-unaware initialization step which degrades the
+accuracy of the resulting model. DSFormer addresses this issue through a novel
+Straight-Through Factorizer (STF) algorithm that jointly learns all the weight
+factorizations to directly maximize the final task accuracy. Extensive
+experiments on multiple natural language understanding benchmarks demonstrate
+that DSFormer obtains up to 40% better compression than the state-of-the-art
+low-rank factorizers, leading semi-structured sparsity baselines and popular
+knowledge distillation approaches. Our approach is also orthogonal to
+mainstream compressors and offers up to 50% additional compression when added
+to popular distilled, layer-shared and quantized transformers. We empirically
+evaluate the benefits of STF over conventional optimization practices.
+
+
+
+ comment: 9 page main paper. 1 page appendix
+
+
+
+
+
+
+ ☆ LlaMaVAE: Guiding Large Language Model Generation via Continuous Latent
+ Sentence Spaces
+
+
+
+
+
+
+
+
+ Yingji Zhang, Danilo S. Carvalho, Ian Pratt-Hartmann, André Freitas
+
+
+ Deep generative neural networks, such as Variational AutoEncoders (VAEs),
+offer an opportunity to better understand and control language models from the
+perspective of sentence-level latent spaces. To combine the controllability of
+VAE latent spaces with the state-of-the-art performance of recent large
+language models (LLMs), we present in this work LlaMaVAE, which combines
+expressive encoder and decoder models (sentenceT5 and LlaMA) with a VAE
+architecture, aiming to provide better text generation control to LLMs. In
+addition, to conditionally guide the VAE generation, we investigate a new
+approach based on flow-based invertible neural networks (INNs) named Invertible
+CVAE. Experimental results reveal that LlaMaVAE can outperform the previous
+state-of-the-art VAE language model, Optimus, across various tasks, including
+language modelling, semantic textual similarity and definition modelling.
+Qualitative analysis on interpolation and traversal experiments also indicates
+an increased degree of semantic clustering and geometric consistency, which
+enables better generation control.
+
+
+
+
+
+
+
+ ☆ HCDIR: End-to-end Hate Context Detection, and Intensity Reduction model
+ for online comments
+
+
+ Warning: This paper contains examples of the language that some people may
+find offensive.
+ Detecting and reducing hateful, abusive, offensive comments is a critical and
+challenging task on social media. Moreover, few studies aim to mitigate the
+intensity of hate speech. While studies have shown that context-level semantics
+are crucial for detecting hateful comments, most of this research focuses on
+English due to the ample datasets available. In contrast, low-resource
+languages, like Indian languages, remain under-researched because of limited
+datasets. Contrary to hate speech detection, hate intensity reduction remains
+unexplored in high-resource and low-resource languages. In this paper, we
+propose a novel end-to-end model, HCDIR, for Hate Context Detection, and Hate
+Intensity Reduction in social media posts. First, we fine-tuned several
+pre-trained language models to detect hateful comments to ascertain the
+best-performing hateful comments detection model. Then, we identified the
+contextual hateful words. Identification of such hateful words is justified
+through the state-of-the-art explainable learning model, i.e., Integrated
+Gradient (IG). Lastly, the Masked Language Modeling (MLM) model has been
+employed to capture domain-specific nuances to reduce hate intensity. We masked
+the 50\% hateful words of the comments identified as hateful and predicted the
+alternative words for these masked terms to generate convincing sentences. An
+optimal replacement for the original hate comments from the feasible sentences
+is preferred. Extensive experiments have been conducted on several recent
+datasets using automatic metric-based evaluation (BERTScore) and thorough human
+evaluation. To enhance the faithfulness in human evaluation, we arranged a
+group of three human annotators with varied expertise.
+
+
+
+
+
+
+
+ ☆ Contextual Code Switching for Machine Translation using Language Models
+
+
+ Large language models (LLMs) have exerted a considerable impact on diverse
+language-related tasks in recent years. Their demonstrated state-of-the-art
+performance is achieved through methodologies such as zero-shot or few-shot
+prompting. These models undergo training on extensive datasets that encompass
+segments of the Internet and subsequently undergo fine-tuning tailored to
+specific tasks. Notably, they exhibit proficiency in tasks such as translation,
+summarization, question answering, and creative writing, even in the absence of
+explicit training for those particular tasks. While they have shown substantial
+improvement in the multilingual tasks their performance in the code switching,
+especially for machine translation remains relatively uncharted. In this paper,
+we present an extensive study on the code switching task specifically for the
+machine translation task comparing multiple LLMs. Our results indicate that
+despite the LLMs having promising results in the certain tasks, the models with
+relatively lesser complexity outperform the multilingual large language models
+in the machine translation task. We posit that the efficacy of multilingual
+large language models in contextual code switching is constrained by their
+training methodologies. In contrast, relatively smaller models, when trained
+and fine-tuned on bespoke datasets, may yield superior results in comparison to
+the majority of multilingual models.
+
+
+ The rampant occurrence of cybersecurity breaches imposes substantial
+limitations on the progress of network infrastructures, leading to compromised
+data, financial losses, potential harm to individuals, and disruptions in
+essential services. The current security landscape demands the urgent
+development of a holistic security assessment solution that encompasses
+vulnerability analysis and investigates the potential exploitation of these
+vulnerabilities as attack paths. In this paper, we propose Prometheus, an
+advanced system designed to provide a detailed analysis of the security posture
+of computing infrastructures. Using user-provided information, such as device
+details and software versions, Prometheus performs a comprehensive security
+assessment. This assessment includes identifying associated vulnerabilities and
+constructing potential attack graphs that adversaries can exploit. Furthermore,
+Prometheus evaluates the exploitability of these attack paths and quantifies
+the overall security posture through a scoring mechanism. The system takes a
+holistic approach by analyzing security layers encompassing hardware, system,
+network, and cryptography. Furthermore, Prometheus delves into the
+interconnections between these layers, exploring how vulnerabilities in one
+layer can be leveraged to exploit vulnerabilities in others. In this paper, we
+present the end-to-end pipeline implemented in Prometheus, showcasing the
+systematic approach adopted for conducting this thorough security analysis.
+
+
+
+
+
+
+
+ ☆ Exploring Multimodal Large Language Models for Radiology Report
+ Error-checking
+
+
+
+
+
+
+
+
+ Jinge Wu, Yunsoo Kim, Eva C. Keller, Jamie Chow, Adam P. Levine, Nikolas Pontikos, Zina Ibrahim, Paul Taylor, Michelle C. Williams, Honghan Wu
+
+
+ This paper proposes one of the first clinical applications of multimodal
+large language models (LLMs) as an assistant for radiologists to check errors
+in their reports. We created an evaluation dataset from two real-world
+radiology datasets (MIMIC-CXR and IU-Xray), with 1,000 subsampled reports each.
+A subset of original reports was modified to contain synthetic errors by
+introducing various type of mistakes. The evaluation contained two difficulty
+levels: SIMPLE for binary error-checking and COMPLEX for identifying error
+types. LLaVA (Large Language and Visual Assistant) variant models, including
+our instruction-tuned model, were used for the evaluation. Additionally, a
+domain expert evaluation was conducted on a small test set. At the SIMPLE
+level, the LLaVA v1.5 model outperformed other publicly available models.
+Instruction tuning significantly enhanced performance by 47.4% and 25.4% on
+MIMIC-CXR and IU-Xray data, respectively. The model also surpassed the domain
+experts accuracy in the MIMIC-CXR dataset by 1.67%. Notably, among the subsets
+(N=21) of the test set where a clinician did not achieve the correct
+conclusion, the LLaVA ensemble mode correctly identified 71.4% of these cases.
+This study marks a promising step toward utilizing multi-modal LLMs to enhance
+diagnostic accuracy in radiology. The ensemble model demonstrated comparable
+performance to clinicians, even capturing errors overlooked by humans.
+Nevertheless, future work is needed to improve the model ability to identify
+the types of inconsistency.
+
+
+
+
+
+
+
+ ☆ In Generative AI we Trust: Can Chatbots Effectively Verify Political
+ Information?
+
+
+
+
+
+
+
+
+ Elizaveta Kuznetsova, Mykola Makhortykh, Victoria Vziatysheva, Martha Stolze, Ani Baghumyan, Aleksandra Urman
+
+
+ This article presents a comparative analysis of the ability of two large
+language model (LLM)-based chatbots, ChatGPT and Bing Chat, recently rebranded
+to Microsoft Copilot, to detect veracity of political information. We use AI
+auditing methodology to investigate how chatbots evaluate true, false, and
+borderline statements on five topics: COVID-19, Russian aggression against
+Ukraine, the Holocaust, climate change, and LGBTQ+ related debates. We compare
+how the chatbots perform in high- and low-resource languages by using prompts
+in English, Russian, and Ukrainian. Furthermore, we explore the ability of
+chatbots to evaluate statements according to political communication concepts
+of disinformation, misinformation, and conspiracy theory, using
+definition-oriented prompts. We also systematically test how such evaluations
+are influenced by source bias which we model by attributing specific claims to
+various political and social actors. The results show high performance of
+ChatGPT for the baseline veracity evaluation task, with 72 percent of the cases
+evaluated correctly on average across languages without pre-training. Bing Chat
+performed worse with a 67 percent accuracy. We observe significant disparities
+in how chatbots evaluate prompts in high- and low-resource languages and how
+they adapt their evaluations to political communication concepts with ChatGPT
+providing more nuanced outputs than Bing Chat. Finally, we find that for some
+veracity detection-related tasks, the performance of chatbots varied depending
+on the topic of the statement or the source to which it is attributed. These
+findings highlight the potential of LLM-based chatbots in tackling different
+forms of false information in online environments, but also points to the
+substantial variation in terms of how such potential is realized due to
+specific factors, such as language of the prompt or the topic.
+
+
+
+
+
+
+
+
+ Weixuan Wang, Barry Haddow, Alexandra Birch
+
+
+ Knowledge represented in Large Language Models (LLMs) is quite often
+incorrect and can also become obsolete over time. Updating knowledge via
+fine-tuning is computationally resource-hungry and not reliable, and so
+knowledge editing (KE) has developed as an effective and economical alternative
+to inject new knowledge or to fix factual errors in LLMs. Although there has
+been considerable interest in this area, current KE research exclusively
+focuses on the monolingual setting, typically in English. However, what happens
+if the new knowledge is supplied in one language, but we would like to query
+the LLM in a different language? To address the problem of multilingual
+knowledge editing, we propose Retrieval-augmented Multilingual Knowledge Editor
+(ReMaKE) to update new knowledge in LLMs. ReMaKE can perform model-agnostic
+knowledge editing in multilingual settings. ReMaKE concatenates the new
+knowledge retrieved from a multilingual knowledge base with prompts. Our
+experimental results show that ReMaKE outperforms baseline knowledge editing
+methods by a significant margin and is the first KE method to work in a
+multilingual setting. We provide our multilingual knowledge editing dataset
+(MzsRE) in 12 languages, which along with code, and additional project
+information is available at https://github.com/Vicky-Wil/ReMaKE.
+
+
+ Continued pre-training (CP) offers multiple advantages, like target domain
+adaptation and the potential to exploit the continuous stream of unlabeled data
+available online. However, continued pre-training on out-of-domain
+distributions often leads to catastrophic forgetting of previously acquired
+knowledge, leading to sub-optimal ASR performance. This paper presents FusDom,
+a simple and novel methodology for SSL-based continued pre-training. FusDom
+learns speech representations that are robust and adaptive yet not forgetful of
+concepts seen in the past. Instead of solving the SSL pre-text task on the
+output representations of a single model, FusDom leverages two identical
+pre-trained SSL models, a teacher and a student, with a modified pre-training
+head to solve the CP SSL pre-text task. This head employs a cross-attention
+mechanism between the representations of both models while only the student
+receives gradient updates and the teacher does not. Finally, the student is
+fine-tuned for ASR. In practice, FusDom outperforms all our baselines across
+settings significantly, with WER improvements in the range of 0.2 WER - 7.3 WER
+in the target domain while retaining the performance in the earlier domain.
+
+
+
+ comment: Accepted at ICASSP 2024. Code: https://github.com/cs20s030/fusdom
+
+
+
+
+
+
+ ☆ AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and
+ Optimisation
+
+
+
+
+
+
+
+
+ Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, Heming Cui
+
+
+ The advancement of natural language processing (NLP) has been significantly
+boosted by the development of transformer-based large language models (LLMs).
+These models have revolutionized NLP tasks, particularly in code generation,
+aiding developers in creating software with enhanced efficiency. Despite their
+advancements, challenges in balancing code snippet generation with effective
+test case generation and execution persist. To address these issues, this paper
+introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution
+comprising a multi-agent framework with specialized agents: the programmer
+agent, the test designer agent, and the test executor agent. During the coding
+procedure, the programmer agent will focus on the code generation and
+refinement based on the test executor agent's feedback. The test designer agent
+will generate test cases for the generated code, and the test executor agent
+will run the code with the test cases and write the feedback to the programmer.
+This collaborative system ensures robust code generation, surpassing the
+limitations of single-agent models and traditional methodologies. Our extensive
+experiments on 9 code generation models and 12 enhancement approaches showcase
+AgentCoder's superior performance over existing code generation models and
+prompt engineering techniques across various benchmarks. For example,
+AgentCoder achieves 77.4% and 89.1% pass@1 in HumanEval-ET and MBPP-ET with
+GPT-3.5, while SOTA baselines obtain only 69.5% and 63.0%.
+
+
+
+ comment: 21 pages, 12 figures
+
+
+
+
+
+
+ ☆ Machine Mindset: An MBTI Exploration of Large Language Models
+
+
+ We present a novel approach for integrating Myers-Briggs Type Indicator
+(MBTI) personality traits into large language models (LLMs), addressing the
+challenges of personality consistency in personalized AI. Our method, "Machine
+Mindset," involves a two-phase fine-tuning and Direct Preference Optimization
+(DPO) to embed MBTI traits into LLMs. This approach ensures that models
+internalize these traits, offering a stable and consistent personality profile.
+We demonstrate the effectiveness of our models across various domains, showing
+alignment between model performance and their respective MBTI traits. The paper
+highlights significant contributions in the development of personality datasets
+and a new training methodology for personality integration in LLMs, enhancing
+the potential for personalized AI applications. We also open-sourced our model
+and part of the data at \url{https://github.com/PKU-YuanGroup/Machine-Mindset}.
+
+
+
+
+
+
+
+ ☆ Benchmarking and Analyzing In-context Learning, Fine-tuning and
+ Supervised Learning for Biomedical Knowledge Curation: a focused study on
+ chemical entities of biological interest
+
+
+ Automated knowledge curation for biomedical ontologies is key to ensure that
+they remain comprehensive, high-quality and up-to-date. In the era of
+foundational language models, this study compares and analyzes three NLP
+paradigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and
+supervised learning (ML). Using the Chemical Entities of Biological Interest
+(ChEBI) database as a model ontology, three curation tasks were devised. For
+ICL, three prompting strategies were employed with GPT-4, GPT-3.5, BioGPT.
+PubmedBERT was chosen for the FT paradigm. For ML, six embedding models were
+utilized for training Random Forest and Long-Short Term Memory models. Five
+setups were designed to assess ML and FT model performance across different
+data availability scenarios.Datasets for curation tasks included: task 1
+(620,386), task 2 (611,430), and task 3 (617,381), maintaining a 50:50 positive
+versus negative ratio. For ICL models, GPT-4 achieved best accuracy scores of
+0.916, 0.766 and 0.874 for tasks 1-3 respectively. In a direct comparison, ML
+(trained on ~260,000 triples) outperformed ICL in accuracy across all tasks.
+(accuracy differences: +.11, +.22 and +.17). Fine-tuned PubmedBERT performed
+similarly to leading ML models in tasks 1 & 2 (F1 differences: -.014 and
++.002), but worse in task 3 (-.048). Simulations revealed performance declines
+in both ML and FT models with smaller and higher imbalanced training data.
+where ICL (particularly GPT-4) excelled in tasks 1 & 3. GPT-4 excelled in tasks
+1 and 3 with less than 6,000 triples, surpassing ML/FT. ICL underperformed
+ML/FT in task 2.ICL-augmented foundation models can be good assistants for
+knowledge curation with correct prompting, however, not making ML and FT
+paradigms obsolete. The latter two require task-specific data to beat ICL. In
+such cases, ML relies on small pretrained embeddings, minimizing computational
+demands.
+
+
+
+ comment: 26 pages, 5 figures, 14 tables
+
+
+
+
+
+
+ ☆ Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors AAAI 2024
+
+
+
+
+
+
+
+
+ Yi-Fan Zhang, Zhang Zhang, Liang Wang, Rong Jin
+
+
+ To combat the potential misuse of Natural Language Generation (NLG)
+technology, a variety of algorithms have been developed for the detection of
+AI-generated texts. Traditionally, this task is treated as a binary
+classification problem. Although supervised learning has demonstrated promising
+results, acquiring labeled data for detection purposes poses real-world
+challenges and the risk of overfitting. In an effort to address these issues,
+we delve into the realm of zero-shot machine-generated text detection. Existing
+zero-shot detectors, typically designed for specific tasks or topics, often
+assume uniform testing scenarios, limiting their practicality. In our research,
+we explore various advanced Large Language Models (LLMs) and their specialized
+variants, contributing to this field in several ways. In empirical studies, we
+uncover a significant correlation between topics and detection performance.
+Secondly, we delve into the influence of topic shifts on zero-shot detectors.
+These investigations shed light on the adaptability and robustness of these
+detection methods across diverse topics.
+
+
+
+ comment: 8 pages, 3 figures, AAAI 2024 Workshop on Responsible Language Models
+
+
+
+
+
+
+ ☆ Big Tech influence over AI research revisited: memetic analysis of
+ attribution of ideas to affiliation
+
+
+ There exists a growing discourse around the domination of Big Tech on the
+landscape of artificial intelligence (AI) research, yet our comprehension of
+this phenomenon remains cursory. This paper aims to broaden and deepen our
+understanding of Big Tech's reach and power within AI research. It highlights
+the dominance not merely in terms of sheer publication volume but rather in the
+propagation of new ideas or \textit{memes}. Current studies often oversimplify
+the concept of influence to the share of affiliations in academic papers,
+typically sourced from limited databases such as arXiv or specific academic
+conferences.
+ The main goal of this paper is to unravel the specific nuances of such
+influence, determining which AI ideas are predominantly driven by Big Tech
+entities. By employing network and memetic analysis on AI-oriented paper
+abstracts and their citation network, we are able to grasp a deeper insight
+into this phenomenon. By utilizing two databases: OpenAlex and S2ORC, we are
+able to perform such analysis on a much bigger scale than previous attempts.
+ Our findings suggest, that while Big Tech-affiliated papers are
+disproportionately more cited in some areas, the most cited papers are those
+affiliated with both Big Tech and Academia. Focusing on the most contagious
+memes, their attribution to specific affiliation groups (Big Tech, Academia,
+mixed affiliation) seems to be equally distributed between those three groups.
+This suggests that the notion of Big Tech domination over AI research is
+oversimplified in the discourse.
+ Ultimately, this more nuanced understanding of Big Tech's and Academia's
+influence could inform a more symbiotic alliance between these stakeholders
+which would better serve the dual goals of societal welfare and the scientific
+integrity of AI research.
+
+
+
+
+
+
+
+ ☆ CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks
+ for Chinese Large Language Models AAAI 2024
+
+
+ As an indispensable ingredient of intelligence, commonsense reasoning is
+crucial for large language models (LLMs) in real-world scenarios. In this
+paper, we propose CORECODE, a dataset that contains abundant commonsense
+knowledge manually annotated on dyadic dialogues, to evaluate the commonsense
+reasoning and commonsense conflict detection capabilities of Chinese LLMs. We
+categorize commonsense knowledge in everyday conversations into three
+dimensions: entity, event, and social interaction. For easy and consistent
+annotation, we standardize the form of commonsense knowledge annotation in
+open-domain dialogues as "domain: slot = value". A total of 9 domains and 37
+slots are defined to capture diverse commonsense knowledge. With these
+pre-defined domains and slots, we collect 76,787 commonsense knowledge
+annotations from 19,700 dialogues through crowdsourcing. To evaluate and
+enhance the commonsense reasoning capability for LLMs on the curated dataset,
+we establish a series of dialogue-level reasoning and detection tasks,
+including commonsense knowledge filling, commonsense knowledge generation,
+commonsense conflict phrase detection, domain identification, slot
+identification, and event causal inference. A wide variety of existing
+open-source Chinese LLMs are evaluated with these tasks on our dataset.
+Experimental results demonstrate that these models are not competent to predict
+CORECODE's plentiful reasoning content, and even ChatGPT could only achieve
+0.275 and 0.084 accuracy on the domain identification and slot identification
+tasks under the zero-shot setting. We release the data and codes of CORECODE at
+https://github.com/danshi777/CORECODE to promote commonsense reasoning
+evaluation and study of LLMs in the context of daily conversations.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ Language Resources for Dutch Large Language Modelling
+
+
+ Despite the rapid expansion of types of large language models, there remains
+a notable gap in models specifically designed for the Dutch language. This gap
+is not only a shortage in terms of pretrained Dutch models but also in terms of
+data, and benchmarks and leaderboards. This work provides a small step to
+improve the situation. First, we introduce two fine-tuned variants of the Llama
+2 13B model. We first fine-tuned Llama 2 using Dutch-specific web-crawled data
+and subsequently refined this model further on multiple synthetic instruction
+and chat datasets. These datasets as well as the model weights are made
+available. In addition, we provide a leaderboard to keep track of the
+performance of (Dutch) models on a number of generation tasks, and we include
+results of a number of state-of-the-art models, including our own. Finally we
+provide a critical conclusion on what we believe is needed to push forward
+Dutch language models and the whole eco-system around the models.
+
+
+
+
+
+
+
+ ☆ A Stochastic Analysis of the Linguistic Provenance of English Place
+ Names
+
+
+ In English place name analysis, meanings are often derived from the
+resemblance of roots in place names to topographical features, proper names
+and/or habitation terms in one of the languages that have had an influence on
+English place names. The problem here is that it is sometimes difficult to
+determine the base language to use to interpret the roots. The purpose of this
+paper is to stochastically determine the resemblance between 18799 English
+place names and 84685 place names from Ireland, Scotland, Wales, Denmark,
+Norway, Sweden, France, Germany, the Netherlands and Ancient Rome. Each English
+place name is ranked according to the extent to which it resembles place names
+from the other countries, and this provides a basis for determining the likely
+language to use to interpret the place name. A number of observations can be
+made using the ranking provided. In particular, it is found that `Didlington'
+is the most archetypically English place name in the English sample, and `Anna'
+is the least. Furthermore, it is found that the place names in the non-English
+datasets are most similar to Norwegian place names and least similar to Welsh
+place names.
+
+
+
+
+
+
+
+ ☆ Turning Dust into Gold: Distilling Complex Reasoning Capabilities from
+ LLMs by Leveraging Negative Data AAAI 2024
+
+
+
+
+
+
+
+
+ Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li
+
+
+ Large Language Models (LLMs) have performed well on various reasoning tasks,
+but their inaccessibility and numerous parameters hinder wide application in
+practice. One promising way is distilling the reasoning ability from LLMs to
+small models by the generated chain-of-thought reasoning paths. In some cases,
+however, LLMs may produce incorrect reasoning chains, especially when facing
+complex mathematical problems. Previous studies only transfer knowledge from
+positive samples and drop the synthesized data with wrong answers. In this
+work, we illustrate the merit of negative data and propose a model
+specialization framework to distill LLMs with negative samples besides positive
+ones. The framework consists of three progressive steps, covering from training
+to inference stages, to absorb knowledge from negative data. We conduct
+extensive experiments across arithmetic reasoning tasks to demonstrate the role
+of negative data in distillation from LLM.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
+ Semantic Understanding in Mixed Reality
+
+
+ One key challenge in augmented reality is the placement of virtual content in
+natural locations. Existing automated techniques are only able to work with a
+closed-vocabulary, fixed set of objects. In this paper, we introduce a new
+open-vocabulary method for object placement. Our eight-stage pipeline leverages
+recent advances in segmentation models, vision-language models, and LLMs to
+place any virtual object in any AR camera frame or scene. In a preliminary user
+study, we show that our method performs at least as well as human experts 57%
+of the time.
+
+
+
+ comment: IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
+ 2023
+
+
+
+
+
+
+ ☆ Enhancing Consistency in Multimodal Dialogue System Using LLM with
+ Dialogue Scenario
+
+
+ This paper describes our dialogue system submitted to Dialogue Robot
+Competition 2023. The system's task is to help a user at a travel agency decide
+on a plan for visiting two sightseeing spots in Kyoto City that satisfy the
+user. Our dialogue system is flexible and stable and responds to user
+requirements by controlling dialogue flow according to dialogue scenarios. We
+also improved user satisfaction by introducing motion and speech control based
+on system utterances and user situations. In the preliminary round, our system
+was ranked fifth in the impression evaluation and sixth in the plan evaluation
+among all 12 teams.
+
+
+
+ comment: This paper is part of the proceedings of the Dialogue Robot
+ Competition 2023
+
+
+
+
+
+
+ ☆ MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
+ Language Models AAAI-24
+
+
+
+
+
+
+
+
+ Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
+
+
+ The emergence of various medical large language models (LLMs) in the medical
+domain has highlighted the need for unified evaluation standards, as manual
+evaluation of LLMs proves to be time-consuming and labor-intensive. To address
+this issue, we introduce MedBench, a comprehensive benchmark for the Chinese
+medical domain, comprising 40,041 questions sourced from authentic examination
+exercises and medical reports of diverse branches of medicine. In particular,
+this benchmark is composed of four key components: the Chinese Medical
+Licensing Examination, the Resident Standardization Training Examination, the
+Doctor In-Charge Qualification Examination, and real-world clinic cases
+encompassing examinations, diagnoses, and treatments. MedBench replicates the
+educational progression and clinical practice experiences of doctors in
+Mainland China, thereby establishing itself as a credible benchmark for
+assessing the mastery of knowledge and reasoning abilities in medical language
+learning models. We perform extensive experiments and conduct an in-depth
+analysis from diverse perspectives, which culminate in the following findings:
+(1) Chinese medical LLMs underperform on this benchmark, highlighting the need
+for significant advances in clinical knowledge and diagnostic precision. (2)
+Several general-domain LLMs surprisingly possess considerable medical
+knowledge. These findings elucidate both the capabilities and limitations of
+LLMs within the context of MedBench, with the ultimate goal of aiding the
+medical research community.
+
+
+ Continued self-supervised (SSL) pre-training for adapting existing SSL models
+to the target domain has shown to be extremely effective for low-resource
+Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a
+simple and novel approach for SSL-based continued pre-training that boosts ASR
+performance in the target domain where both labeled and unlabeled data are
+limited. Stable Distillation employs self-distillation as regularization for
+continued pre-training, alleviating the over-fitting issue, a common problem
+continued pre-training faces when the source and target domains differ.
+Specifically, first, we perform vanilla continued pre-training on an initial
+SSL pre-trained model on the target domain ASR dataset and call it the teacher.
+Next, we take the same initial pre-trained model as a student to perform
+continued pre-training while enforcing its hidden representations to be close
+to that of the teacher (via MSE loss). This student is then used for downstream
+ASR fine-tuning on the target dataset. In practice, Stable Distillation
+outperforms all our baselines by 0.8 - 7 WER when evaluated in various
+experimental settings.
+
+
+
+ comment: Accepted to ICASSP 2024. Code:
+ https://github.com/cs20s030/stable_distillation
+
+
+
+
+
+
+ ☆ Segmenting Messy Text: Detecting Boundaries in Text Derived from
+ Historical Newspaper Images
+
+
+ Text segmentation, the task of dividing a document into sections, is often a
+prerequisite for performing additional natural language processing tasks.
+Existing text segmentation methods have typically been developed and tested
+using clean, narrative-style text with segments containing distinct topics.
+Here we consider a challenging text segmentation task: dividing newspaper
+marriage announcement lists into units of one announcement each. In many cases
+the information is not structured into sentences, and adjacent segments are not
+topically distinct from each other. In addition, the text of the announcements,
+which is derived from images of historical newspapers via optical character
+recognition, contains many typographical errors. As a result, these
+announcements are not amenable to segmentation with existing techniques. We
+present a novel deep learning-based model for segmenting such text and show
+that it significantly outperforms an existing state-of-the-art method on our
+task.
+
+
+
+ comment: 8 pages, 4 figures
+
+
+
+
+
+
+ ☆ Lattice Rescoring Based on Large Ensemble of Complementary Neural
+ Language Models ICASSP 2022
+
+
+ We investigate the effectiveness of using a large ensemble of advanced neural
+language models (NLMs) for lattice rescoring on automatic speech recognition
+(ASR) hypotheses. Previous studies have reported the effectiveness of combining
+a small number of NLMs. In contrast, in this study, we combine up to eight
+NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are
+trained with two different random initialization seeds. We combine these NLMs
+through iterative lattice generation. Since these NLMs work complementarily
+with each other, by combining them one by one at each rescoring iteration,
+language scores attached to given lattice arcs can be gradually refined.
+Consequently, errors of the ASR hypotheses can be gradually reduced. We also
+investigate the effectiveness of carrying over contextual information (previous
+rescoring results) across a lattice sequence of a long speech such as a lecture
+speech. In experiments using a lecture speech corpus, by combining the eight
+NLMs and using context carry-over, we obtained a 24.4% relative word error rate
+reduction from the ASR 1-best baseline. For further comparison, we performed
+simultaneous (i.e., non-iterative) NLM combination and 100-best rescoring using
+the large ensemble of NLMs, which confirmed the advantage of lattice rescoring
+with iterative NLM combination.
+
+
+ Recently, CLIP has found practical utility in the domain of pixel-level
+zero-shot segmentation tasks. The present landscape features two-stage
+methodologies beset by issues such as intricate pipelines and elevated
+computational costs. While current one-stage approaches alleviate these
+concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's
+generalization capacity, they still fall short in fully harnessing CLIP's
+potential for pixel-level unseen class demarcation and precise pixel
+predictions. To further stimulate CLIP's zero-shot dense prediction capability,
+we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from
+image to pixel. Specifically, we initially introduce Spectral Prompt Tuning
+(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow
+layers to capture structural intricacies of images, thereby enhancing
+comprehension of unseen classes. Subsequently, we introduce the Spectral Guided
+Decoder (SGD), utilizing both high and low-frequency information to steer the
+network's spatial focus towards more prominent classification features,
+enabling precise pixel-level prediction outcomes. Through extensive experiments
+on two public datasets, we demonstrate the superiority of our method over
+state-of-the-art approaches, performing well across all classes and
+particularly excelling in handling unseen classes. Code is available
+at:https://github.com/clearxu/SPT.
+
+
+
+ comment: AAAI2024 Accepted
+
+
+
+
+
+
+ ☆ ALMANACS: A Simulatability Benchmark for Language Model Explainability
+
+
+
+
+
+
+
+
+ Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons
+
+
+ How do we measure the efficacy of language model explainability methods?
+While many explainability methods have been developed, they are typically
+evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help
+fill this gap, we present ALMANACS, a language model explainability benchmark.
+ALMANACS scores explainability methods on simulatability, i.e., how well the
+explanations improve behavior prediction on new inputs. The ALMANACS scenarios
+span twelve safety-relevant topics such as ethical reasoning and advanced AI
+behaviors; they have idiosyncratic premises to invoke model-specific behavior;
+and they have a train-test distributional shift to encourage faithful
+explanations. By using another language model to predict behavior based on the
+explanations, ALMANACS is a fully automated benchmark. We use ALMANACS to
+evaluate counterfactuals, rationalizations, attention, and Integrated Gradients
+explanations. Our results are sobering: when averaged across all topics, no
+explanation method outperforms the explanation-free control. We conclude that
+despite modest successes in prior work, developing an explanation method that
+aids simulatability in ALMANACS remains an open challenge.
+
+
+
+ comment: Code is available at
+ https://github.com/edmundmills/ALMANACS}{https://github.com/edmundmills/ALMANACS
+
+
+
+
+
+
+ ☆ ChatFDA: Medical Records Risk Assessment
+
+
+ In healthcare, the emphasis on patient safety and the minimization of medical
+errors cannot be overstated. Despite concerted efforts, many healthcare
+systems, especially in low-resource regions, still grapple with preventing
+these errors effectively. This study explores a pioneering application aimed at
+addressing this challenge by assisting caregivers in gauging potential risks
+derived from medical notes. The application leverages data from openFDA,
+delivering real-time, actionable insights regarding prescriptions. Preliminary
+analyses conducted on the MIMIC-III \cite{mimic} dataset affirm a proof of
+concept highlighting a reduction in medical errors and an amplification in
+patient safety. This tool holds promise for drastically enhancing healthcare
+outcomes in settings with limited resources. To bolster reproducibility and
+foster further research, the codebase underpinning our methodology is
+accessible on
+https://github.com/autonlab/2023.hackAuton/tree/main/prescription_checker. This
+is a submission for the 30th HackAuton CMU.
+
+
+
+
+
+
+
+ ☆ Fine-tuning Large Language Models for Adaptive Machine Translation
+
+
+
+
+
+
+
+
+ Yasmin Moslem, Rejwanul Haque, Andy Way
+
+
+ This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose
+large language model (LLM), for adaptive machine translation (MT). The
+fine-tuning process involves utilising a combination of zero-shot and one-shot
+translation prompts within the medical domain. The primary objective is to
+enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt
+translations to the required domain at inference time. The results,
+particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned
+model, demonstrating quality improvements in both zero-shot and one-shot
+translation scenarios, surpassing Mistral 7B's baseline performance. Notably,
+the fine-tuned Mistral outperforms ChatGPT "gpt-3.5-turbo" in zero-shot
+translation while achieving comparable one-shot translation quality. Moreover,
+the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's
+performance, and its one-shot translation quality surpasses that of NLLB 3.3B.
+These findings emphasise the significance of fine-tuning efficient LLMs like
+Mistral 7B to yield high-quality zero-shot translations comparable to
+task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved
+in one-shot translation are comparable to those of commercial LLMs such as
+ChatGPT. Our experiments demonstrate that, with a relatively small dataset of
+20,000 segments that incorporate a mix of zero-shot and one-shot prompts,
+fine-tuning significantly enhances Mistral's in-context learning ability,
+especially for real-time adaptive MT.
+
+
+
+
+
+
+
+ ☆ Learning and Forgetting Unsafe Examples in Large Language Models
+
+
+
+
+
+
+
+
+ Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
+
+
+ As the number of large language models (LLMs) released to the public grows,
+there is a pressing need to understand the safety implications associated with
+these models learning from third-party custom finetuning data. We explore the
+behavior of LLMs finetuned on noisy custom data containing unsafe content,
+represented by datasets that contain biases, toxicity, and harmfulness, finding
+that while aligned LLMs can readily learn this unsafe content, they also tend
+to forget it more significantly than other examples when subsequently finetuned
+on safer content. Drawing inspiration from the discrepancies in forgetting, we
+introduce the "ForgetFilter" algorithm, which filters unsafe data based on how
+strong the model's forgetting signal is for that data. We demonstrate that the
+ForgetFilter algorithm ensures safety in customized finetuning without
+compromising downstream task performance, unlike sequential safety finetuning.
+ForgetFilter outperforms alternative strategies like replay and moral
+self-correction in curbing LLMs' ability to assimilate unsafe content during
+custom finetuning, e.g. 75% lower than not applying any safety measures and 62%
+lower than using self-correction in toxicity score.
+
+
+
+
+
+
+
+
+ Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran
+
+
+ We propose a novel VQA dataset, based on picture stories designed for
+educating young children, that aims to facilitate comprehensive evaluation and
+characterization of vision-language models on comprehension tasks. Unlike
+current VQA datasets that often focus on fact-based memorization and simple
+reasoning tasks without principled scientific grounding, we collect data
+containing tasks reflecting different levels of comprehension and underlying
+cognitive processes, as laid out in Bloom's Taxonomy, a classic framework
+widely adopted in education research. The proposed BloomVQA dataset can be
+mapped to a hierarchical graph-based representation of visual stories, enabling
+automatic data augmentation and novel measures characterizing model consistency
+across the underlying taxonomy. We demonstrate graded evaluation and
+reliability analysis based on our proposed consistency metrics on
+state-of-the-art vision-language models. Our results suggest that, while
+current models achieve the most gain on low-level comprehension tasks, they
+generally fall short on high-level tasks requiring more advanced comprehension
+and cognitive skills, as 38.0% drop in VQA accuracy is observed comparing
+lowest and highest level tasks. Furthermore, current models show consistency
+patterns misaligned with human comprehension in various scenarios, suggesting
+emergent structures of model behaviors.
+
+
+
+
+
+
+
+
+ Jianheng Huang, Ante Wang, Linfeng Gao, Linfeng Song, Jinsong Su
+
+
+ Leveraging vast and continually updated knowledge from the Internet has been
+considered an important ability for a dialogue system. Therefore, the dialogue
+query generation task is proposed for generating search queries from dialogue
+histories, which will be submitted to a search engine for retrieving relevant
+websites on the Internet. In this regard, previous efforts were devoted to
+collecting conversations with annotated queries and training a query producer
+(QP) via standard supervised learning. However, these studies still face the
+challenges of data scarcity and domain adaptation. To address these issues, in
+this paper, we propose a semi-supervised learning framework -- SemiDQG, to
+improve model performance with unlabeled conversations. Based on the
+observation that the search query is typically related to the topic of dialogue
+response, we train a response-augmented query producer (RA) to provide rich and
+effective training signals for QP. We first apply a similarity-based query
+selection strategy to select high-quality RA-generated pseudo queries, which
+are used to construct pseudo instances for training QP and RA. Then, we adopt
+the REINFORCE algorithm to further enhance QP, with RA-provided rewards as
+fine-grained training signals. Experimental results and in-depth analysis of
+three benchmarks show the effectiveness of our framework in cross-domain and
+low-resource scenarios. Particularly, SemiDQG significantly surpasses ChatGPT
+and competitive baselines. Our code is available at
+\url{https://github.com/DeepLearnXMU/SemiDQG}.
+
+
+
+
+
+
+
+ ☆ Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is
+ Needed?
+
+
+ The vast majority of today's large language models are English-centric,
+having been pretrained predominantly on English text. Yet, in order to meet
+user expectations, models need to be able to respond appropriately in multiple
+languages once deployed in downstream applications. Given limited exposure to
+other languages during pretraining, cross-lingual transfer is important for
+achieving decent performance in non-English settings. In this work, we
+investigate just how much multilinguality is required during finetuning to
+elicit strong cross-lingual generalisation across a range of tasks and target
+languages. We find that, compared to English-only finetuning, multilingual
+instruction tuning with as few as three languages significantly improves a
+model's cross-lingual transfer abilities on generative tasks that assume
+input/output language agreement, while being of less importance for highly
+structured tasks. Our code and data is available at
+https://github.com/ZurichNLP/multilingual-instruction-tuning.
+
+
+
+
+
+
+
+ ☆ Mini-GPTs: Efficient Large Language Models through Contextual Pruning
+
+
+ In AI research, the optimization of Large Language Models (LLMs) remains a
+significant challenge, crucial for advancing the field's practical applications
+and sustainability. Building upon the foundational work of Professor Song Han's
+lab at MIT, this paper introduces a novel approach in developing Mini-GPTs via
+contextual pruning. Our methodology strategically prunes the computational
+architecture of traditional LLMs, like Phi-1.5, focusing on retaining core
+functionalities while drastically reducing model sizes. We employ the technique
+across diverse and complex datasets, including US law, Medical Q&A, Skyrim
+dialogue, English-Taiwanese translation, and Economics articles. The results
+underscore the efficiency and effectiveness of contextual pruning, not merely
+as a theoretical concept but as a practical tool in developing domain-specific,
+resource-efficient LLMs. Contextual pruning is a promising method for building
+domain-specific LLMs, and this research is a building block towards future
+development with more hardware compute, refined fine-tuning, and quantization.
+
+
+ Biologically Inspired Design (BID), or Biomimicry, is a problem-solving
+methodology that applies analogies from nature to solve engineering challenges.
+For example, Speedo engineers designed swimsuits based on shark skin. Finding
+relevant biological solutions for real-world problems poses significant
+challenges, both due to the limited biological knowledge engineers and
+designers typically possess and to the limited BID resources. Existing BID
+datasets are hand-curated and small, and scaling them up requires costly human
+annotations.
+ In this paper, we introduce BARcode (Biological Analogy Retriever), a search
+engine for automatically mining bio-inspirations from the web at scale. Using
+advances in natural language understanding and data programming, BARcode
+identifies potential inspirations for engineering challenges. Our experiments
+demonstrate that BARcode can retrieve inspirations that are valuable to
+engineers and designers tackling real-world problems, as well as recover famous
+historical BID examples. We release data and code; we view BARcode as a step
+towards addressing the challenges that have historically hindered the practical
+application of BID to engineering innovation.
+
+
+
+ comment: To be published in the AAAI 2024 Proceedings Main Track
+
+
+
+
+
+
+ ☆ A General Model for Aggregating Annotations Across Simple, Complex, and
+ Multi-Object Annotation Tasks
+
+
+
+
+
+
+
+
+ Alexander Braylan, Madalyn Marabella, Omar Alonso, Matthew Lease
+
+
+ Human annotations are vital to supervised learning, yet annotators often
+disagree on the correct label, especially as annotation tasks increase in
+complexity. A strategy to improve label quality is to ask multiple annotators
+to label the same item and aggregate their labels. Many aggregation models have
+been proposed for categorical or numerical annotation tasks, but far less work
+has considered more complex annotation tasks involving open-ended,
+multivariate, or structured responses. While a variety of bespoke models have
+been proposed for specific tasks, our work is the first to introduce
+aggregation methods that generalize across many diverse complex tasks,
+including sequence labeling, translation, syntactic parsing, ranking, bounding
+boxes, and keypoints. This generality is achieved by devising a task-agnostic
+method to model distances between labels rather than the labels themselves.
+ This article extends our prior work with investigation of three new research
+questions. First, how do complex annotation properties impact aggregation
+accuracy? Second, how should a task owner navigate the many modeling choices to
+maximize aggregation accuracy? Finally, what diagnoses can verify that
+aggregation models are specified correctly for the given data? To understand
+how various factors impact accuracy and to inform model selection, we conduct
+simulation studies and experiments on real, complex datasets. Regarding
+testing, we introduce unit tests for aggregation models and present a suite of
+such tests to ensure that a given model is not mis-specified and exhibits
+expected behavior.
+ Beyond investigating these research questions above, we discuss the
+foundational concept of annotation complexity, present a new aggregation model
+as a bridge between traditional models and our own, and contribute a new
+semi-supervised learning method for complex label aggregation that outperforms
+prior work.
+
+
+
+
+
+
+
+ ☆ VADIS -- a VAriable Detection, Interlinking and Summarization system ECIR 2024
+
+
+
+
+
+
+
+
+ Yavuz Selim Kartal, Muhammad Ahsan Shahid, Sotaro Takeshita, Tornike Tsereteli, Andrea Zielinski, Benjamin Zapilko, Philipp Mayr
+
+
+ The VADIS system addresses the demand of providing enhanced information
+access in the domain of the social sciences. This is achieved by allowing users
+to search and use survey variables in context of their underlying research data
+and scholarly publications which have been interlinked with each other.
+
+
+
+ comment: It is 4 pages and 2 figures. This paper has recently been accepted by
+ ECIR 2024 Demo Track and this version is the camera-ready version of the
+ paper
+
+
+
+
+
+
+ ☆ Time is Encoded in the Weights of Finetuned Language Models
+
+
+
+
+
+
+
+
+ Kai Nylund, Suchin Gururangan, Noah A. Smith
+
+
+ We present time vectors, a simple tool to customize language models to new
+time periods. Time vectors are created by finetuning a language model on data
+from a single time (e.g., a year or month), and then subtracting the weights of
+the original pretrained model. This vector specifies a direction in weight
+space that, as our experiments show, improves performance on text from that
+time period. Time vectors specialized to adjacent time periods appear to be
+positioned closer together in a manifold. Using this structure, we interpolate
+between time vectors to induce new models that perform better on intervening
+and future time periods, without any additional training. We demonstrate the
+consistency of our findings across different tasks, domains, model sizes, and
+time scales. Our results suggest that time is encoded in the weight space of
+finetuned models.
+
+
+
+
+
+
+
+ ☆ DSPy Assertions: Computational Constraints for Self-Refining Language
+ Model Pipelines
+
+
+ Chaining language model (LM) calls as composable modules is fueling a new
+powerful way of programming. However, ensuring that LMs adhere to important
+constraints remains a key challenge, one often addressed with heuristic "prompt
+engineering". We introduce LM Assertions, a new programming construct for
+expressing computational constraints that LMs should satisfy. We integrate our
+constructs into the recent DSPy programming model for LMs, and present new
+strategies that allow DSPy to compile programs with arbitrary LM Assertions
+into systems that are more reliable and more accurate. In DSPy, LM Assertions
+can be integrated at compile time, via automatic prompt optimization, and/or at
+inference time, via automatic selfrefinement and backtracking. We report on two
+early case studies for complex question answering (QA), in which the LM program
+must iteratively retrieve information in multiple hops and synthesize a
+long-form answer with citations. We find that LM Assertions improve not only
+compliance with imposed rules and guidelines but also enhance downstream task
+performance, delivering intrinsic and extrinsic gains up to 35.7% and 13.3%,
+respectively. Our reference implementation of LM Assertions is integrated into
+DSPy at https://github.com/stanfordnlp/dspy
+
+
+
+ comment: Arnav*, Manish*, Shangyin* contributed equally to this work
+
+
+
+
+
+
+ ☆ WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with
+ Refined Data Generation
+
+
+ Recent work demonstrates that, after being fine-tuned on a high-quality
+instruction dataset, the resulting model can obtain impressive capabilities to
+address a wide range of tasks. However, existing methods for instruction data
+generation often produce duplicate data and are not controllable enough on data
+quality. In this paper, we extend the generalization of instruction tuning by
+classifying the instruction data to 4 code-related tasks and propose a
+LLM-based Generator-Discriminator data process framework to generate diverse,
+high-quality instruction data from open source code. Hence, we introduce
+CodeOcean, a dataset comprising 20,000 instruction instances across 4 universal
+code-related tasks,which is aimed at augmenting the effectiveness of
+instruction tuning and improving the generalization ability of fine-tuned
+model. Subsequently, we present WaveCoder, a fine-tuned Code LLM with
+Widespread And Versatile Enhanced instruction tuning. This model is
+specifically designed for enhancing instruction tuning of Code Language Models
+(LLMs). Our experiments demonstrate that Wavecoder models outperform other
+open-source models in terms of generalization ability across different
+code-related tasks at the same level of fine-tuning scale. Moreover, Wavecoder
+exhibits high efficiency in previous code generation tasks. This paper thus
+offers a significant contribution to the field of instruction data generation
+and fine-tuning models, providing new insights and tools for enhancing
+performance in code-related tasks.
+
+
+
+
+
+
+
+ ♻ ☆ Founder-GPT: Self-play to evaluate the Founder-Idea fit
+
+
+ This research introduces an innovative evaluation method for the
+"founder-idea" fit in early-stage startups, utilizing advanced large language
+model techniques to assess founders' profiles against their startup ideas to
+enhance decision-making. Embeddings, self-play, tree-of-thought, and
+critique-based refinement techniques show early promising results that each
+idea's success patterns are unique and they should be evaluated based on the
+context of the founder's background.
+
+
+
+
+
+
+
+ ♻ ☆ Latency Adjustable Transformer Encoder for Language Understanding
+
+
+ Adjusting the latency, power, and accuracy of natural language understanding
+models is a desirable objective of an efficient architecture. This paper
+proposes an efficient Transformer architecture that adjusts the inference
+computational cost adaptively with a desired inference latency speedup. In
+fine-tuning phase, the proposed method detects less important hidden sequence
+elements (word-vectors) and eliminates them in each encoder layer using a
+proposed Attention Context Contribution (ACC) metric. After the fine-tuning
+phase, with the novel offline-tuning property, the inference latency of the
+model can be adjusted in a wide range of inference speedup selections without
+any further training. The proposed method is applied to the BERT-base and GPT-2
+models for evaluation. Extensive experiments show that most of the word-vectors
+in higher Transformer layers have less contribution to the subsequent layers;
+hence, they can be eliminated to improve the inference latency. Experimental
+results on extensive sentiment analysis, classification, text generation tasks
+and regression benchmarks like GLUE showed that the method is effective in
+various datasets with minimal impact on global context. The proposed method
+mathematically and experimentally improves the inference latency of BERT-base
+and GPT-2 by up to 4.8 and 3.72 times with less than 0.75% accuracy drop and
+passable perplexity on average. The suggested approach posits that in Large
+Language Models (LLMs), although the complete network is necessary for
+training, it can be truncated during the fine-tuning phase.
+
+
+
+
+
+
+
+
+ Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason
+
+
+ We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for
+evaluating language-guided agents navigating in a persistent environment over
+time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the
+agent's memory at the beginning of every episode, testing the ability to
+perform cold-start navigation with no prior information. However, deployed
+robots occupy the same environment for long periods of time. The IVLN paradigm
+addresses this disparity by training and evaluating VLN agents that maintain
+memory across tours of scenes that consist of up to 100 ordered
+instruction-following Room-to-Room (R2R) episodes, each defined by an
+individual language instruction and a target path. We present discrete and
+continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours
+each in 80 indoor scenes. We find that extending the implicit memory of
+high-performing transformer VLN agents is not sufficient for IVLN, but agents
+that build maps can benefit from environment persistence, motivating a renewed
+focus on map-building agents in VLN.
+
+
+
+ comment: Accepted by CVPR 2023
+
+
+
+
+
+
+ ♻ ☆ IndicTrans2: Towards High-Quality and Accessible Machine Translation
+ Models for all 22 Scheduled Indian Languages
+
+
+ India has a rich linguistic landscape with languages from 4 major language
+families spoken by over a billion people. 22 of these languages are listed in
+the Constitution of India (referred to as scheduled languages) are the focus of
+this work. Given the linguistic diversity, high-quality and accessible Machine
+Translation (MT) systems are essential in a country like India. Prior to this
+work, there was (i) no parallel training data spanning all 22 languages, (ii)
+no robust benchmarks covering all these languages and containing content
+relevant to India, and (iii) no existing translation models which support all
+the 22 scheduled languages of India. In this work, we aim to address this gap
+by focusing on the missing pieces required for enabling wide, easy, and open
+access to good machine translation systems for all 22 scheduled Indian
+languages. We identify four key areas of improvement: curating and creating
+larger training datasets, creating diverse and high-quality benchmarks,
+training multilingual models, and releasing models with open access. Our first
+contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
+the largest publicly available parallel corpora for Indic languages. BPCC
+contains a total of 230M bitext pairs, of which a total of 126M were newly
+added, including 644K manually translated sentence pairs created as part of
+this work. Our second contribution is the release of the first n-way parallel
+benchmark covering all 22 Indian languages, featuring diverse domains,
+Indian-origin content, and source-original test sets. Next, we present
+IndicTrans2, the first model to support all 22 languages, surpassing existing
+models on multiple existing and new benchmarks created as a part of this work.
+Lastly, to promote accessibility and collaboration, we release our models and
+associated data with permissive licenses at
+https://github.com/AI4Bharat/IndicTrans2.
+
+
+ This research delves into the intricate landscape of Musculoskeletal Disorder
+(MSD) risk factors, employing a novel fusion of Natural Language Processing
+(NLP) techniques and mode-based ranking methodologies. The primary objective is
+to advance the comprehension of MSD risk factors, their classification, and
+their relative severity, facilitating more targeted preventive and management
+interventions. The study utilizes eight diverse models, integrating pre-trained
+transformers, cosine similarity, and various distance metrics to classify risk
+factors into personal, biomechanical, workplace, psychological, and
+organizational classes. Key findings reveal that the BERT model with cosine
+similarity attains an overall accuracy of 28%, while the sentence transformer,
+coupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a
+flawless accuracy score of 100%. In tandem with the classification efforts, the
+research employs a mode-based ranking approach on survey data to discern the
+severity hierarchy of MSD risk factors. Intriguingly, the rankings align
+precisely with the previous literature, reaffirming the consistency and
+reliability of the approach. ``Working posture" emerges as the most severe risk
+factor, emphasizing the critical role of proper posture in preventing MSDs. The
+collective perceptions of survey participants underscore the significance of
+factors like "Job insecurity," "Effort reward imbalance," and "Poor employee
+facility" in contributing to MSD risks. The convergence of rankings provides
+actionable insights for organizations aiming to reduce the prevalence of MSDs.
+The study concludes with implications for targeted interventions,
+recommendations for improving workplace conditions, and avenues for future
+research.
+
+
+
+
+
+
+
+ ♻ ☆ How Far Have We Gone in Vulnerability Detection Using Large Language
+ Models
+
+
+ As software becomes increasingly complex and prone to vulnerabilities,
+automated vulnerability detection is critically important, yet challenging.
+Given the significant successes of large language models (LLMs) in various
+tasks, there is growing anticipation of their efficacy in vulnerability
+detection. However, a quantitative understanding of their potential in
+vulnerability detection is still missing. To bridge this gap, we introduce a
+comprehensive vulnerability benchmark VulBench. This benchmark aggregates
+high-quality data from a wide range of CTF (Capture-the-Flag) challenges and
+real-world applications, with annotations for each vulnerable function
+detailing the vulnerability type and its root cause. Through our experiments
+encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models
+and static analyzers, we find that several LLMs outperform traditional deep
+learning approaches in vulnerability detection, revealing an untapped potential
+in LLMs. This work contributes to the understanding and utilization of LLMs for
+enhanced software security.
+
+
+
+
+
+
+
+ ♻ ☆ Exploiting Representation Bias for Data Distillation in Abstractive Text
+ Summarization
+
+
+ Abstractive text summarization is surging with the number of training samples
+to cater to the needs of the deep learning models. These models tend to exploit
+the training data representations to attain superior performance by improving
+the quantitative element of the resultant summary. However, increasing the size
+of the training set may not always be the ideal solution to maximize the
+performance, and therefore, a need to revisit the quality of training samples
+and the learning protocol of deep learning models is a must. In this paper, we
+aim to discretize the vector space of the abstractive text summarization models
+to understand the characteristics learned between the input embedding space and
+the models' encoder space. We show that deep models fail to capture the
+diversity of the input space. Further, the distribution of data points on the
+encoder space indicates that an unchecked increase in the training samples does
+not add value; rather, a tear-down of data samples is highly needed to make the
+models focus on variability and faithfulness. We employ clustering techniques
+to learn the diversity of a model's sample space and how data points are mapped
+from the embedding space to the encoder space and vice versa. Further, we
+devise a metric to filter out redundant data points to make the model more
+robust and less data hungry. We benchmark our proposed method using
+quantitative metrics, such as Rouge, and qualitative metrics, such as
+BERTScore, FEQA and Pyramid score. We also quantify the reasons that inhibit
+the models from learning the diversity from the varied input samples.
+
+
+
+
+
+
+
+
+ Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu
+
+
+ Error correction in automatic speech recognition (ASR) aims to correct those
+incorrect words in sentences generated by ASR models. Since recent ASR models
+usually have low word error rate (WER), to avoid affecting originally correct
+tokens, error correction models should only modify incorrect words, and
+therefore detecting incorrect words is important for error correction. Previous
+works on error correction either implicitly detect error words through
+target-source attention or CTC (connectionist temporal classification) loss, or
+explicitly locate specific deletion/substitution/insertion errors. However,
+implicit error detection does not provide clear signal about which tokens are
+incorrect and explicit error detection suffers from low detection accuracy. In
+this paper, we propose SoftCorrect with a soft error detection mechanism to
+avoid the limitations of both explicit and implicit error detection.
+Specifically, we first detect whether a token is correct or not through a
+probability produced by a dedicatedly designed language model, and then design
+a constrained CTC loss that only duplicates the detected incorrect tokens to
+let the decoder focus on the correction of error tokens. Compared with implicit
+error detection with CTC loss, SoftCorrect provides explicit signal about which
+words are incorrect and thus does not need to duplicate every token but only
+incorrect tokens; compared with explicit error detection, SoftCorrect does not
+detect specific deletion/substitution/insertion errors but just leaves it to
+CTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that
+SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming
+previous works by a large margin, while still enjoying fast speed of parallel
+generation.
+
+
+
+ comment: AAAI 2023
+
+
+
+
+
+
+ ♻ ☆ "Paraphrasing The Original Text" Makes High Accuracy Long-Context QA
+
+
+ Although LLMs continue to iterate and improve, most open-source models still
+have a context window of no more than 4k, limiting their ability to handle
+long-context problems. Most existing open-source models for long-context chat
+still lack satisfactory accuracy. To address this issue, I approach it from the
+perspective of training data and theoretically prove that training the
+capability to handle long contexts requires "effective" rather than "long"
+data. Based on this, I propose using the "original text paraphrase" task, and
+successfully extend the context window of the existing model to 32k by a
+low-cost and effective method, achieving extremely high accuracy in
+multi-document-QA and surpassing all existing open-source models of the same
+scale. The model and training data have been open-sourced on
+HuggingFace(https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k) and
+WiseModel(https://wisemodel.cn/models/yuyijiong/Qwen-14b-chat-yarn-32k).
+
+
+
+ comment: Chinese version of this paper can be downloaded from
+ (https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)
+
+
+
+
+
+
+ ♻ ☆ Knowledge Graphs for the Life Sciences: Recent Developments, Challenges
+ and Opportunities
+
+
+
+
+
+
+
+
+ Jiaoyan Chen, Hang Dong, Janna Hastings, Ernesto Jiménez-Ruiz, Vanessa López, Pierre Monnin, Catia Pesquita, Petr Škoda, Valentina Tamma
+
+
+ The term life sciences refers to the disciplines that study living organisms
+and life processes, and include chemistry, biology, medicine, and a range of
+other related disciplines. Research efforts in life sciences are heavily
+data-driven, as they produce and consume vast amounts of scientific data, much
+of which is intrinsically relational and graph-structured.
+ The volume of data and the complexity of scientific concepts and relations
+referred to therein promote the application of advanced knowledge-driven
+technologies for managing and interpreting data, with the ultimate aim to
+advance scientific discovery.
+ In this survey and position paper, we discuss recent developments and
+advances in the use of graph-based technologies in life sciences and set out a
+vision for how these technologies will impact these fields into the future. We
+focus on three broad topics: the construction and management of Knowledge
+Graphs (KGs), the use of KGs and associated technologies in the discovery of
+new knowledge, and the use of KGs in artificial intelligence applications to
+support explanations (explainable AI). We select a few exemplary use cases for
+each topic, discuss the challenges and open research questions within these
+topics, and conclude with a perspective and outlook that summarizes the
+overarching challenges and their potential solutions as a guide for future
+research.
+
+
+
+ comment: 33 pages, 1 figure, camera-ready version, accepted for Transactions
+ on Graph Data and Knowledge (TGDK)
+
+
+
+
+
+
+ ♻ ☆ Separating form and meaning: Using self-consistency to quantify task
+ understanding across multiple senses
+
+
+ At the staggering pace with which the capabilities of large language models
+(LLMs) are increasing, creating future-proof evaluation sets to assess their
+understanding becomes more and more challenging. In this paper, we propose a
+novel paradigm for evaluating LLMs which leverages the idea that correct world
+understanding should be consistent across different (Fregean) senses of the
+same meaning. Accordingly, we measure understanding not in terms of correctness
+but by evaluating consistency across multiple senses that are generated by the
+model itself. We showcase our approach by instantiating a test where the
+different senses are different languages, hence using multilingual
+self-consistency as a litmus test for the model's understanding and
+simultaneously addressing the important topic of multilinguality. Taking one of
+the latest versions of ChatGPT as our object of study, we evaluate multilingual
+consistency for two different tasks across three different languages. We show
+that its multilingual consistency is still lacking, and that its task and world
+understanding are thus not language-independent. As our approach does not
+require any static evaluation corpora in languages other than English, it can
+easily and cheaply be extended to different languages and tasks and could
+become an integral part of future benchmarking efforts.
+
+
+
+
+
+
+
+ ♻ ☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+ ♻ ☆ Benchmarking Large Language Models in Retrieval-Augmented Generation AAAI 2024
+
+
+
+
+
+
+
+
+ Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
+
+
+ Retrieval-Augmented Generation (RAG) is a promising approach for mitigating
+the hallucination of large language models (LLMs). However, existing research
+lacks rigorous evaluation of the impact of retrieval-augmented generation on
+different large language models, which make it challenging to identify the
+potential bottlenecks in the capabilities of RAG for different LLMs. In this
+paper, we systematically investigate the impact of Retrieval-Augmented
+Generation on large language models. We analyze the performance of different
+large language models in 4 fundamental abilities required for RAG, including
+noise robustness, negative rejection, information integration, and
+counterfactual robustness. To this end, we establish Retrieval-Augmented
+Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and
+Chinese. RGB divides the instances within the benchmark into 4 separate
+testbeds based on the aforementioned fundamental abilities required to resolve
+the case. Then we evaluate 6 representative LLMs on RGB to diagnose the
+challenges of current LLMs when applying RAG. Evaluation reveals that while
+LLMs exhibit a certain degree of noise robustness, they still struggle
+significantly in terms of negative rejection, information integration, and
+dealing with false information. The aforementioned assessment outcomes indicate
+that there is still a considerable journey ahead to effectively apply RAG to
+LLMs.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Evaluating the Ripple Effects of Knowledge Editing in Language Models ACL
+
+
+
+
+
+
+
+
+ Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, Mor Geva
+
+
+ Modern language models capture a large body of factual knowledge. However,
+some facts can be incorrectly induced or become obsolete over time, resulting
+in factually incorrect generations. This has led to the development of various
+editing methods that allow updating facts encoded by the model. Evaluation of
+these methods has primarily focused on testing whether an individual fact has
+been successfully injected, and if similar predictions for other subjects have
+not changed. Here we argue that such evaluation is limited, since injecting one
+fact (e.g. ``Jack Depp is the son of Johnny Depp'') introduces a ``ripple
+effect'' in the form of additional facts that the model needs to update
+(e.g.``Jack Depp is the sibling of Lily-Rose Depp''). To address this issue, we
+propose a novel set of evaluation criteria that consider the implications of an
+edit on related facts. Using these criteria, we then construct RippleEdits, a
+diagnostic benchmark of 5K factual edits, capturing a variety of types of
+ripple effects. We evaluate prominent editing methods on RippleEdits, showing
+that current methods fail to introduce consistent changes in the model's
+knowledge. In addition, we find that a simple in-context editing baseline
+obtains the best scores on our benchmark, suggesting a promising research
+direction for model editing.
+
+
+
+ comment: Accepted for publication in Transactions of the Association for
+ Computational Linguistics (TACL), 2024. Author's final version
+
+
+
+
+
+
+ ♻ ☆ Journey to the Center of the Knowledge Neurons: Discoveries of
+ Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons AAAI
+
+
+
+
+
+
+
+
+ Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
+
+
+ Pre-trained language models (PLMs) contain vast amounts of factual knowledge,
+but how the knowledge is stored in the parameters remains unclear. This paper
+delves into the complex task of understanding how factual knowledge is stored
+in multilingual PLMs, and introduces the Architecture-adapted Multilingual
+Integrated Gradients method, which successfully localizes knowledge neurons
+more precisely compared to current methods, and is more universal across
+various architectures and languages. Moreover, we conduct an in-depth
+exploration of knowledge neurons, leading to the following two important
+discoveries: (1) The discovery of Language-Independent Knowledge Neurons, which
+store factual knowledge in a form that transcends language. We design
+cross-lingual knowledge editing experiments, demonstrating that the PLMs can
+accomplish this task based on language-independent neurons; (2) The discovery
+of Degenerate Knowledge Neurons, a novel type of neuron showing that different
+knowledge neurons can store the same fact. Its property of functional overlap
+endows the PLMs with a robust mastery of factual knowledge. We design
+fact-checking experiments, proving that the degenerate knowledge neurons can
+help the PLMs to detect wrong facts. Experiments corroborate these findings,
+shedding light on the mechanisms of factual knowledge storage in multilingual
+PLMs, and contribute valuable insights to the field. The code is available at
+https://github.com/heng840/AMIG.
+
+
+
+ comment: Accepted in the 38th AAAI Conference on Artificial Intelligence (AAAI
+ 2024)
+
+
+
+
+
+
+ ♻ ☆ Compositional Generalization for Multi-label Text Classification: A
+ Data-Augmentation Approach AAAI'24
+
+
+
+
+
+
+
+
+ Yuyang Chai, Zhuang Li, Jiahui Liu, Lei Chen, Fei Li, Donghong Ji, Chong Teng
+
+
+ Despite significant advancements in multi-label text classification, the
+ability of existing models to generalize to novel and seldom-encountered
+complex concepts, which are compositions of elementary ones, remains
+underexplored. This research addresses this gap. By creating unique data splits
+across three benchmarks, we assess the compositional generalization ability of
+existing multi-label text classification models. Our results show that these
+models often fail to generalize to compositional concepts encountered
+infrequently during training, leading to inferior performance on tests with
+these new combinations. To address this, we introduce a data augmentation
+method that leverages two innovative text generation models designed to enhance
+the classification models' capacity for compositional generalization. Our
+experiments show that this data augmentation approach significantly improves
+the compositional generalization capabilities of classification models on our
+benchmarks, with both generation models surpassing other text generation
+baselines.
+
+
+
+ comment: Accepted by AAAI'24
+
+
+
+
+
+
+ ♻ ☆ Safety Analysis in the Era of Large Language Models: A Case Study of
+ STPA using ChatGPT
+
+
+ Can safety analysis make use of Large Language Models (LLMs)? A case study
+explores Systems Theoretic Process Analysis (STPA) applied to Automatic
+Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems
+using ChatGPT. We investigate how collaboration schemes, input semantic
+complexity, and prompt guidelines influence STPA results. Comparative results
+show that using ChatGPT without human intervention may be inadequate due to
+reliability related issues, but with careful design, it may outperform human
+experts. No statistically significant differences are found when varying the
+input semantic complexity or using common prompt guidelines, which suggests the
+necessity for developing domain-specific prompt engineering. We also highlight
+future challenges, including concerns about LLM trustworthiness and the
+necessity for standardisation and regulation in this domain.
+
+
+
+ comment: Under Review
+
+
+
+
+
+
+ ♻ ☆ SEAM: An Integrated Activation-Coupled Model of Sentence Processing and
+ Eye Movements in Reading
+
+
+
+
+
+
+
+
+ Maximilian M. Rabe, Dario Paape, Daniela Mertzen, Shravan Vasishth, Ralf Engbert
+
+
+ Models of eye-movement control during reading, developed largely within
+psychology, usually focus on visual, attentional, lexical, and motor processes
+but neglect post-lexical language processing; by contrast, models of sentence
+comprehension processes, developed largely within psycholinguistics, generally
+focus only on post-lexical language processes. We present a model that combines
+these two research threads, by integrating eye-movement control and sentence
+processing. Developing such an integrated model is extremely challenging and
+computationally demanding, but such an integration is an important step toward
+complete mathematical models of natural language comprehension in reading. We
+combine the SWIFT model of eye-movement control (Seelig et al., 2020,
+doi:10.1016/j.jmp.2019.102313) with key components of the Lewis and Vasishth
+sentence processing model (Lewis & Vasishth, 2005,
+doi:10.1207/s15516709cog0000_25). This integration becomes possible, for the
+first time, due in part to recent advances in successful parameter
+identification in dynamical models, which allows us to investigate profile
+log-likelihoods for individual model parameters. We present a fully implemented
+proof-of-concept model demonstrating how such an integrated model can be
+achieved; our approach includes Bayesian model inference with Markov Chain
+Monte Carlo (MCMC) sampling as a key computational tool. The integrated
+Sentence-Processing and Eye-Movement Activation-Coupled Model (SEAM) can
+successfully reproduce eye movement patterns that arise due to similarity-based
+interference in reading. To our knowledge, this is the first-ever integration
+of a complete process model of eye-movement control with linguistic dependency
+completion processes in sentence comprehension. In future work, this proof of
+concept model will need to be evaluated using a comprehensive set of benchmark
+data.
+
+
+
+
+
+
+
+ ♻ ☆ TRAMS: Training-free Memory Selection for Long-range Language Modeling EMNLP 2023
+
+
+ The Transformer architecture is crucial for numerous AI models, but it still
+faces challenges in long-range language modeling. Though several specific
+transformer architectures have been designed to tackle issues of long-range
+dependencies, existing methods like Transformer-XL are plagued by a high
+percentage of ineffective memories. In this study, we present a plug-and-play
+strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens
+participating in attention calculation based on one simple metric. This
+strategy allows us to keep tokens that are likely to have a high attention
+score with the current queries and ignore the other ones. We have tested our
+approach on the word-level benchmark (WikiText-103) and the character-level
+benchmark (enwik8), and the results indicate an improvement without having
+additional training or adding additional parameters.
+
+
+
+ comment: Findings of EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ The Earth is Flat because...: Investigating LLMs' Belief towards
+ Misinformation via Persuasive Conversation
+
+
+
+
+
+
+
+
+ Rongwu Xu, Brian S. Lin, Shujian Yang, Tianqi Zhang, Weiyan Shi, Tianwei Zhang, Zhixuan Fang, Wei Xu, Han Qiu
+
+
+ Large Language Models (LLMs) encapsulate vast amounts of knowledge but still
+remain vulnerable to external misinformation. Existing research mainly studied
+this susceptibility behavior in a single-turn setting. However, belief can
+change during a multi-turn conversation, especially a persuasive one.
+Therefore, in this study, we delve into LLMs' susceptibility to persuasive
+conversations, particularly on factual questions that they can answer
+correctly. We first curate the Farm (i.e., Fact to Misinform) dataset, which
+contains factual questions paired with systematically generated persuasive
+misinformation. Then, we develop a testing framework to track LLMs' belief
+changes in a persuasive dialogue. Through extensive experiments, we find that
+LLMs' correct beliefs on factual knowledge can be easily manipulated by various
+persuasive strategies.
+
+
+
+ comment: 45 pages
+
+
+
+
+
+
+ ♻ ☆ A Survey of Reasoning with Foundation Models: Concepts, Methodologies,
+ and Outlook
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+
+
+
+
+
+
+ Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, Ji Zhang
+
+
+ Large language models (LLMs) have showcased remarkable capabilities in
+complex reasoning through chain of thought (CoT) prompting. Recently, there has
+been a growing interest in transferring these reasoning abilities from LLMs to
+smaller models. However, achieving both the diversity and consistency in
+rationales presents a challenge. In this paper, we focus on enhancing these two
+aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to
+efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple
+rationales for each question and enforce consistency among the corresponding
+predictions by minimizing the bidirectional KL-divergence between the answer
+distributions. We investigate the effectiveness of MCC-KD with different model
+architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both
+mathematical reasoning and commonsense reasoning benchmarks. The empirical
+results not only confirm MCC-KD's superior performance on in-distribution
+datasets but also highlight its robust generalization ability on
+out-of-distribution datasets.
+
+
+
+ comment: Accepted to ENMLP 2023
+
+
+
+
+
+
+ ♻ ☆ Assessing AI Chatbots Performance in Comprehensive Standardized Test
+ Preparation; A Case Study with GRE
+
+
+ This research paper presents a comprehensive evaluation of the performance of
+three artificial 10 intelligence chatbots: Bing, ChatGPT, and GPT-4, in
+addressing standardized test questions. Graduate record examination, known as
+GRE, serves as a case study in this paper, encompassing both quantitative
+reasoning and verbal skills. A total of 137 quantitative reasoning questions,
+featuring diverse styles and 157 verbal questions categorized into varying
+levels of difficulty (easy, medium, and hard) were administered to assess the
+chatbots' capabilities. This paper provides a detailed examination of the
+results and their implications for the utilization of artificial intelligence
+in standardized test preparation by presenting the performance of each chatbot
+across various skills and styles tested in the exam. Additionally, this paper
+explores the proficiency of artificial intelligence in addressing image-based
+questions and illustrates the uncertainty level of each chatbot. The results
+reveal varying degrees of success across the chatbots, demonstrating the
+influence of model sophistication and training data. GPT-4 emerged as the most
+proficient, especially in complex language understanding tasks, highlighting
+the evolution of artificial intelligence in language comprehension and its
+ability to pass the exam with a high score.
+
+
+
+ comment: 19 Pages, 6 figures, and 6 tables
+
+
+
+
+
+
+ ♻ ☆ Climate Change from Large Language Models
+
+
+ Climate change presents significant challenges to the global community, and
+it is imperative to raise widespread awareness of the climate crisis and
+educate users about low-carbon living. Artificial intelligence, particularly
+large language models (LLMs), have emerged as powerful tools in mitigating the
+climate crisis, leveraging their extensive knowledge, broad user base, and
+natural language interaction capabilities. However, despite the growing body of
+research on climate change, there is a lack of comprehensive assessments of
+climate crisis knowledge within LLMs. This paper aims to resolve this gap by
+proposing an automatic evaluation framework. We employ a hybrid approach to
+data acquisition that combines data synthesis and manual collection to compile
+a diverse set of questions related to the climate crisis. These questions cover
+various aspects of climate change, including its causes, impacts, mitigation
+strategies, and adaptation measures. We then evaluate the model knowledge
+through prompt engineering based on the collected questions and generated
+answers. We propose a set of comprehensive metrics to evaluate the climate
+crisis knowledge, incorporating indicators from 10 different perspectives.
+Experimental results show that our method is effective in evaluating the
+knowledge of LLMs regarding the climate crisis. We evaluate several
+state-of-the-art LLMs and find that their knowledge falls short in terms of
+timeliness.
+
+
+
+
+
+
+
+ ♻ ☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+ Events describe happenings in our world that are of importance. Naturally,
+understanding events mentioned in multimedia content and how they are related
+forms an important way of comprehending our world. Existing literature can
+infer if events across textual and visual (video) domains are identical (via
+grounding) and thus, on the same semantic level. However, grounding fails to
+capture the intricate cross-event relations that exist due to the same events
+being referred to on many semantic levels. For example, in Figure 1, the
+abstract event of "war" manifests at a lower semantic level through subevents
+"tanks firing" (in video) and airplane "shot" (in text), leading to a
+hierarchical, multimodal relationship between the events.
+ In this paper, we propose the task of extracting event hierarchies from
+multimodal (video and text) data to capture how the same event manifests itself
+in different modalities at different semantic levels. This reveals the
+structure of events and is critical to understanding them. To support research
+on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve)
+dataset. Unlike prior video-language datasets, MultiHiEve is composed of news
+video-article pairs, which makes it rich in event hierarchies. We densely
+annotate a part of the dataset to construct the test benchmark. We show the
+limitations of state-of-the-art unimodal and multimodal baselines on this task.
+Further, we address these limitations via a new weakly supervised model,
+leveraging only unannotated video-article pairs from MultiHiEve. We perform a
+thorough evaluation of our proposed method which demonstrates improved
+performance on this task and highlight opportunities for future research.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ PMET: Precise Model Editing in a TransformerAAAI24
+
+
+
+
+
+
+
+
+ Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu
+
+
+ Model editing techniques modify a minor proportion of knowledge in Large
+Language Models (LLMs) at a relatively low cost, which have demonstrated
+notable success. Existing methods assume Transformer Layer (TL) hidden states
+are values of key-value memories of the Feed-Forward Network (FFN). They
+usually optimize the TL hidden states to memorize target knowledge and use it
+to update the weights of the FFN in LLMs. However, the information flow of TL
+hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN,
+and residual connections. Existing methods neglect the fact that the TL hidden
+states contains information not specifically required for FFN. Consequently,
+the performance of model editing decreases. To achieve more precise model
+editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes
+certain general knowledge extraction patterns. This implies that MHSA weights
+do not require updating when new knowledge is introduced. Based on above
+findings, we introduce PMET, which simultaneously optimizes Transformer
+Component (TC, namely MHSA and FFN) hidden states, while only using the
+optimized TC hidden states of FFN to precisely update FFN weights. Our
+experiments demonstrate that PMET exhibits state-of-the-art performance on both
+the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the
+effectiveness of our enhancements, further reinforcing the finding that the
+MHSA encodes certain general knowledge extraction patterns and indicating its
+storage of a small amount of factual knowledge. Our code is available at
+https://github.com/xpq-tech/PMET.
+
+
+
+ comment: Accepted in AAAI24
+
+
+
+
+
+
+ ♻ ☆ Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
+
+
+
+
+
+
+
+
+ Madeleine Grunde-McLaughlin, Michelle S. Lam, Ranjay Krishna, Daniel S. Weld, Jeffrey Heer
+
+
+ LLM chains enable complex tasks by decomposing work into a sequence of
+sub-tasks. Crowdsourcing workflows similarly decompose complex tasks into
+smaller tasks for human crowdworkers. Chains address LLM errors analogously to
+the way crowdsourcing workflows address human error. To characterize
+opportunities for LLM chaining, we survey 107 papers across the crowdsourcing
+and chaining literature to construct a design space for chain development. The
+design space connects an LLM designer's objectives to strategies they can use
+to achieve those objectives, and tactics to implement each strategy. To explore
+how techniques from crowdsourcing may apply to chaining, we adapt crowdsourcing
+workflows to implement LLM chains across three case studies: creating a
+taxonomy, shortening text, and writing a short story. From the design space and
+our case studies, we identify which techniques transfer from crowdsourcing to
+LLM chaining and raise implications for future research and development.
+
+
+
+
+
+
+
+ ♻ ☆ The Short Text Matching Model Enhanced with Knowledge via Contrastive
+ Learning
+
+
+ In recent years, short Text Matching tasks have been widely applied in the
+fields ofadvertising search and recommendation. The difficulty lies in the lack
+of semantic information and word ambiguity caused by the short length of the
+text. Previous works have introduced complement sentences or knowledge bases to
+provide additional feature information. However, these methods have not fully
+interacted between the original sentence and the complement sentence, and have
+not considered the noise issue that may arise from the introduction of external
+knowledge bases. Therefore, this paper proposes a short Text Matching model
+that combines contrastive learning and external knowledge. The model uses a
+generative model to generate corresponding complement sentences and uses the
+contrastive learning method to guide the model to obtain more semantically
+meaningful encoding of the original sentence. In addition, to avoid noise, we
+use keywords as the main semantics of the original sentence to retrieve
+corresponding knowledge words in the knowledge base, and construct a knowledge
+graph. The graph encoding model is used to integrate the knowledge base
+information into the model. Our designed model achieves state-of-the-art
+performance on two publicly available Chinese Text Matching datasets,
+demonstrating the effectiveness of our model.
+
+
+
+ comment: 11 pages,2 figures
+
+
+
+
+
+
+ ♻ ☆ Redefining Digital Health Interfaces with Large Language Models
+
+
+
+
+
+
+
+
+ Fergus Imrie, Paulius Rauba, Mihaela van der Schaar
+
+
+ Digital health tools have the potential to significantly improve the delivery
+of healthcare services. However, their adoption remains comparatively limited
+due, in part, to challenges surrounding usability and trust. Recently, Large
+Language Models (LLMs) have emerged as general-purpose models with the ability
+to process complex information and produce human-quality text, presenting a
+wealth of potential applications in healthcare. Directly applying LLMs in
+clinical settings is not straightforward, with LLMs susceptible to providing
+inconsistent or nonsensical answers. We describe how LLM-based systems can
+utilize external tools to provide a novel interface between clinicians and
+digital technologies. This enhances the utility and practical impact of digital
+healthcare tools and AI models while addressing current issues with using LLM
+in clinical settings such as hallucinations. We illustrate LLM-based interfaces
+with examples from cardiovascular disease and diabetes risk prediction,
+highlighting the benefit compared to traditional interfaces for digital tools.
+
+
+
+
+
+
+
+ ♻ ☆ Consensus, dissensus and synergy between clinicians and specialist
+ foundation models in radiology report generation
+
+
+
+
+
+
+
+
+ Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Yossi Matias, Joelle Barral, Ali Eslami, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
+
+
+ Radiology reports are an instrumental part of modern medicine, informing key
+clinical decisions such as diagnosis and treatment. The worldwide shortage of
+radiologists, however, restricts access to expert care and imposes heavy
+workloads, contributing to avoidable errors and delays in report delivery.
+While recent progress in automated report generation with vision-language
+models offer clear potential in ameliorating the situation, the path to
+real-world adoption has been stymied by the challenge of evaluating the
+clinical quality of AI-generated reports. In this study, we build a
+state-of-the-art report generation system for chest radiographs,
+$\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation
+model on radiology data. To evaluate the quality of the AI-generated reports, a
+group of 16 certified radiologists provide detailed evaluations of AI-generated
+and human written reports for chest X-rays from an intensive care setting in
+the United States and an inpatient setting in India. At least one radiologist
+(out of two per case) preferred the AI report to the ground truth report in
+over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated
+reports that contain errors, the most frequently cited reasons were related to
+the location and finding, whereas for human written reports, most mistakes were
+related to severity and finding. This disparity suggested potential
+complementarity between our AI system and human experts, prompting us to
+develop an assistive scenario in which Flamingo-CXR generates a first-draft
+report, which is subsequently revised by a clinician. This is the first
+demonstration of clinician-AI collaboration for report writing, and the
+resultant reports are assessed to be equivalent or preferred by at least one
+radiologist to reports written by experts alone in 80$\%$ of in-patient cases
+and 60$\%$ of intensive care cases.
+
+
+
+
+
+
+
+ ♻ ☆ Towards Faithful Model Explanation in NLP: A Survey
+
+
+ End-to-end neural Natural Language Processing (NLP) models are notoriously
+difficult to understand. This has given rise to numerous efforts towards model
+explainability in recent years. One desideratum of model explanation is
+faithfulness, i.e. an explanation should accurately represent the reasoning
+process behind the model's prediction. In this survey, we review over 110 model
+explanation methods in NLP through the lens of faithfulness. We first discuss
+the definition and evaluation of faithfulness, as well as its significance for
+explainability. We then introduce recent advances in faithful explanation,
+grouping existing approaches into five categories: similarity-based methods,
+analysis of model-internal structures, backpropagation-based methods,
+counterfactual intervention, and self-explanatory models. For each category, we
+synthesize its representative studies, strengths, and weaknesses. Finally, we
+summarize their common virtues and remaining challenges, and reflect on future
+work directions towards faithful explainability in NLP.
+
+
+
+ comment: Revision round #2 for the Computational Linguistics journal
+
+
+
+
+
+
+ ♻ ☆ ConSequence: Synthesizing Logically Constrained Sequences for Electronic
+ Health Record Generation
+
+
+
+
+
+
+
+
+ Brandon Theodorou, Shrusti Jain, Cao Xiao, Jimeng Sun
+
+
+ Generative models can produce synthetic patient records for analytical tasks
+when real data is unavailable or limited. However, current methods struggle
+with adhering to domain-specific knowledge and removing invalid data. We
+present ConSequence, an effective approach to integrating domain knowledge into
+sequential generative neural network outputs. Our rule-based formulation
+includes temporal aggregation and antecedent evaluation modules, ensured by an
+efficient matrix multiplication formulation, to satisfy hard and soft logical
+constraints across time steps. Existing constraint methods often fail to
+guarantee constraint satisfaction, lack the ability to handle temporal
+constraints, and hinder the learning and computational efficiency of the model.
+In contrast, our approach efficiently handles all types of constraints with
+guaranteed logical coherence. We demonstrate ConSequence's effectiveness in
+generating electronic health records, outperforming competitors in achieving
+complete temporal and spatial constraint satisfaction without compromising
+runtime performance or generative quality. Specifically, ConSequence
+successfully prevents all rule violations while improving the model quality in
+reducing its test perplexity by 5% and incurring less than a 13% slowdown in
+generation speed compared to an unconstrained model.
+
+
+ Fine-tuning large pre-trained language models on downstream tasks has become
+an important paradigm in NLP. However, common practice fine-tunes all of the
+parameters in a pre-trained model, which becomes prohibitive when a large
+number of downstream tasks are present. Therefore, many fine-tuning methods are
+proposed to learn incremental updates of pre-trained weights in a parameter
+efficient way, e.g., low-rank increments. These methods often evenly distribute
+the budget of incremental updates across all pre-trained weight matrices, and
+overlook the varying importance of different weight parameters. As a
+consequence, the fine-tuning performance is suboptimal. To bridge this gap, we
+propose AdaLoRA, which adaptively allocates the parameter budget among weight
+matrices according to their importance score. In particular, AdaLoRA
+parameterizes the incremental updates in the form of singular value
+decomposition. Such a novel approach allows us to effectively prune the
+singular values of unimportant updates, which is essentially to reduce their
+parameter budget but circumvent intensive exact SVD computations. We conduct
+extensive experiments with several pre-trained models on natural language
+processing, question answering, and natural language generation to validate the
+effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable
+improvement over baselines, especially in the low budget settings. Our code is
+publicly available at https://github.com/QingruZhang/AdaLoRA .
+
+
+
+ comment: The 11th International Conference on Learning Representations (ICLR
+ 2023)
+
+
+
+
+
+
+ ♻ ☆ Universal and Transferable Adversarial Attacks on Aligned Language
+ Models
+
+
+
+
+
+
+
+
+ Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
+
+
+ Because "out-of-the-box" large language models are capable of generating a
+great deal of objectionable content, recent work has focused on aligning these
+models in an attempt to prevent undesirable generation. While there has been
+some success at circumventing these measures -- so-called "jailbreaks" against
+LLMs -- these attacks have required significant human ingenuity and are brittle
+in practice. In this paper, we propose a simple and effective attack method
+that causes aligned language models to generate objectionable behaviors.
+Specifically, our approach finds a suffix that, when attached to a wide range
+of queries for an LLM to produce objectionable content, aims to maximize the
+probability that the model produces an affirmative response (rather than
+refusing to answer). However, instead of relying on manual engineering, our
+approach automatically produces these adversarial suffixes by a combination of
+greedy and gradient-based search techniques, and also improves over past
+automatic prompt generation methods.
+ Surprisingly, we find that the adversarial prompts generated by our approach
+are quite transferable, including to black-box, publicly released LLMs.
+Specifically, we train an adversarial attack suffix on multiple prompts (i.e.,
+queries asking for many different types of objectionable content), as well as
+multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting
+attack suffix is able to induce objectionable content in the public interfaces
+to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,
+Pythia, Falcon, and others. In total, this work significantly advances the
+state-of-the-art in adversarial attacks against aligned language models,
+raising important questions about how such systems can be prevented from
+producing objectionable information. Code is available at
+github.com/llm-attacks/llm-attacks.
+
+
+
+
+
+
+
+
+ Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
+
+
+ The human ability to easily solve multimodal tasks in context (i.e., with
+only a few demonstrations or simple instructions), is what current multimodal
+systems have largely struggled to imitate. In this work, we demonstrate that
+the task-agnostic in-context learning capabilities of large multimodal models
+can be significantly enhanced by effective scaling-up. We introduce Emu2, a
+generative multimodal model with 37 billion parameters, trained on large-scale
+multimodal sequences with a unified autoregressive objective. Emu2 exhibits
+strong multimodal in-context learning abilities, even emerging to solve tasks
+that require on-the-fly reasoning, such as visual prompting and object-grounded
+generation. The model sets a new record on multiple multimodal understanding
+tasks in few-shot settings. When instruction-tuned to follow specific
+instructions, Emu2 further achieves new state-of-the-art on challenging tasks
+such as question answering benchmarks for large multimodal models and
+open-ended subject-driven generation. These achievements demonstrate that Emu2
+can serve as a base model and general-purpose interface for a wide range of
+multimodal tasks. Code and models are publicly available to facilitate future
+research.
+
+
+
+
+
+
+
+ ☆ UniSDF: Unifying Neural Representations for High-Fidelity 3D
+ Reconstruction of Complex Scenes with Reflections
+
+
+
+
+
+
+
+
+ Fangjinhua Wang, Marie-Julie Rakotosaona, Michael Niemeyer, Richard Szeliski, Marc Pollefeys, Federico Tombari
+
+
+ Neural 3D scene representations have shown great potential for 3D
+reconstruction from 2D images. However, reconstructing real-world captures of
+complex scenes still remains a challenge. Existing generic 3D reconstruction
+methods often struggle to represent fine geometric details and do not
+adequately model reflective surfaces of large-scale scenes. Techniques that
+explicitly focus on reflective surfaces can model complex and detailed
+reflections by exploiting better reflection parameterizations. However, we
+observe that these methods are often not robust in real unbounded scenarios
+where non-reflective as well as reflective components are present. In this
+work, we propose UniSDF, a general purpose 3D reconstruction method that can
+reconstruct large complex scenes with reflections. We investigate both
+view-based as well as reflection-based color prediction parameterization
+techniques and find that explicitly blending these representations in 3D space
+enables reconstruction of surfaces that are more geometrically accurate,
+especially for reflective surfaces. We further combine this representation with
+a multi-resolution grid backbone that is trained in a coarse-to-fine manner,
+enabling faster reconstructions than prior methods. Extensive experiments on
+object-level datasets DTU, Shiny Blender as well as unbounded datasets Mip-NeRF
+360 and Ref-NeRF real demonstrate that our method is able to robustly
+reconstruct complex large-scale scenes with fine details and reflective
+surfaces. Please see our project page at
+https://fangjinhuawang.github.io/UniSDF.
+
+
+
+
+
+
+
+ ☆ Deep Learning on 3D Neural Fields ICLR 2023
+
+
+
+
+
+
+
+
+ Pierluigi Zama Ramirez, Luca De Luigi, Daniele Sirocchi, Adriano Cardace, Riccardo Spezialetti, Francesco Ballerini, Samuele Salti, Luigi Di Stefano
+
+
+ In recent years, Neural Fields (NFs) have emerged as an effective tool for
+encoding diverse continuous signals such as images, videos, audio, and 3D
+shapes. When applied to 3D data, NFs offer a solution to the fragmentation and
+limitations associated with prevalent discrete representations. However, given
+that NFs are essentially neural networks, it remains unclear whether and how
+they can be seamlessly integrated into deep learning pipelines for solving
+downstream tasks. This paper addresses this research problem and introduces
+nf2vec, a framework capable of generating a compact latent representation for
+an input NF in a single inference pass. We demonstrate that nf2vec effectively
+embeds 3D objects represented by the input NFs and showcase how the resulting
+embeddings can be employed in deep learning pipelines to successfully address
+various tasks, all while processing exclusively NFs. We test this framework on
+several NFs used to represent 3D surfaces, such as unsigned/signed distance and
+occupancy fields. Moreover, we demonstrate the effectiveness of our approach
+with more complex NFs that encompass both geometry and appearance of 3D objects
+such as neural radiance fields.
+
+
+
+ comment: Extended version of the paper "Deep Learning on Implicit Neural
+ Representations of Shapes" that was presented at ICLR 2023. arXiv admin note:
+ text overlap with arXiv:2302.05438
+
+
+
+
+
+
+ ☆ Repaint123: Fast and High-quality One Image to 3D Generation with
+ Progressive Controllable 2D Repainting
+
+
+ Recent one image to 3D generation methods commonly adopt Score Distillation
+Sampling (SDS). Despite the impressive results, there are multiple deficiencies
+including multi-view inconsistency, over-saturated and over-smoothed textures,
+as well as the slow generation speed. To address these deficiencies, we present
+Repaint123 to alleviate multi-view bias as well as texture degradation and
+speed up the generation process. The core idea is to combine the powerful image
+generation capability of the 2D diffusion model and the texture alignment
+ability of the repainting strategy for generating high-quality multi-view
+images with consistency. We further propose visibility-aware adaptive
+repainting strength for overlap regions to enhance the generated image quality
+in the repainting process. The generated high-quality and multi-view consistent
+images enable the use of simple Mean Square Error (MSE) loss for fast 3D
+content generation. We conduct extensive experiments and show that our method
+has a superior ability to generate high-quality 3D content with multi-view
+consistency and fine textures in 2 minutes from scratch. Code is at
+https://github.com/junwuzhang19/repaint123.
+
+
+
+
+
+
+
+ ☆ ClassLIE: Structure- and Illumination-Adaptive Classification for
+ Low-Light Image Enhancement
+
+
+
+
+
+
+
+
+ Zixiang Wei, Yiting Wang, Lichao Sun, Athanasios V. Vasilakos, Lin Wang
+
+
+ Low-light images often suffer from limited visibility and multiple types of
+degradation, rendering low-light image enhancement (LIE) a non-trivial task.
+Some endeavors have been recently made to enhance low-light images using
+convolutional neural networks (CNNs). However, they have low efficiency in
+learning the structural information and diverse illumination levels at the
+local regions of an image. Consequently, the enhanced results are affected by
+unexpected artifacts, such as unbalanced exposure, blur, and color bias. To
+this end, this paper proposes a novel framework, called ClassLIE, that combines
+the potential of CNNs and transformers. It classifies and adaptively learns the
+structural and illumination information from the low-light images in a holistic
+and regional manner, thus showing better enhancement performance. Our framework
+first employs a structure and illumination classification (SIC) module to learn
+the degradation information adaptively. In SIC, we decompose an input image
+into an illumination map and a reflectance map. A class prediction block is
+then designed to classify the degradation information by calculating the
+structure similarity scores on the reflectance map and mean square error on the
+illumination map. As such, each input image can be divided into patches with
+three enhancement difficulty levels. Then, a feature learning and fusion (FLF)
+module is proposed to adaptively learn the feature information with CNNs for
+different enhancement difficulty levels while learning the long-range
+dependencies for the patches in a holistic manner. Experiments on five
+benchmark datasets consistently show our ClassLIE achieves new state-of-the-art
+performance, with 25.74 PSNR and 0.92 SSIM on the LOL dataset.
+
+
+
+
+
+
+
+ ☆ Conditional Image Generation with Pretrained Generative Model
+
+
+ In recent years, diffusion models have gained popularity for their ability to
+generate higher-quality images in comparison to GAN models. However, like any
+other large generative models, these models require a huge amount of data,
+computational resources, and meticulous tuning for successful training. This
+poses a significant challenge, rendering it infeasible for most individuals. As
+a result, the research community has devised methods to leverage pre-trained
+unconditional diffusion models with additional guidance for the purpose of
+conditional image generative. These methods enable conditional image
+generations on diverse inputs and, most importantly, circumvent the need for
+training the diffusion model. In this paper, our objective is to reduce the
+time-required and computational overhead introduced by the addition of guidance
+in diffusion models -- while maintaining comparable image quality. We propose a
+set of methods based on our empirical analysis, demonstrating a reduction in
+computation time by approximately threefold.
+
+
+
+
+
+
+
+ ☆ Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model
+
+
+
+
+
+
+
+
+ Saurabh Saxena, Junhwa Hur, Charles Herrmann, Deqing Sun, David J. Fleet
+
+
+ While methods for monocular depth estimation have made significant strides on
+standard benchmarks, zero-shot metric depth estimation remains unsolved.
+Challenges include the joint modeling of indoor and outdoor scenes, which often
+exhibit significantly different distributions of RGB and depth, and the
+depth-scale ambiguity due to unknown camera intrinsics. Recent work has
+proposed specialized multi-head architectures for jointly modeling indoor and
+outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion
+model, with several advancements such as log-scale depth parameterization to
+enable joint modeling of indoor and outdoor scenes, conditioning on the
+field-of-view (FOV) to handle scale ambiguity and synthetically augmenting FOV
+during training to generalize beyond the limited camera intrinsics in training
+datasets. Furthermore, by employing a more diverse training mixture than is
+common, and an efficient diffusion parameterization, our method, DMD (Diffusion
+for Metric Depth) achieves a 25\% reduction in relative error (REL) on
+zero-shot indoor and 33\% reduction on zero-shot outdoor datasets over the
+current SOTA using only a small number of denoising steps. For an overview see
+https://diffusion-vision.github.io/dmd
+
+
+
+
+
+
+
+ ☆ The role of data embedding in equivariant quantum convolutional neural
+ networks
+
+
+ Geometric deep learning refers to the scenario in which the symmetries of a
+dataset are used to constrain the parameter space of a neural network and thus,
+improve their trainability and generalization. Recently this idea has been
+incorporated into the field of quantum machine learning, which has given rise
+to equivariant quantum neural networks (EQNNs). In this work, we investigate
+the role of classical-to-quantum embedding on the performance of equivariant
+quantum convolutional neural networks (EQCNNs) for the classification of
+images. We discuss the connection between the data embedding method and the
+resulting representation of a symmetry group and analyze how changing
+representation affects the expressibility of an EQCNN. We numerically compare
+the classification accuracy of EQCNNs with three different basis-permuted
+amplitude embeddings to the one obtained from a non-equivariant quantum
+convolutional neural network (QCNN). Our results show that all the EQCNNs
+achieve higher classification accuracy than the non-equivariant QCNN for small
+numbers of training iterations, while for large iterations this improvement
+crucially depends on the used embedding. It is expected that the results of
+this work can be useful to the community for a better understanding of the
+importance of data embedding choice in the context of geometric quantum machine
+learning.
+
+
+
+
+
+
+
+
+ Amit Rozner, Barak Battash, Ofir Lindenbaum, Lior Wolf
+
+
+ We study the problem of performing face verification with an efficient neural
+model $f$. The efficiency of $f$ stems from simplifying the face verification
+problem from an embedding nearest neighbor search into a binary problem; each
+user has its own neural network $f$. To allow information sharing between
+different individuals in the training set, we do not train $f$ directly but
+instead generate the model weights using a hypernetwork $h$. This leads to the
+generation of a compact personalized model for face identification that can be
+deployed on edge devices. Key to the method's success is a novel way of
+generating hard negatives and carefully scheduling the training objectives. Our
+model leads to a substantially small $f$ requiring only 23k parameters and 5M
+floating point operations (FLOPS). We use six face verification datasets to
+demonstrate that our method is on par or better than state-of-the-art models,
+with a significantly reduced number of parameters and computational burden.
+Furthermore, we perform an extensive ablation study to demonstrate the
+importance of each element in our method.
+
+
+
+
+
+
+
+
+ Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov
+
+
+ Diffusion models have gained traction as powerful algorithms for synthesizing
+high-quality images. Central to these algorithms is the diffusion process,
+which maps data to noise according to equations inspired by thermodynamics and
+can significantly impact performance. A widely held assumption is that the ELBO
+objective of a diffusion model is invariant to the noise process (Kingma et
+al.,2021). In this work, we dispel this assumption -- we propose multivariate
+learned adaptive noise (MuLAN), a learned diffusion process that applies
+Gaussian noise at different rates across an image. Our method consists of three
+components -- a multivariate noise schedule, instance-conditional diffusion,
+and auxiliary variables -- which ensure that the learning objective is no
+longer invariant to the choice of the noise schedule as in previous works. Our
+work is grounded in Bayesian inference and casts the learned diffusion process
+as an approximate variational posterior that yields a tighter lower bound on
+marginal likelihood. Empirically, MuLAN sets a new state-of-the-art in density
+estimation on CIFAR-10 and ImageNet compared to classical diffusion. Code is
+available at https://github.com/s-sahoo/MuLAN
+
+
+
+
+
+
+
+
+ Shiu-hong Kao, Jierun Chen, S. H. Gary Chan
+
+
+ Knowledge distillation (KD) has been recognized as an effective tool to
+compress and accelerate models. However, current KD approaches generally suffer
+from an accuracy drop and/or an excruciatingly long distillation process. In
+this paper, we tackle the issue by first providing a new insight into a
+phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which
+makes the conventional end-to-end KD approaches unstable with noisy gradients.
+We then propose StableKD, a novel KD framework that breaks the IBOE and
+achieves more stable optimization. StableKD distinguishes itself through two
+operations: Decomposition and Recomposition, where the former divides a pair of
+teacher and student networks into several blocks for separate distillation, and
+the latter progressively merges them back, evolving towards end-to-end
+distillation. We conduct extensive experiments on CIFAR100, Imagewoof, and
+ImageNet datasets with various teacher-student pairs. Compared to other KD
+approaches, our simple yet effective StableKD greatly boosts the model accuracy
+by 1% ~ 18%, speeds up the convergence up to 10 times, and outperforms them
+with only 40% of the training data.
+
+
+
+
+
+
+
+ ☆ SISMIK for brain MRI: Deep-learning-based motion estimation and
+ model-based motion correction in k-space
+
+
+
+
+
+
+
+
+ Oscar Dabrowski, Jean-Luc Falcone, Antoine Klauser, Julien Songeon, Michel Kocher, Bastien Chopard, François Lazeyras, Sébastien Courvoisier
+
+
+ MRI, a widespread non-invasive medical imaging modality, is highly sensitive
+to patient motion. Despite many attempts over the years, motion correction
+remains a difficult problem and there is no general method applicable to all
+situations. We propose a retrospective method for motion quantification and
+correction to tackle the problem of in-plane rigid-body motion, apt for
+classical 2D Spin-Echo scans of the brain, which are regularly used in clinical
+practice. Due to the sequential acquisition of k-space, motion artifacts are
+well localized. The method leverages the power of deep neural networks to
+estimate motion parameters in k-space and uses a model-based approach to
+restore degraded images to avoid ''hallucinations''. Notable advantages are its
+ability to estimate motion occurring in high spatial frequencies without the
+need of a motion-free reference. The proposed method operates on the whole
+k-space dynamic range and is moderately affected by the lower SNR of higher
+harmonics. As a proof of concept, we provide models trained using supervised
+learning on 600k motion simulations based on motion-free scans of 43 different
+subjects. Generalization performance was tested with simulations as well as
+in-vivo. Qualitative and quantitative evaluations are presented for motion
+parameter estimations and image reconstruction. Experimental results show that
+our approach is able to obtain good generalization performance on simulated
+data and in-vivo acquisitions.
+
+
+ We present a framework for robots to learn novel visual concepts and tasks
+via in-situ linguistic interactions with human users. Previous approaches have
+either used large pre-trained visual models to infer novel objects zero-shot,
+or added novel concepts along with their attributes and representations to a
+concept hierarchy. We extend the approaches that focus on learning visual
+concept hierarchies by enabling them to learn novel concepts and solve unseen
+robotics tasks with them. To enable a visual concept learner to solve robotics
+tasks one-shot, we developed two distinct techniques. Firstly, we propose a
+novel approach, Hi-Viscont(HIerarchical VISual CONcept learner for Task), which
+augments information of a novel concept to its parent nodes within a concept
+hierarchy. This information propagation allows all concepts in a hierarchy to
+update as novel concepts are taught in a continual learning setting. Secondly,
+we represent a visual task as a scene graph with language annotations, allowing
+us to create novel permutations of a demonstrated task zero-shot in-situ. We
+present two sets of results. Firstly, we compare Hi-Viscont with the baseline
+model (FALCON) on visual question answering(VQA) in three domains. While being
+comparable to the baseline model on leaf level concepts, Hi-Viscont achieves an
+improvement of over 9% on non-leaf concepts on average. We compare our model's
+performance against the baseline FALCON model. Our framework achieves 33%
+improvements in success rate metric, and 19% improvements in the object level
+accuracy compared to the baseline model. With both of these results we
+demonstrate the ability of our model to learn tasks and concepts in a continual
+learning setting on the robot.
+
+
+
+ comment: In Proceedings of The 38th Annual AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+
+ Octave Mariotti, Oisin Mac Aodha, Hakan Bilen
+
+
+ Recent progress in self-supervised representation learning has resulted in
+models that are capable of extracting image features that are not only
+effective at encoding image level, but also pixel-level, semantics. These
+features have been shown to be effective for dense visual semantic
+correspondence estimation, even outperforming fully-supervised methods.
+Nevertheless, current self-supervised approaches still fail in the presence of
+challenging image characteristics such as symmetries and repeated parts. To
+address these limitations, we propose a new approach for semantic
+correspondence estimation that supplements discriminative self-supervised
+features with 3D understanding via a weak geometric spherical prior. Compared
+to more involved 3D pipelines, our model only requires weak viewpoint
+information, and the simplicity of our spherical representation enables us to
+inject informative geometric priors into the model during training. We propose
+a new evaluation metric that better accounts for repeated part and
+symmetry-induced mistakes. We present results on the challenging SPair-71k
+dataset, where we show that our approach demonstrates is capable of
+distinguishing between symmetric views and repeated parts across many object
+categories, and also demonstrate that we can generalize to unseen classes on
+the AwA dataset.
+
+
+
+
+
+
+
+ ☆ Brain-Inspired Visual Odometry: Balancing Speed and Interpretability
+ through a System of Systems Approach SC
+
+
+ In this study, we address the critical challenge of balancing speed and
+accuracy while maintaining interpretablity in visual odometry (VO) systems, a
+pivotal aspect in the field of autonomous navigation and robotics. Traditional
+VO systems often face a trade-off between computational speed and the precision
+of pose estimation. To tackle this issue, we introduce an innovative system
+that synergistically combines traditional VO methods with a specifically
+tailored fully connected network (FCN). Our system is unique in its approach to
+handle each degree of freedom independently within the FCN, placing a strong
+emphasis on causal inference to enhance interpretability. This allows for a
+detailed and accurate assessment of relative pose error (RPE) across various
+degrees of freedom, providing a more comprehensive understanding of parameter
+variations and movement dynamics in different environments. Notably, our system
+demonstrates a remarkable improvement in processing speed without compromising
+accuracy. In certain scenarios, it achieves up to a 5% reduction in Root Mean
+Square Error (RMSE), showcasing its ability to effectively bridge the gap
+between speed and accuracy that has long been a limitation in VO research. This
+advancement represents a significant step forward in developing more efficient
+and reliable VO systems, with wide-ranging applications in real-time navigation
+and robotic systems.
+
+
+
+ comment: https://www.american-cse.org/csci2023 is website of conference and
+ conference name is CSCI2023
+
+
+
+
+
+
+
+ Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi
+
+
+ We introduce the Splatter Image, an ultra-fast approach for monocular 3D
+object reconstruction which operates at 38 FPS. Splatter Image is based on
+Gaussian Splatting, which has recently brought real-time rendering, fast
+training, and excellent scaling to multi-view reconstruction. For the first
+time, we apply Gaussian Splatting in a monocular reconstruction setting. Our
+approach is learning-based, and, at test time, reconstruction only requires the
+feed-forward evaluation of a neural network. The main innovation of Splatter
+Image is the surprisingly straightforward design: it uses a 2D image-to-image
+network to map the input image to one 3D Gaussian per pixel. The resulting
+Gaussians thus have the form of an image, the Splatter Image. We further extend
+the method to incorporate more than one image as input, which we do by adding
+cross-view attention. Owning to the speed of the renderer (588 FPS), we can use
+a single GPU for training while generating entire images at each iteration in
+order to optimize perceptual metrics like LPIPS. On standard benchmarks, we
+demonstrate not only fast reconstruction but also better results than recent
+and much more expensive baselines in terms of PSNR, LPIPS, and other metrics.
+
+
+
+
+
+
+
+ ☆ Unleashing Large-Scale Video Generative Pre-training for Visual Robot
+ Manipulation
+
+
+
+
+
+
+
+
+ Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong
+
+
+ Generative pre-trained models have demonstrated remarkable effectiveness in
+language and vision domains by learning useful representations. In this paper,
+we extend the scope of this effectiveness by showing that visual robot
+manipulation can significantly benefit from large-scale video generative
+pre-training. We introduce GR-1, a straightforward GPT-style model designed for
+multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs
+a language instruction, a sequence of observation images, and a sequence of
+robot states. It predicts robot actions as well as future images in an
+end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly
+finetuned on robot data after pre-trained on a large-scale video dataset. We
+perform extensive experiments on the challenging CALVIN benchmark and a real
+robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline
+methods and improves the success rate from 88.9% to 94.9%. In the setting of
+zero-shot unseen scene generalization, GR-1 improves the success rate from
+53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline
+methods and shows strong potentials in generalization to unseen scenes and
+objects. We provide inaugural evidence that a unified GPT-style transformer,
+augmented with large-scale video generative pre-training, exhibits remarkable
+generalization to multi-task visual robot manipulation. Project page:
+https://GR1-Manipulation.github.io
+
+
+
+
+
+
+
+ ☆ Pixel-to-Abundance Translation: Conditional Generative Adversarial
+ Networks Based on Patch Transformer for Hyperspectral Unmixing
+
+
+
+
+
+
+
+
+ Li Wang, Xiaohua Zhang, Longfei Li, Hongyun Meng, Xianghai Cao
+
+
+ Spectral unmixing is a significant challenge in hyperspectral image
+processing. Existing unmixing methods utilize prior knowledge about the
+abundance distribution to solve the regularization optimization problem, where
+the difficulty lies in choosing appropriate prior knowledge and solving the
+complex regularization optimization problem. To solve these problems, we
+propose a hyperspectral conditional generative adversarial network (HyperGAN)
+method as a generic unmixing framework, based on the following assumption: the
+unmixing process from pixel to abundance can be regarded as a transformation of
+two modalities with an internal specific relationship. The proposed HyperGAN is
+composed of a generator and discriminator, the former completes the modal
+conversion from mixed hyperspectral pixel patch to the abundance of
+corresponding endmember of the central pixel and the latter is used to
+distinguish whether the distribution and structure of generated abundance are
+the same as the true ones. We propose hyperspectral image (HSI) Patch
+Transformer as the main component of the generator, which utilize adaptive
+attention score to capture the internal pixels correlation of the HSI patch and
+leverage the spatial-spectral information in a fine-grained way to achieve
+optimization of the unmixing process. Experiments on synthetic data and real
+hyperspectral data achieve impressive results compared to state-of-the-art
+competitors.
+
+
+
+
+
+
+
+
+ Haili Ye, Xiaoqing Zhang, Yan Hu, Huazhu Fu, Jiang Liu
+
+
+ The morphologies of vessel-like structures, such as blood vessels and nerve
+fibres, play significant roles in disease diagnosis, e.g., Parkinson's disease.
+Deep network-based refinement segmentation methods have recently achieved
+promising vessel-like structure segmentation results. There are still two
+challenges: (1) existing methods have limitations in rehabilitating subsection
+ruptures in segmented vessel-like structures; (2) they are often overconfident
+in predicted segmentation results. To tackle these two challenges, this paper
+attempts to leverage the potential of spatial interconnection relationships
+among subsection ruptures from the structure rehabilitation perspective. Based
+on this, we propose a novel Vessel-like Structure Rehabilitation Network
+(VSR-Net) to rehabilitate subsection ruptures and improve the model calibration
+based on coarse vessel-like structure segmentation results. VSR-Net first
+constructs subsection rupture clusters with Curvilinear Clustering Module
+(CCM). Then, the well-designed Curvilinear Merging Module (CMM) is applied to
+rehabilitate the subsection ruptures to obtain the refined vessel-like
+structures. Extensive experiments on five 2D/3D medical image datasets show
+that VSR-Net significantly outperforms state-of-the-art (SOTA) refinement
+segmentation methods with lower calibration error. Additionally, we provide
+quantitative analysis to explain the morphological difference between the
+rehabilitation results of VSR-Net and ground truth (GT), which is smaller than
+SOTA methods and GT, demonstrating that our method better rehabilitates
+vessel-like structures by restoring subsection ruptures.
+
+
+
+
+
+
+
+ ☆ Investigating Color Illusions from the Perspective of Computational
+ Color Constancy
+
+
+ Color constancy and color illusion perception are two phenomena occurring in
+the human visual system, which can help us reveal unknown mechanisms of human
+perception. For decades computer vision scientists have developed numerous
+color constancy methods, which estimate the reflectance of the surface by
+discounting the illuminant. However, color illusions have not been analyzed in
+detail in the field of computational color constancy, which we find surprising
+since the relationship they share is significant and may let us design more
+robust systems. We argue that any model that can reproduce our sensation on
+color illusions should also be able to provide pixel-wise estimates of the
+light source. In other words, we suggest that the analysis of color illusions
+helps us to improve the performance of the existing global color constancy
+methods, and enable them to provide pixel-wise estimates for scenes illuminated
+by multiple light sources. In this study, we share the outcomes of our
+investigation in which we take several color constancy methods and modify them
+to reproduce the behavior of the human visual system on color illusions. Also,
+we show that parameters purely extracted from illusions are able to improve the
+performance of color constancy methods. A noteworthy outcome is that our
+strategy based on the investigation of color illusions outperforms the
+state-of-the-art methods that are specifically designed to transform global
+color constancy algorithms into multi-illuminant algorithms.
+
+
+
+ comment: This work is accepted at VISAPP 2024 as a long paper
+
+ Graphical User Interface (GUI) automation holds significant promise for
+assisting users with complex tasks, thereby boosting human productivity.
+Existing works leveraging Large Language Model (LLM) or LLM-based AI agents
+have shown capabilities in automating tasks on Android and Web platforms.
+However, these tasks are primarily aimed at simple device usage and
+entertainment operations. This paper presents a novel benchmark, AssistGUI, to
+evaluate whether models are capable of manipulating the mouse and keyboard on
+the Windows platform in response to user-requested tasks. We carefully
+collected a set of 100 tasks from nine widely-used software applications, such
+as, After Effects and MS Word, each accompanied by the necessary project files
+for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied
+Agent framework, which incorporates a sophisticated GUI parser driven by an
+LLM-agent and an enhanced reasoning mechanism adept at handling lengthy
+procedural tasks. Our experimental results reveal that our GUI Parser and
+Reasoning mechanism outshine existing methods in performance. Nevertheless, the
+potential remains substantial, with the best model attaining only a 46% success
+rate on our benchmark. We conclude with a thorough analysis of the current
+methods' limitations, setting the stage for future breakthroughs in this
+domain.
+
+
+ Predicting the trajectory of an ego vehicle is a critical component of
+autonomous driving systems. Current state-of-the-art methods typically rely on
+Deep Neural Networks (DNNs) and sequential models to process front-view images
+for future trajectory prediction. However, these approaches often struggle with
+perspective issues affecting object features in the scene. To address this, we
+advocate for the use of Bird's Eye View (BEV) perspectives, which offer unique
+advantages in capturing spatial relationships and object homogeneity. In our
+work, we leverage Graph Neural Networks (GNNs) and positional encoding to
+represent objects in a BEV, achieving competitive performance compared to
+traditional DNN-based methods. While the BEV-based approach loses some detailed
+information inherent to front-view images, we balance this by enriching the BEV
+data by representing it as a graph where relationships between the objects in a
+scene are captured effectively.
+
+
+
+ comment: Accepted for publication in the Electronic Imagine Autonomous
+ Vehicles and Machines (EI-AVM) Conference
+
+
+
+
+
+
+ ☆ Exploring Multimodal Large Language Models for Radiology Report
+ Error-checking
+
+
+
+
+
+
+
+
+ Jinge Wu, Yunsoo Kim, Eva C. Keller, Jamie Chow, Adam P. Levine, Nikolas Pontikos, Zina Ibrahim, Paul Taylor, Michelle C. Williams, Honghan Wu
+
+
+ This paper proposes one of the first clinical applications of multimodal
+large language models (LLMs) as an assistant for radiologists to check errors
+in their reports. We created an evaluation dataset from two real-world
+radiology datasets (MIMIC-CXR and IU-Xray), with 1,000 subsampled reports each.
+A subset of original reports was modified to contain synthetic errors by
+introducing various type of mistakes. The evaluation contained two difficulty
+levels: SIMPLE for binary error-checking and COMPLEX for identifying error
+types. LLaVA (Large Language and Visual Assistant) variant models, including
+our instruction-tuned model, were used for the evaluation. Additionally, a
+domain expert evaluation was conducted on a small test set. At the SIMPLE
+level, the LLaVA v1.5 model outperformed other publicly available models.
+Instruction tuning significantly enhanced performance by 47.4% and 25.4% on
+MIMIC-CXR and IU-Xray data, respectively. The model also surpassed the domain
+experts accuracy in the MIMIC-CXR dataset by 1.67%. Notably, among the subsets
+(N=21) of the test set where a clinician did not achieve the correct
+conclusion, the LLaVA ensemble mode correctly identified 71.4% of these cases.
+This study marks a promising step toward utilizing multi-modal LLMs to enhance
+diagnostic accuracy in radiology. The ensemble model demonstrated comparable
+performance to clinicians, even capturing errors overlooked by humans.
+Nevertheless, future work is needed to improve the model ability to identify
+the types of inconsistency.
+
+
+
+
+
+
+
+
+ Li Ma, Vasu Agrawal, Haithem Turki, Changil Kim, Chen Gao, Pedro Sander, Michael Zollhöfer, Christian Richardt
+
+
+ Neural radiance fields have achieved remarkable performance in modeling the
+appearance of 3D scenes. However, existing approaches still struggle with the
+view-dependent appearance of glossy surfaces, especially under complex lighting
+of indoor environments. Unlike existing methods, which typically assume distant
+lighting like an environment map, we propose a learnable Gaussian directional
+encoding to better model the view-dependent effects under near-field lighting
+conditions. Importantly, our new directional encoding captures the
+spatially-varying nature of near-field lighting and emulates the behavior of
+prefiltered environment maps. As a result, it enables the efficient evaluation
+of preconvolved specular color at any 3D location with varying roughness
+coefficients. We further introduce a data-driven geometry prior that helps
+alleviate the shape radiance ambiguity in reflection modeling. We show that our
+Gaussian directional encoding and geometry prior significantly improve the
+modeling of challenging specular reflections in neural radiance fields, which
+helps decompose appearance into more physically meaningful components.
+
+
+
+
+
+
+
+
+ William Heyden, Habib Ullah, M. Salman Siddiqui, Fadi Al Machot
+
+
+ Generalized Zero-Shot Learning (GZSL) recognizes unseen classes by
+transferring knowledge from the seen classes, depending on the inherent
+interactions between visual and semantic data. However, the discrepancy between
+well-prepared training data and unpredictable real-world test scenarios remains
+a significant challenge. This paper introduces a dual strategy to address the
+generalization gap. Firstly, we incorporate semantic information through an
+innovative encoder. This encoder effectively integrates class-specific semantic
+information by targeting the performance disparity, enhancing the produced
+features to enrich the semantic space for class-specific attributes. Secondly,
+we refine our generative capabilities using a novel compositional loss
+function. This approach generates discriminative classes, effectively
+classifying both seen and unseen classes. In addition, we extend the
+exploitation of the learned latent space by utilizing controlled semantic
+inputs, ensuring the robustness of the model in varying environments. This
+approach yields a model that outperforms the state-of-the-art models in terms
+of both generalization and diverse settings, notably without requiring
+hyperparameter tuning or domain-specific adaptations. We also propose a set of
+novel evaluation metrics to provide a more detailed assessment of the
+reliability and reproducibility of the results. The complete code is made
+available on https://github.com/william-heyden/SEER-ZeroShotLearning/.
+
+
+
+
+
+
+
+ ☆ MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using
+ Differentiable Shading
+
+
+
+
+
+
+
+
+ Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amin Fadaeinejad, Rafael M. O. Cruz, Marc-Andre Carbonneau
+
+
+ Reconstructing an avatar from a portrait image has many applications in
+multimedia, but remains a challenging research problem. Extracting reflectance
+maps and geometry from one image is ill-posed: recovering geometry is a
+one-to-many mapping problem and reflectance and light are difficult to
+disentangle. Accurate geometry and reflectance can be captured under the
+controlled conditions of a light stage, but it is costly to acquire large
+datasets in this fashion. Moreover, training solely with this type of data
+leads to poor generalization with in-the-wild images. This motivates the
+introduction of MoSAR, a method for 3D avatar generation from monocular images.
+We propose a semi-supervised training scheme that improves generalization by
+learning from both light stage and in-the-wild datasets. This is achieved using
+a novel differentiable shading formulation. We show that our approach
+effectively disentangles the intrinsic face parameters, producing relightable
+avatars. As a result, MoSAR estimates a richer set of skin reflectance maps,
+and generates more realistic avatars than existing state-of-the-art methods. We
+also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public
+dataset providing intrisic face attributes at scale (diffuse, specular, ambient
+occlusion and translucency maps) for a total of 10k subjects. The project
+website and the dataset are available on the following link:
+https://ubisoftlaforge.github.io/character/mosar
+
+
+
+
+
+
+
+ ☆ Perception Test 2023: A Summary of the First Challenge And Outcome
+
+
+
+
+
+
+
+
+ Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
+
+
+ The First Perception Test challenge was held as a half-day workshop alongside
+the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the
+goal of benchmarking state-of-the-art video models on the recently proposed
+Perception Test benchmark. The challenge had six tracks covering low-level and
+high-level tasks, with both a language and non-language interface, across
+video, audio, and text modalities, and covering: object tracking, point
+tracking, temporal action localisation, temporal sound localisation,
+multiple-choice video question-answering, and grounded video
+question-answering. We summarise in this report the task descriptions, metrics,
+baselines, and results.
+
+
+
+
+
+
+
+ ☆ BEVSeg2TP: Surround View Camera Bird's-Eye-View Based Joint Vehicle
+ Segmentation and Ego Vehicle Trajectory Prediction
+
+
+ Trajectory prediction is, naturally, a key task for vehicle autonomy. While
+the number of traffic rules is limited, the combinations and uncertainties
+associated with each agent's behaviour in real-world scenarios are nearly
+impossible to encode. Consequently, there is a growing interest in
+learning-based trajectory prediction. The proposed method in this paper
+predicts trajectories by considering perception and trajectory prediction as a
+unified system. In considering them as unified tasks, we show that there is the
+potential to improve the performance of perception. To achieve these goals, we
+present BEVSeg2TP - a surround-view camera bird's-eye-view-based joint vehicle
+segmentation and ego vehicle trajectory prediction system for autonomous
+vehicles. The proposed system uses a network trained on multiple camera views.
+The images are transformed using several deep learning techniques to perform
+semantic segmentation of objects, including other vehicles, in the scene. The
+segmentation outputs are fused across the camera views to obtain a
+comprehensive representation of the surrounding vehicles from the
+bird's-eye-view perspective. The system further predicts the future trajectory
+of the ego vehicle using a spatiotemporal probabilistic network (STPN) to
+optimize trajectory prediction. This network leverages information from
+encoder-decoder transformers and joint vehicle segmentation.
+
+
+
+ comment: Accepted for publication in the International Conference on Computer
+ Vision Theory and Applications (VISAPP) 2024
+
+
+
+
+
+
+ ☆ Point Deformable Network with Enhanced Normal Embedding for Point Cloud
+ Analysis
+
+
+ Recently MLP-based methods have shown strong performance in point cloud
+analysis. Simple MLP architectures are able to learn geometric features in
+local point groups yet fail to model long-range dependencies directly. In this
+paper, we propose Point Deformable Network (PDNet), a concise MLP-based network
+that can capture long-range relations with strong representation ability.
+Specifically, we put forward Point Deformable Aggregation Module (PDAM) to
+improve representation capability in both long-range dependency and adaptive
+aggregation among points. For each query point, PDAM aggregates information
+from deformable reference points rather than points in limited local areas. The
+deformable reference points are generated data-dependent, and we initialize
+them according to the input point positions. Additional offsets and modulation
+scalars are learned on the whole point features, which shift the deformable
+reference points to the regions of interest. We also suggest estimating the
+normal vector for point clouds and applying Enhanced Normal Embedding (ENE) to
+the geometric extractors to improve the representation ability of single-point.
+Extensive experiments and ablation studies on various benchmarks demonstrate
+the effectiveness and superiority of our PDNet.
+
+
+ Self-supervised monocular depth estimation is of significant importance with
+applications spanning across autonomous driving and robotics. However, the
+reliance on self-supervision introduces a strong static-scene assumption,
+thereby posing challenges in achieving optimal performance in dynamic scenes,
+which are prevalent in most real-world situations. To address these issues, we
+propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to
+transfer a pre-trained image model for self-supervised depth estimation. The
+training comprises two sequential stages: an initial phase trained on a dataset
+primarily composed of static scenes, succeeded by an expansion to more
+intricate datasets involving dynamic scenes. To facilitate this process, we
+design compact encoder and decoder adapters to enable parameter-efficient
+tuning, allowing the network to adapt effectively. They not only uphold
+generalized patterns from pre-trained image models but also retain knowledge
+gained from the preceding phase into the subsequent one. Extensive experiments
+demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI,
+CityScapes and DDAD datasets.
+
+
+
+
+
+
+
+
+ Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
+
+
+ Bias in text-to-image (T2I) models can propagate unfair social
+representations and may be used to aggressively market ideas or push
+controversial agendas. Existing T2I model bias evaluation methods only focus on
+social biases. We look beyond that and instead propose an evaluation
+methodology to quantify general biases in T2I generative models, without any
+preconceived notions. We assess four state-of-the-art T2I models and compare
+their baseline bias characteristics to their respective variants (two for
+each), where certain biases have been intentionally induced. We propose three
+evaluation metrics to assess model biases including: (i) Distribution bias,
+(ii) Jaccard hallucination and (iii) Generative miss-rate. We conduct two
+evaluation studies, modelling biases under general, and task-oriented
+conditions, using a marketing scenario as the domain for the latter. We also
+quantify social biases to compare our findings to related works. Finally, our
+methodology is transferred to evaluate captioned-image datasets and measure
+their bias. Our approach is objective, domain-agnostic and consistently
+measures different forms of T2I model biases. We have developed a web
+application and practical implementation of what has been proposed in this
+work, which is at https://huggingface.co/spaces/JVice/try-before-you-bias. A
+video series with demonstrations is available at
+https://www.youtube.com/channel/UCk-0xyUyT0MSd_hkp4jQt1Q
+
+
+
+
+
+
+
+
+ Byung Hyun Lee, Min-hwan Oh, Se Young Chun
+
+
+ Task-free online continual learning (TF-CL) is a challenging problem where
+the model incrementally learns tasks without explicit task information.
+Although training with entire data from the past, present as well as future is
+considered as the gold standard, naive approaches in TF-CL with the current
+samples may be conflicted with learning with samples in the future, leading to
+catastrophic forgetting and poor plasticity. Thus, a proactive consideration of
+an unseen future sample in TF-CL becomes imperative. Motivated by this
+intuition, we propose a novel TF-CL framework considering future samples and
+show that injecting adversarial perturbations on both input data and
+decision-making is effective. Then, we propose a novel method named Doubly
+Perturbed Continual Learning (DPCL) to efficiently implement these input and
+decision-making perturbations. Specifically, for input perturbation, we propose
+an approximate perturbation method that injects noise into the input data as
+well as the feature vector and then interpolates the two perturbed samples. For
+decision-making process perturbation, we devise multiple stochastic
+classifiers. We also investigate a memory management scheme and learning rate
+scheduling reflecting our proposed double perturbations. We demonstrate that
+our proposed method outperforms the state-of-the-art baseline methods by large
+margins on various TF-CL benchmarks.
+
+
+
+
+
+
+
+
+ Yuming Gu, Hongyi Xu, You Xie, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Lingjie Luo
+
+
+ We present DiffPortrait3D, a conditional diffusion model that is capable of
+synthesizing 3D-consistent photo-realistic novel views from as few as a single
+in-the-wild portrait. Specifically, given a single RGB input, we aim to
+synthesize plausible but consistent facial details rendered from novel camera
+views with retained both identity and facial expression. In lieu of
+time-consuming optimization and fine-tuning, our zero-shot method generalizes
+well to arbitrary face portraits with unposed camera views, extreme facial
+expressions, and diverse artistic depictions. At its core, we leverage the
+generative prior of 2D diffusion models pre-trained on large-scale image
+datasets as our rendering backbone, while the denoising is guided with
+disentangled attentive control of appearance and camera pose. To achieve this,
+we first inject the appearance context from the reference image into the
+self-attention layers of the frozen UNets. The rendering view is then
+manipulated with a novel conditional control module that interprets the camera
+pose by watching a condition image of a crossed subject from the same view.
+Furthermore, we insert a trainable cross-view attention module to enhance view
+consistency, which is further strengthened with a novel 3D-aware noise
+generation process during inference. We demonstrate state-of-the-art results
+both qualitatively and quantitatively on our challenging in-the-wild and
+multi-view benchmarks.
+
+
+
+
+
+
+
+ ☆ No More Shortcuts: Realizing the Potential of Temporal Self-Supervision AAAI 2024
+
+
+
+
+
+
+
+
+ Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
+
+
+ Self-supervised approaches for video have shown impressive results in video
+understanding tasks. However, unlike early works that leverage temporal
+self-supervision, current state-of-the-art methods primarily rely on tasks from
+the image domain (e.g., contrastive learning) that do not explicitly promote
+the learning of temporal features. We identify two factors that limit existing
+temporal self-supervision: 1) tasks are too simple, resulting in saturated
+training performance, and 2) we uncover shortcuts based on local appearance
+statistics that hinder the learning of high-level features. To address these
+issues, we propose 1) a more challenging reformulation of temporal
+self-supervision as frame-level (rather than clip-level) recognition tasks and
+2) an effective augmentation strategy to mitigate shortcuts. Our model extends
+a representation of single video frames, pre-trained through contrastive
+learning, with a transformer that we train through temporal self-supervision.
+We demonstrate experimentally that our more challenging frame-level task
+formulations and the removal of shortcuts drastically improve the quality of
+features learned through temporal self-supervision. The generalization
+capability of our self-supervised video method is evidenced by its
+state-of-the-art performance in a wide range of high-level semantic tasks,
+including video retrieval, action classification, and video attribute
+recognition (such as object and scene identification), as well as low-level
+temporal correspondence tasks like video object segmentation and pose tracking.
+Additionally, we show that the video representations learned through our method
+exhibit increased robustness to the input perturbations.
+
+
+
+ comment: AAAI 2024 (Main Technical Track)
+
+
+
+
+
+
+ ☆ Aggregating Multiple Bio-Inspired Image Region Classifiers For Effective
+ And Lightweight Visual Place Recognition
+
+
+
+
+
+
+
+
+ Bruno Arcanjo, Bruno Ferrarini, Maria Fasli, Michael Milford, Klaus D. McDonald-Maier, Shoaib Ehsan
+
+
+ Visual place recognition (VPR) enables autonomous systems to localize
+themselves within an environment using image information. While VPR techniques
+built upon a Convolutional Neural Network (CNN) backbone dominate
+state-of-the-art VPR performance, their high computational requirements make
+them unsuitable for platforms equipped with low-end hardware. Recently, a
+lightweight VPR system based on multiple bio-inspired classifiers, dubbed
+DrosoNets, has been proposed, achieving great computational efficiency at the
+cost of reduced absolute place retrieval performance. In this work, we propose
+a novel multi-DrosoNet localization system, dubbed RegionDrosoNet, with
+significantly improved VPR performance, while preserving a low-computational
+profile. Our approach relies on specializing distinct groups of DrosoNets on
+differently sliced partitions of the original image, increasing extrinsic model
+differentiation. Furthermore, we introduce a novel voting module to combine the
+outputs of all DrosoNets into the final place prediction which considers
+multiple top refence candidates from each DrosoNet. RegionDrosoNet outperforms
+other lightweight VPR techniques when dealing with both appearance changes and
+viewpoint variations. Moreover, it competes with computationally expensive
+methods on some benchmark datasets at a small fraction of their online
+inference time.
+
+
+
+
+
+
+
+ ☆ Multi-task Learning To Improve Semantic Segmentation Of CBCT Scans Using
+ Image Reconstruction
+
+
+
+
+
+
+
+
+ Maximilian Ernst Tschuchnig, Julia Coste-Marin, Philipp Steininger, Michael Gadermayr
+
+
+ Semantic segmentation is a crucial task in medical image processing,
+essential for segmenting organs or lesions such as tumors. In this study we aim
+to improve automated segmentation in CBCTs through multi-task learning. To
+evaluate effects on different volume qualities, a CBCT dataset is synthesised
+from the CT Liver Tumor Segmentation Benchmark (LiTS) dataset. To improve
+segmentation, two approaches are investigated. First, we perform multi-task
+learning to add morphology based regularization through a volume reconstruction
+task. Second, we use this reconstruction task to reconstruct the best quality
+CBCT (most similar to the original CT), facilitating denoising effects. We
+explore both holistic and patch-based approaches. Our findings reveal that,
+especially using a patch-based approach, multi-task learning improves
+segmentation in most cases and that these results can further be improved by
+our denoising approach.
+
+
+
+ comment: Accepted at German Conference on Medical Image Computing (BVM) 2024
+
+ Establishing accurate and representative matches is a crucial step in
+addressing the point cloud registration problem. A commonly employed approach
+involves detecting keypoints with salient geometric features and subsequently
+mapping these keypoints from one frame of the point cloud to another. However,
+methods within this category are hampered by the repeatability of the sampled
+keypoints. In this paper, we introduce a saliency-guided trans\textbf{former},
+referred to as \textit{D3Former}, which entails the joint learning of
+repeatable \textbf{D}ense \textbf{D}etectors and feature-enhanced
+\textbf{D}escriptors. The model comprises a Feature Enhancement Descriptor
+Learning (FEDL) module and a Repetitive Keypoints Detector Learning (RKDL)
+module. The FEDL module utilizes a region attention mechanism to enhance
+feature distinctiveness, while the RKDL module focuses on detecting repeatable
+keypoints to enhance matching capabilities. Extensive experimental results on
+challenging indoor and outdoor benchmarks demonstrate that our proposed method
+consistently outperforms state-of-the-art point cloud matching methods.
+Notably, tests on 3DLoMatch, even with a low overlap ratio, show that our
+method consistently outperforms recently published approaches such as RoReg and
+RoITr. For instance, with the number of extracted keypoints reduced to 250, the
+registration recall scores for RoReg, RoITr, and our method are 64.3\%, 73.6\%,
+and 76.5\%, respectively.
+
+
+
+ comment: 15 pages, 6 figures
+
+
+
+
+
+
+ ☆ Radar Fields: An Extension of Radiance Fields to SAR
+
+
+
+
+
+
+
+
+ Thibaud Ehret, Roger Marí, Dawa Derksen, Nicolas Gasnier, Gabriele Facciolo
+
+
+ Radiance fields have been a major breakthrough in the field of inverse
+rendering, novel view synthesis and 3D modeling of complex scenes from
+multi-view image collections. Since their introduction, it was shown that they
+could be extended to other modalities such as LiDAR, radio frequencies, X-ray
+or ultrasound. In this paper, we show that, despite the important difference
+between optical and synthetic aperture radar (SAR) image formation models, it
+is possible to extend radiance fields to radar images thus presenting the first
+"radar fields". This allows us to learn surface models using only collections
+of radar images, similar to how regular radiance fields are learned and with
+the same computational complexity on average. Thanks to similarities in how
+both fields are defined, this work also shows a potential for hybrid methods
+combining both optical and SAR images.
+
+
+
+
+
+
+
+ ☆ TADAP: Trajectory-Aided Drivable area Auto-labeling with Pre-trained
+ self-supervised features in winter driving conditions
+
+
+
+
+
+
+
+
+ Eerik Alamikkotervo, Risto Ojala, Alvari Seppänen, Kari Tammi
+
+
+ Detection of the drivable area in all conditions is crucial for autonomous
+driving and advanced driver assistance systems. However, the amount of labeled
+data in adverse driving conditions is limited, especially in winter, and
+supervised methods generalize poorly to conditions outside the training
+distribution. For easy adaption to all conditions, the need for human
+annotation should be removed from the learning process. In this paper,
+Trajectory-Aided Drivable area Auto-labeling with Pre-trained self-supervised
+features (TADAP) is presented for automated annotation of the drivable area in
+winter driving conditions. A sample of the drivable area is extracted based on
+the trajectory estimate from the global navigation satellite system. Similarity
+with the sample area is determined based on pre-trained self-supervised visual
+features. Image areas similar to the sample area are considered to be drivable.
+These TADAP labels were evaluated with a novel winter-driving dataset,
+collected in varying driving scenes. A prediction model trained with the TADAP
+labels achieved a +9.6 improvement in intersection over union compared to the
+previous state-of-the-art of self-supervised drivable area detection.
+
+
+
+
+
+
+
+ ☆ Sign Language Production with Latent Motion TransformerWACV2024
+
+
+
+
+
+
+
+
+ Pan Xie, Taiyi Peng, Yao Du, Qipeng Zhang
+
+
+ Sign Language Production (SLP) is the tough task of turning sign language
+into sign videos. The main goal of SLP is to create these videos using a sign
+gloss. In this research, we've developed a new method to make high-quality sign
+videos without using human poses as a middle step. Our model works in two main
+parts: first, it learns from a generator and the video's hidden features, and
+next, it uses another model to understand the order of these hidden features.
+To make this method even better for sign videos, we make several significant
+improvements. (i) In the first stage, we take an improved 3D VQ-GAN to learn
+downsampled latent representations. (ii) In the second stage, we introduce
+sequence-to-sequence attention to better leverage conditional information.
+(iii) The separated two-stage training discards the realistic visual semantic
+of the latent codes in the second stage. To endow the latent sequences semantic
+information, we extend the token-level autoregressive latent codes learning
+with perceptual loss and reconstruction loss for the prior model with visual
+perception. Compared with previous state-of-the-art approaches, our model
+performs consistently better on two word-level sign language datasets, i.e.,
+WLASL and NMFs-CSL.
+
+
+
+ comment: Accepted by WACV2024
+
+
+
+
+
+
+ ☆ Produce Once, Utilize Twice for Anomaly Detection
+
+
+ Visual anomaly detection aims at classifying and locating the regions that
+deviate from the normal appearance. Embedding-based methods and
+reconstruction-based methods are two main approaches for this task. However,
+they are either not efficient or not precise enough for the industrial
+detection. To deal with this problem, we derive POUTA (Produce Once Utilize
+Twice for Anomaly detection), which improves both the accuracy and efficiency
+by reusing the discriminant information potential in the reconstructive
+network. We observe that the encoder and decoder representations of the
+reconstructive network are able to stand for the features of the original and
+reconstructed image respectively. And the discrepancies between the symmetric
+reconstructive representations provides roughly accurate anomaly information.
+To refine this information, a coarse-to-fine process is proposed in POUTA,
+which calibrates the semantics of each discriminative layer by the high-level
+representations and supervision loss. Equipped with the above modules, POUTA is
+endowed with the ability to provide a more precise anomaly location than the
+prior arts. Besides, the representation reusage also enables to exclude the
+feature extraction process in the discriminative network, which reduces the
+parameters and improves the efficiency. Extensive experiments show that, POUTA
+is superior or comparable to the prior methods with even less cost.
+Furthermore, POUTA also achieves better performance than the state-of-the-art
+few-shot anomaly detection methods without any special design, showing that
+POUTA has strong ability to learn representations inherent in the training
+data.
+
+
+
+
+
+
+
+ ☆ The Common Optical Music Recognition Evaluation Framework
+
+
+ The quality of Optical Music Recognition (OMR) systems is a rather difficult
+magnitude to measure. There is no lingua franca shared among OMR datasets that
+allows to compare systems' performance on equal grounds, since most of them are
+specialised on certain approaches. As a result, most state-of-the-art works
+currently report metrics that cannot be compared directly. In this paper we
+identify the need of a common music representation language and propose the
+Music Tree Notation (MTN) format, thanks to which the definition of standard
+metrics is possible. This format represents music as a set of primitives that
+group together into higher-abstraction nodes, a compromise between the
+expression of fully graph-based and sequential notation formats. We have also
+developed a specific set of OMR metrics and a typeset score dataset as a proof
+of concept of this idea.
+
+
+
+ comment: 18 pages, 4 figures, 3 tables, submitted (under review) for the
+ International Journal in Document Analysis and Recognition
+
+
+
+
+
+
+ ☆ Testing the Segment Anything Model on radiology data
+
+
+
+
+
+
+
+
+ José Guilherme de Almeida, Nuno M. Rodrigues, Sara Silva, Nickolas Papanikolaou
+
+
+ Deep learning models trained with large amounts of data have become a recent
+and effective approach to predictive problem solving -- these have become known
+as "foundation models" as they can be used as fundamental tools for other
+applications. While the paramount examples of image classification (earlier)
+and large language models (more recently) led the way, the Segment Anything
+Model (SAM) was recently proposed and stands as the first foundation model for
+image segmentation, trained on over 10 million images and with recourse to over
+1 billion masks. However, the question remains -- what are the limits of this
+foundation? Given that magnetic resonance imaging (MRI) stands as an important
+method of diagnosis, we sought to understand whether SAM could be used for a
+few tasks of zero-shot segmentation using MRI data. Particularly, we wanted to
+know if selecting masks from the pool of SAM predictions could lead to good
+segmentations.
+ Here, we provide a critical assessment of the performance of SAM on magnetic
+resonance imaging data. We show that, while acceptable in a very limited set of
+cases, the overall trend implies that these models are insufficient for MRI
+segmentation across the whole volume, but can provide good segmentations in a
+few, specific slices. More importantly, we note that while foundation models
+trained on natural images are set to become key aspects of predictive
+modelling, they may prove ineffective when used on other imaging modalities.
+
+
+
+
+
+
+
+ ☆ Relightable and Animatable Neural Avatars from Videos AAAI 2024
+
+
+ Lightweight creation of 3D digital avatars is a highly desirable but
+challenging task. With only sparse videos of a person under unknown
+illumination, we propose a method to create relightable and animatable neural
+avatars, which can be used to synthesize photorealistic images of humans under
+novel viewpoints, body poses, and lighting. The key challenge here is to
+disentangle the geometry, material of the clothed body, and lighting, which
+becomes more difficult due to the complex geometry and shadow changes caused by
+body motions. To solve this ill-posed problem, we propose novel techniques to
+better model the geometry and shadow changes. For geometry change modeling, we
+propose an invertible deformation field, which helps to solve the inverse
+skinning problem and leads to better geometry quality. To model the spatial and
+temporal varying shading cues, we propose a pose-aware part-wise light
+visibility network to estimate light occlusion. Extensive experiments on
+synthetic and real datasets show that our approach reconstructs high-quality
+geometry and generates realistic shadows under different body poses. Code and
+data are available at
+\url{https://wenbin-lin.github.io/RelightableAvatar-page/}.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ COVID-19 Diagnosis: ULGFBP-ResNet51 approach on the CT and the Chest
+ X-ray Images Classification
+
+
+ The contagious and pandemic COVID-19 disease is currently considered as the
+main health concern and posed widespread panic across human-beings. It affects
+the human respiratory tract and lungs intensely. So that it has imposed
+significant threats for premature death. Although, its early diagnosis can play
+a vital role in revival phase, the radiography tests with the manual
+intervention are a time-consuming process. Time is also limited for such manual
+inspecting of numerous patients in the hospitals. Thus, the necessity of
+automatic diagnosis on the chest X-ray or the CT images with a high efficient
+performance is urgent. Toward this end, we propose a novel method, named as the
+ULGFBP-ResNet51 to tackle with the COVID-19 diagnosis in the images. In fact,
+this method includes Uniform Local Binary Pattern (ULBP), Gabor Filter (GF),
+and ResNet51. According to our results, this method could offer superior
+performance in comparison with the other methods, and attain maximum accuracy.
+
+
+
+ comment: 16 pages, 8 figures, submitted for possible journal publication
+
+
+
+
+
+
+ ☆ Integration and Performance Analysis of Artificial Intelligence and
+ Computer Vision Based on Deep Learning Algorithms
+
+
+
+
+
+
+
+
+ Bo Liu, Liqiang Yu, Chang Che, Qunwei Lin, Hao Hu, Xinyu Zhao
+
+
+ This paper focuses on the analysis of the application effectiveness of the
+integration of deep learning and computer vision technologies. Deep learning
+achieves a historic breakthrough by constructing hierarchical neural networks,
+enabling end-to-end feature learning and semantic understanding of images. The
+successful experiences in the field of computer vision provide strong support
+for training deep learning algorithms. The tight integration of these two
+fields has given rise to a new generation of advanced computer vision systems,
+significantly surpassing traditional methods in tasks such as machine vision
+image classification and object detection. In this paper, typical image
+classification cases are combined to analyze the superior performance of deep
+neural network models while also pointing out their limitations in
+generalization and interpretability, proposing directions for future
+improvements. Overall, the efficient integration and development trend of deep
+learning with massive visual data will continue to drive technological
+breakthroughs and application expansion in the field of computer vision, making
+it possible to build truly intelligent machine vision systems. This deepening
+fusion paradigm will powerfully promote unprecedented tasks and functions in
+computer vision, providing stronger development momentum for related
+disciplines and industries.
+
+
+
+
+
+
+
+ ☆ The Audio-Visual Conversational Graph: From an Egocentric-Exocentric
+ Perspective
+
+
+ In recent years, the thriving development of research related to egocentric
+videos has provided a unique perspective for the study of conversational
+interactions, where both visual and audio signals play a crucial role. While
+most prior work focus on learning about behaviors that directly involve the
+camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction
+problem, marking the first attempt to infer exocentric conversational
+interactions from egocentric videos. We propose a unified multi-modal,
+multi-task framework -- Audio-Visual Conversational Attention (Av-CONV), for
+the joint prediction of conversation behaviors -- speaking and listening -- for
+both the camera wearer as well as all other social partners present in the
+egocentric video. Specifically, we customize the self-attention mechanism to
+model the representations across-time, across-subjects, and across-modalities.
+To validate our method, we conduct experiments on a challenging egocentric
+video dataset that includes first-person perspective, multi-speaker, and
+multi-conversation scenarios. Our results demonstrate the superior performance
+of our method compared to a series of baselines. We also present detailed
+ablation studies to assess the contribution of each component in our model.
+Project page: https://vjwq.github.io/AV-CONV/.
+
+
+
+
+
+
+
+
+ Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, Maximilian Ilse
+
+
+ Biomedical imaging datasets are often small and biased, meaning that
+real-world performance of predictive models can be substantially lower than
+expected from internal testing. This work proposes using generative image
+editing to simulate dataset shifts and diagnose failure modes of biomedical
+vision models; this can be used in advance of deployment to assess readiness,
+potentially reducing cost and patient harm. Existing editing methods can
+produce undesirable changes, with spurious correlations learned due to the
+co-occurrence of disease and treatment interventions, limiting practical
+applicability. To address this, we train a text-to-image diffusion model on
+multiple chest X-ray datasets and introduce a new editing method RadEdit that
+uses multiple masks, if present, to constrain changes and ensure consistency in
+the edited images. We consider three types of dataset shifts: acquisition
+shift, manifestation shift, and population shift, and demonstrate that our
+approach can diagnose failures and quantify model robustness without additional
+data collection, complementing more qualitative tools for explainable AI.
+
+
+
+
+
+
+
+ ☆ SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
+ Remote Sensing AAAI 2024
+
+
+ Remote sensing imagery, despite its broad applications in helping achieve
+Sustainable Development Goals and tackle climate change, has not yet benefited
+from the recent advancements of versatile, task-agnostic vision language models
+(VLMs). A key reason is that the large-scale, semantically diverse image-text
+dataset required for developing VLMs is still absent for remote sensing images.
+Unlike natural images, remote sensing images and their associated text
+descriptions cannot be efficiently collected from the public Internet at scale.
+In this work, we bridge this gap by using geo-coordinates to automatically
+connect open, unlabeled remote sensing images with rich semantics covered in
+OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language
+dataset for remote sensing images, comprising 2.6 million image-text pairs
+covering 29K distinct semantic tags. With continual pre-training on this
+dataset, we obtain a VLM that surpasses baseline models with a 6.2% average
+accuracy gain in zero-shot scene classification across seven benchmark
+datasets. It also demonstrates the ability of zero-shot transfer for
+fine-grained object attribute classification and cross-modal retrieval. We hope
+this dataset can support the advancement of VLMs for various multi-modal tasks
+in remote sensing, such as open-vocabulary classification, retrieval,
+captioning, and text-to-image synthesis.
+
+
+
+
+
+
+
+
+ Shahrokh Heidari, Michael J. Dinneen, Patrice Delmas
+
+
+ Computer Vision (CV) labelling algorithms play a pivotal role in the domain
+of low-level vision. For decades, it has been known that these problems can be
+elegantly formulated as discrete energy minimization problems derived from
+probabilistic graphical models (such as Markov Random Fields). Despite recent
+advances in inference algorithms (such as graph-cut and message-passing
+algorithms), the resulting energy minimization problems are generally viewed as
+intractable. The emergence of quantum computations, which offer the potential
+for faster solutions to certain problems than classical methods, has led to an
+increased interest in utilizing quantum properties to overcome intractable
+problems. Recently, there has also been a growing interest in Quantum Computer
+Vision (QCV), with the hope of providing a credible alternative or assistant to
+deep learning solutions in the field. This study investigates a new Quantum
+Annealing based inference algorithm for CV discrete energy minimization
+problems. Our contribution is focused on Stereo Matching as a significant CV
+labeling problem. As a proof of concept, we also use a hybrid quantum-classical
+solver provided by D-Wave System to compare our results with the best classical
+inference algorithms in the literature.
+
+
+
+
+
+
+
+ ☆ FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image
+ Segmentation Against Heterogeneous Annotation Noise AAAI'24
+
+
+ Federated learning (FL) has emerged as a promising paradigm for training
+segmentation models on decentralized medical data, owing to its
+privacy-preserving property. However, existing research overlooks the prevalent
+annotation noise encountered in real-world medical datasets, which limits the
+performance ceilings of FL. In this paper, we, for the first time, identify and
+tackle this problem. For problem formulation, we propose a contour evolution
+for modeling non-independent and identically distributed (Non-IID) noise across
+pixels within each client and then extend it to the case of multi-source data
+to form a heterogeneous noise model (\textit{i.e.}, Non-IID annotation noise
+across clients). For robust learning from annotations with such two-level
+Non-IID noise, we emphasize the importance of data quality in model
+aggregation, allowing high-quality clients to have a greater impact on FL. To
+achieve this, we propose \textbf{Fed}erated learning with \textbf{A}nnotation
+qu\textbf{A}lity-aware \textbf{A}ggregat\textbf{I}on, named \textbf{FedA$^3$I},
+by introducing a quality factor based on client-wise noise estimation.
+Specifically, noise estimation at each client is accomplished through the
+Gaussian mixture model and then incorporated into model aggregation in a
+layer-wise manner to up-weight high-quality clients. Extensive experiments on
+two real-world medical image segmentation datasets demonstrate the superior
+performance of FedA$^3$I against the state-of-the-art approaches in dealing
+with cross-client annotation noise. The code is available at
+\color{blue}{https://github.com/wnn2000/FedAAAI}.
+
+
+
+ comment: Accepted at AAAI'24
+
+
+
+
+
+
+ ☆ Learning Exhaustive Correlation for Spectral Super-Resolution: Where
+ Unified Spatial-Spectral Attention Meets Mutual Linear Dependence
+
+
+ Spectral super-resolution from the easily obtainable RGB image to
+hyperspectral image (HSI) has drawn increasing interest in the field of
+computational photography. The crucial aspect of spectral super-resolution lies
+in exploiting the correlation within HSIs. However, two types of bottlenecks in
+existing Transformers limit performance improvement and practical applications.
+First, existing Transformers often separately emphasize either spatial-wise or
+spectral-wise correlation, disrupting the 3D features of HSI and hindering the
+exploitation of unified spatial-spectral correlation. Second, the existing
+self-attention mechanism learns the correlation between pairs of tokens and
+captures the full-rank correlation matrix, leading to its inability to
+establish mutual linear dependence among multiple tokens. To address these
+issues, we propose a novel Exhaustive Correlation Transformer (ECT) for
+spectral super-resolution. First, we propose a Spectral-wise Discontinuous 3D
+(SD3D) splitting strategy, which models unified spatial-spectral correlation by
+simultaneously utilizing spatial-wise continuous splitting and spectral-wise
+discontinuous splitting. Second, we propose a Dynamic Low-Rank Mapping (DLRM)
+model, which captures mutual linear dependence among multiple tokens through a
+dynamically calculated low-rank dependence map. By integrating unified
+spatial-spectral attention with mutual linear dependence, our ECT can establish
+exhaustive correlation within HSI. The experimental results on both simulated
+and real data indicate that our method achieves state-of-the-art performance.
+Codes and pretrained models will be available later.
+
+
+
+
+
+
+
+ ☆ TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary
+ Multi-Label Classification of CLIP Without Training AAAI2024
+
+
+ Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive
+capabilities in open-vocabulary classification. The class token in the image
+encoder is trained to capture the global features to distinguish different text
+descriptions supervised by contrastive loss, making it highly effective for
+single-label classification. However, it shows poor performance on multi-label
+datasets because the global feature tends to be dominated by the most prominent
+class and the contrastive nature of softmax operation aggravates it. In this
+study, we observe that the multi-label classification results heavily rely on
+discriminative local features but are overlooked by CLIP. As a result, we
+dissect the preservation of patch-wise spatial information in CLIP and proposed
+a local-to-global framework to obtain image tags. It comprises three steps: (1)
+patch-level classification to obtain coarse scores; (2) dual-masking attention
+refinement (DMAR) module to refine the coarse scores; (3) class-wise
+reidentification (CWR) module to remedy predictions from a global perspective.
+This framework is solely based on frozen CLIP and significantly enhances its
+multi-label classification performance on various benchmarks without
+dataset-specific training. Besides, to comprehensively assess the quality and
+practicality of generated tags, we extend their application to the downstream
+task, i.e., weakly supervised semantic segmentation (WSSS) with generated tags
+as image-level pseudo labels. Experiments demonstrate that this
+classify-then-segment paradigm dramatically outperforms other annotation-free
+segmentation methods and validates the effectiveness of generated tags. Our
+code is available at https://github.com/linyq2117/TagCLIP.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ ReCo-Diff: Explore Retinex-Based Condition Strategy in Diffusion Model
+ for Low-Light Image Enhancement
+
+
+ Low-light image enhancement (LLIE) has achieved promising performance by
+employing conditional diffusion models. In this study, we propose ReCo-Diff, a
+novel approach that incorporates Retinex-based prior as an additional
+pre-processing condition to regulate the generating capabilities of the
+diffusion model. ReCo-Diff first leverages a pre-trained decomposition network
+to produce initial reflectance and illumination maps of the low-light image.
+Then, an adjustment network is introduced to suppress the noise in the
+reflectance map and brighten the illumination map, thus forming the learned
+Retinex-based condition. The condition is integrated into a refinement network,
+implementing Retinex-based conditional modules that offer sufficient guidance
+at both feature- and image-levels. By treating Retinex theory as a condition,
+ReCo-Diff presents a unique perspective for establishing an LLIE-specific
+diffusion model. Extensive experiments validate the rationality and superiority
+of our ReCo-Diff approach. The code will be made publicly available.
+
+
+
+
+
+
+
+ ☆ FedSODA: Federated Cross-assessment and Dynamic Aggregation for
+ Histopathology Segmentation ICASSP2024
+
+
+ Federated learning (FL) for histopathology image segmentation involving
+multiple medical sites plays a crucial role in advancing the field of accurate
+disease diagnosis and treatment. However, it is still a task of great
+challenges due to the sample imbalance across clients and large data
+heterogeneity from disparate organs, variable segmentation tasks, and diverse
+distribution. Thus, we propose a novel FL approach for histopathology nuclei
+and tissue segmentation, FedSODA, via synthetic-driven cross-assessment
+operation (SO) and dynamic stratified-layer aggregation (DA). Our SO constructs
+a cross-assessment strategy to connect clients and mitigate the representation
+bias under sample imbalance. Our DA utilizes layer-wise interaction and dynamic
+aggregation to diminish heterogeneity and enhance generalization. The
+effectiveness of our FedSODA has been evaluated on the most extensive
+histopathology image segmentation dataset from 7 independent datasets. The code
+is available at https://github.com/yuanzhang7/FedSODA.
+
+
+
+
+
+
+
+
+ Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang
+
+
+ This paper focuses on the Audio-Visual Question Answering (AVQA) task that
+aims to answer questions derived from untrimmed audible videos. To generate
+accurate answers, an AVQA model is expected to find the most informative
+audio-visual clues relevant to the given questions. In this paper, we propose
+to explicitly consider fine-grained visual objects in video frames
+(object-level clues) and explore the multi-modal relations(i.e., the object,
+audio, and question) in terms of feature interaction and model optimization.
+For the former, we present an end-to-end object-oriented network that adopts a
+question-conditioned clue discovery module to concentrate audio/visual
+modalities on respective keywords of the question and designs a
+modality-conditioned clue collection module to highlight closely associated
+audio segments or visual objects. For model optimization, we propose an
+object-aware adaptive-positivity learning strategy that selects the highly
+semantic-matched multi-modal pair as positivity. Specifically, we design two
+object-aware contrastive loss functions to identify the highly relevant
+question-object pairs and audio-object pairs, respectively. These selected
+pairs are constrained to have larger similarity values than the mismatched
+pairs. The positivity-selecting process is adaptive as the positivity pairs
+selected in each video frame may be different. These two object-aware
+objectives help the model understand which objects are exactly relevant to the
+question and which are making sounds. Extensive experiments on the MUSIC-AVQA
+dataset demonstrate the proposed method is effective in finding favorable
+audio-visual clues and also achieves new state-of-the-art question-answering
+performance.
+
+
+
+ comment: Accepted by AAAI-2024
+
+
+
+
+
+
+ ☆ OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
+ Semantic Understanding in Mixed Reality
+
+
+ One key challenge in augmented reality is the placement of virtual content in
+natural locations. Existing automated techniques are only able to work with a
+closed-vocabulary, fixed set of objects. In this paper, we introduce a new
+open-vocabulary method for object placement. Our eight-stage pipeline leverages
+recent advances in segmentation models, vision-language models, and LLMs to
+place any virtual object in any AR camera frame or scene. In a preliminary user
+study, we show that our method performs at least as well as human experts 57%
+of the time.
+
+
+
+ comment: IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
+ 2023
+
+
+
+
+
+
+ ☆ All but One: Surgical Concept Erasing with Model Preservation in
+ Text-to-Image Diffusion Models
+
+
+
+
+
+
+
+
+ Seunghoo Hong, Juhun Lee, Simon S. Woo
+
+
+ Text-to-Image models such as Stable Diffusion have shown impressive image
+generation synthesis, thanks to the utilization of large-scale datasets.
+However, these datasets may contain sexually explicit, copyrighted, or
+undesirable content, which allows the model to directly generate them. Given
+that retraining these large models on individual concept deletion requests is
+infeasible, fine-tuning algorithms have been developed to tackle concept
+erasing in diffusion models. While these algorithms yield good concept erasure,
+they all present one of the following issues: 1) the corrupted feature space
+yields synthesis of disintegrated objects, 2) the initially synthesized content
+undergoes a divergence in both spatial structure and semantics in the generated
+images, and 3) sub-optimal training updates heighten the model's susceptibility
+to utility harm. These issues severely degrade the original utility of
+generative models. In this work, we present a new approach that solves all of
+these challenges. We take inspiration from the concept of classifier guidance
+and propose a surgical update on the classifier guidance term while
+constraining the drift of the unconditional score term. Furthermore, our
+algorithm empowers the user to select an alternative to the erasing concept,
+allowing for more controllability. Our experimental results show that our
+algorithm not only erases the target concept effectively but also preserves the
+model's generation capability.
+
+
+
+ comment: Main paper with supplementary materials
+
+
+
+
+
+
+ ☆ Multi-stages attention Breast cancer classification based on nonlinear
+ spiking neural P neurons with autapses
+
+
+
+
+
+
+
+
+ Bo Yang, Hong Peng, Xiaohui Luo, Jun Wang, Xianzhong Long
+
+
+ Breast cancer(BC) is a prevalent type of malignant tumor in women. Early
+diagnosis and treatment are vital for enhancing the patients' survival rate.
+Downsampling in deep networks may lead to loss of information, so for
+compensating the detail and edge information and allowing convolutional neural
+networks to pay more attention to seek the lesion region, we propose a
+multi-stages attention architecture based on NSNP neurons with autapses. First,
+unlike the single-scale attention acquisition methods of existing methods, we
+set up spatial attention acquisition at each feature map scale of the
+convolutional network to obtain an fusion global information on attention
+guidance. Then we introduce a new type of NSNP variants called NSNP neurons
+with autapses. Specifically, NSNP systems are modularized as feature encoders,
+recoding the features extracted from convolutional neural network as well as
+the fusion of attention information and preserve the key characteristic
+elements in feature maps. This ensures the retention of valuable data while
+gradually transforming high-dimensional complicated info into low-dimensional
+ones. The proposed method is evaluated on the public dataset BreakHis at
+various magnifications and classification tasks. It achieves a classification
+accuracy of 96.32% at all magnification cases, outperforming state-of-the-art
+methods. Ablation studies are also performed, verifying the proposed model's
+efficacy. The source code is available at
+XhuBobYoung/Breast-cancer-Classification.
+
+
+
+
+
+
+
+ ☆ SLP-Net:An efficient lightweight network for segmentation of skin
+ lesions
+
+
+
+
+
+
+
+
+ Bo Yang, Hong Peng, Chenggang Guo, Xiaohui Luo, Jun Wang, Xianzhong Long
+
+
+ Prompt treatment for melanoma is crucial. To assist physicians in identifying
+lesion areas precisely in a quick manner, we propose a novel skin lesion
+segmentation technique namely SLP-Net, an ultra-lightweight segmentation
+network based on the spiking neural P(SNP) systems type mechanism. Most
+existing convolutional neural networks achieve high segmentation accuracy while
+neglecting the high hardware cost. SLP-Net, on the contrary, has a very small
+number of parameters and a high computation speed. We design a lightweight
+multi-scale feature extractor without the usual encoder-decoder structure.
+Rather than a decoder, a feature adaptation module is designed to replace it
+and implement multi-scale information decoding. Experiments at the ISIC2018
+challenge demonstrate that the proposed model has the highest Acc and DSC among
+the state-of-the-art methods, while experiments on the PH2 dataset also
+demonstrate a favorable generalization ability. Finally, we compare the
+computational complexity as well as the computational speed of the models in
+experiments, where SLP-Net has the highest overall superiority
+
+
+
+
+
+
+
+ ☆ Segmenting Messy Text: Detecting Boundaries in Text Derived from
+ Historical Newspaper Images
+
+
+ Text segmentation, the task of dividing a document into sections, is often a
+prerequisite for performing additional natural language processing tasks.
+Existing text segmentation methods have typically been developed and tested
+using clean, narrative-style text with segments containing distinct topics.
+Here we consider a challenging text segmentation task: dividing newspaper
+marriage announcement lists into units of one announcement each. In many cases
+the information is not structured into sentences, and adjacent segments are not
+topically distinct from each other. In addition, the text of the announcements,
+which is derived from images of historical newspapers via optical character
+recognition, contains many typographical errors. As a result, these
+announcements are not amenable to segmentation with existing techniques. We
+present a novel deep learning-based model for segmenting such text and show
+that it significantly outperforms an existing state-of-the-art method on our
+task.
+
+
+
+
+
+
+
+
+ Jingwen Ye, Ruonan Yu, Songhua Liu, Xinchao Wang
+
+
+ Adversarial attacks constitute a notable threat to machine learning systems,
+given their potential to induce erroneous predictions and classifications.
+However, within real-world contexts, the essential specifics of the deployed
+model are frequently treated as a black box, consequently mitigating the
+vulnerability to such attacks. Thus, enhancing the transferability of the
+adversarial samples has become a crucial area of research, which heavily relies
+on selecting appropriate surrogate models. To address this challenge, we
+propose a novel approach that generates adversarial attacks in a
+mutual-modality optimization scheme. Our approach is accomplished by leveraging
+the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean
+image that causes semantic perturbations on the aligned embedding space with
+the other textual modality. Then, we apply the corresponding defense on the
+textual modality by updating the prompts, which forces the re-matching on the
+perturbed embedding space. Finally, to enhance the attack transferability, we
+utilize the iterative training strategy on the visual attack and the textual
+defense, where the two processes optimize from each other. We evaluate our
+approach on several benchmark datasets and demonstrate that our mutual-modal
+attack strategy can effectively produce high-transferable attacks, which are
+stable regardless of the target networks. Our approach outperforms
+state-of-the-art attack methods and can be readily deployed as a plug-and-play
+solution.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ AMD:Anatomical Motion Diffusion with Interpretable Motion Decomposition
+ and Fusion
+
+
+ Generating realistic human motion sequences from text descriptions is a
+challenging task that requires capturing the rich expressiveness of both
+natural language and human motion.Recent advances in diffusion models have
+enabled significant progress in human motion synthesis.However, existing
+methods struggle to handle text inputs that describe complex or long motions.In
+this paper, we propose the Adaptable Motion Diffusion (AMD) model, which
+leverages a Large Language Model (LLM) to parse the input text into a sequence
+of concise and interpretable anatomical scripts that correspond to the target
+motion.This process exploits the LLM's ability to provide anatomical guidance
+for complex motion synthesis.We then devise a two-branch fusion scheme that
+balances the influence of the input text and the anatomical scripts on the
+inverse diffusion process, which adaptively ensures the semantic fidelity and
+diversity of the synthesized motion.Our method can effectively handle texts
+with complex or long motion descriptions, where existing methods often fail.
+Experiments on datasets with relatively more complex motions, such as CLCD1 and
+CLCD2, demonstrate that our AMD significantly outperforms existing
+state-of-the-art models.
+
+
+ Recently, CLIP has found practical utility in the domain of pixel-level
+zero-shot segmentation tasks. The present landscape features two-stage
+methodologies beset by issues such as intricate pipelines and elevated
+computational costs. While current one-stage approaches alleviate these
+concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's
+generalization capacity, they still fall short in fully harnessing CLIP's
+potential for pixel-level unseen class demarcation and precise pixel
+predictions. To further stimulate CLIP's zero-shot dense prediction capability,
+we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from
+image to pixel. Specifically, we initially introduce Spectral Prompt Tuning
+(SPT), incorporating spectral prompts into the CLIP visual encoder's shallow
+layers to capture structural intricacies of images, thereby enhancing
+comprehension of unseen classes. Subsequently, we introduce the Spectral Guided
+Decoder (SGD), utilizing both high and low-frequency information to steer the
+network's spatial focus towards more prominent classification features,
+enabling precise pixel-level prediction outcomes. Through extensive experiments
+on two public datasets, we demonstrate the superiority of our method over
+state-of-the-art approaches, performing well across all classes and
+particularly excelling in handling unseen classes. Code is available
+at:https://github.com/clearxu/SPT.
+
+
+
+ comment: AAAI2024 Accepted
+
+
+
+
+
+
+ ☆ PointeNet: A Lightweight Framework for Effective and Efficient Point
+ Cloud Analysis
+
+
+ Current methodologies in point cloud analysis predominantly explore 3D
+geometries, often achieved through the introduction of intricate learnable
+geometric extractors in the encoder or by deepening networks with repeated
+blocks. However, these approaches inevitably lead to a significant number of
+learnable parameters, resulting in substantial computational costs and imposing
+memory burdens on CPU/GPU. Additionally, the existing strategies are primarily
+tailored for object-level point cloud classification and segmentation tasks,
+with limited extensions to crucial scene-level applications, such as autonomous
+driving. In response to these limitations, we introduce PointeNet, an efficient
+network designed specifically for point cloud analysis. PointeNet distinguishes
+itself with its lightweight architecture, low training cost, and plug-and-play
+capability, effectively capturing representative features. The network consists
+of a Multivariate Geometric Encoding (MGE) module and an optional
+Distance-aware Semantic Enhancement (DSE) module. The MGE module employs
+operations of sampling, grouping, and multivariate geometric aggregation to
+lightweightly capture and adaptively aggregate multivariate geometric features,
+providing a comprehensive depiction of 3D geometries. The DSE module, designed
+for real-world autonomous driving scenarios, enhances the semantic perception
+of point clouds, particularly for distant points. Our method demonstrates
+flexibility by seamlessly integrating with a classification/segmentation head
+or embedding into off-the-shelf 3D object detection networks, achieving notable
+performance improvements at a minimal cost. Extensive experiments on
+object-level datasets, including ModelNet40, ScanObjectNN, ShapeNetPart, and
+the scene-level dataset KITTI, demonstrate the superior performance of
+PointeNet over state-of-the-art methods in point cloud analysis.
+
+
+ This work introduces a new Transformer model called Cached Transformer, which
+uses Gated Recurrent Cached (GRC) attention to extend the self-attention
+mechanism with a differentiable memory cache of tokens. GRC attention enables
+attending to both past and current tokens, increasing the receptive field of
+attention and allowing for exploring long-range dependencies. By utilizing a
+recurrent gating unit to continuously update the cache, our model achieves
+significant advancements in \textbf{six} language and vision tasks, including
+language modeling, machine translation, ListOPs, image classification, object
+detection, and instance segmentation. Furthermore, our approach surpasses
+previous memory-based techniques in tasks such as language modeling and
+displays the ability to be applied to a broader range of situations.
+
+
+ Semantic segmentation of remote sensing images plays a vital role in a wide
+range of Earth Observation (EO) applications, such as land use land cover
+mapping, environment monitoring, and sustainable development. Driven by rapid
+developments in Artificial Intelligence (AI), deep learning (DL) has emerged as
+the mainstream tool for semantic segmentation and achieved many breakthroughs
+in the field of remote sensing. However, the existing DL-based methods mainly
+focus on unimodal visual data while ignoring the rich multimodal information
+involved in the real world, usually demonstrating weak reliability and
+generlization. Inspired by the success of Vision Transformers and large
+language models, we propose a novel metadata-collaborative multimodal
+segmentation network (MetaSegNet) that applies vision-language representation
+learning for semantic segmentation of remote sensing images. Unlike the common
+model structure that only uses unimodal visual data, we extract the key
+characteristic (i.e. the climate zone) from freely available remote sensing
+image metadata and transfer it into knowledge-based text prompts via the
+generic ChatGPT. Then, we construct an image encoder, a text encoder and a
+crossmodal attention fusion subnetwork to extract the image and text feature
+and apply image-text interaction. Benefiting from such a design, the proposed
+MetaSegNet demonstrates superior generalization and achieves competitive
+accuracy with state-of-the-art semantic segmentation methods on the large-scale
+OpenEarthMap dataset (68.6% mIoU) and Potsdam dataset (93.3% mean F1 score) as
+well as LoveDA dataset (52.2% mIoU).
+
+
+
+
+
+
+
+ ☆ A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
+
+
+
+
+
+
+
+
+ Julio Silva-Rodriguez, Sina Hajimiri, Ismail Ben Ayed, Jose Dolz
+
+
+ Efficient transfer learning (ETL) is receiving increasing attention to adapt
+large pre-trained language-vision models on downstream tasks with a few labeled
+samples. While significant progress has been made, we reveal that
+state-of-the-art ETL approaches exhibit strong performance only in
+narrowly-defined experimental setups, and with a careful adjustment of
+hyperparameters based on a large corpus of labeled samples. In particular, we
+make two interesting, and surprising empirical observations. First, to
+outperform a simple Linear Probing baseline, these methods require to optimize
+their hyper-parameters on each target task. And second, they typically
+underperform -- sometimes dramatically -- standard zero-shot predictions in the
+presence of distributional drifts. Motivated by the unrealistic assumptions
+made in the existing literature, i.e., access to a large validation set and
+case-specific grid-search for optimal hyperparameters, we propose a novel
+approach that meets the requirements of real-world scenarios. More concretely,
+we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing
+term is optimized via an adaptation of the general Augmented Lagrangian method
+tailored to this context. We comprehensively evaluate CLAP on a broad span of
+datasets and scenarios, demonstrating that it consistently outperforms SoTA
+approaches, while yet being a much more efficient alternative.
+
+
+
+ comment: Code available at https://github.com/jusiro/CLAP
+
+
+
+
+
+
+
+ Haoxing Chen, Yaohui Li, Zhangxuan Gu, Zhuoer Xu, Jun Lan, Huaxiong Li
+
+
+ Image harmonization is a crucial technique in image composition that aims to
+seamlessly match the background by adjusting the foreground of composite
+images. Current methods adopt either global-level or pixel-level feature
+matching. Global-level feature matching ignores the proximity prior, treating
+foreground and background as separate entities. On the other hand, pixel-level
+feature matching loses contextual information. Therefore, it is necessary to
+use the information from semantic maps that describe different objects to guide
+harmonization. In this paper, we propose Semantic-guided Region-aware Instance
+Normalization (SRIN) that can utilize the semantic segmentation maps output by
+a pre-trained Segment Anything Model (SAM) to guide the visual consistency
+learning of foreground and background features. Abundant experiments
+demonstrate the superiority of our method for image harmonization over
+state-of-the-art methods.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ☆ Reducing Shape-Radiance Ambiguity in Radiance Fields with a Closed-Form
+ Color Estimation Method NeurIPS 2023
+
+
+ Neural radiance field (NeRF) enables the synthesis of cutting-edge realistic
+novel view images of a 3D scene. It includes density and color fields to model
+the shape and radiance of a scene, respectively. Supervised by the photometric
+loss in an end-to-end training manner, NeRF inherently suffers from the
+shape-radiance ambiguity problem, i.e., it can perfectly fit training views but
+does not guarantee decoupling the two fields correctly. To deal with this
+issue, existing works have incorporated prior knowledge to provide an
+independent supervision signal for the density field, including total variation
+loss, sparsity loss, distortion loss, etc. These losses are based on general
+assumptions about the density field, e.g., it should be smooth, sparse, or
+compact, which are not adaptive to a specific scene. In this paper, we propose
+a more adaptive method to reduce the shape-radiance ambiguity. The key is a
+rendering method that is only based on the density field. Specifically, we
+first estimate the color field based on the density field and posed images in a
+closed form. Then NeRF's rendering process can proceed. We address the problems
+in estimating the color field, including occlusion and non-uniformly
+distributed views. Afterward, it is applied to regularize NeRF's density field.
+As our regularization is guided by photometric loss, it is more adaptive
+compared to existing ones. Experimental results show that our method improves
+the density field of NeRF both qualitatively and quantitatively. Our code is
+available at https://github.com/qihangGH/Closed-form-color-field.
+
+
+
+ comment: This work has been published in NeurIPS 2023
+
+
+
+
+
+
+ ☆ Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual
+ Question Answering
+
+
+
+
+
+
+
+
+ Chengxiang Yin, Zhengping Che, Kun Wu, Zhiyuan Xu, Jian Tang
+
+
+ Visual Question Answering (VQA) has emerged as one of the most challenging
+tasks in artificial intelligence due to its multi-modal nature. However, most
+existing VQA methods are incapable of handling Knowledge-based Visual Question
+Answering (KB-VQA), which requires external knowledge beyond visible contents
+to answer questions about a given image. To address this issue, we propose a
+novel framework that endows the model with capabilities of answering more
+general questions, and achieves a better exploitation of external knowledge
+through generating Multiple Clues for Reasoning with Memory Neural Networks
+(MCR-MemNN). Specifically, a well-defined detector is adopted to predict
+image-question related relation phrases, each of which delivers two
+complementary clues to retrieve the supporting facts from external knowledge
+base (KB), which are further encoded into a continuous embedding space using a
+content-addressable memory. Afterwards, mutual interactions between
+visual-semantic representation and the supporting facts stored in memory are
+captured to distill the most relevant information in three modalities (i.e.,
+image, question, and KB). Finally, the optimal answer is predicted by choosing
+the supporting fact with the highest score. We conduct extensive experiments on
+two widely-used benchmarks. The experimental results well justify the
+effectiveness of MCR-MemNN, as well as its superiority over other KB-VQA
+methods.
+
+
+
+
+
+
+
+ ☆ Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class
+ Incremental Learning AAAI 2024
+
+
+ Non-exemplar class incremental learning aims to learn both the new and old
+tasks without accessing any training data from the past. This strict
+restriction enlarges the difficulty of alleviating catastrophic forgetting
+since all techniques can only be applied to current task data. Considering this
+challenge, we propose a novel framework of fine-grained knowledge selection and
+restoration. The conventional knowledge distillation-based methods place too
+strict constraints on the network parameters and features to prevent
+forgetting, which limits the training of new tasks. To loose this constraint,
+we proposed a novel fine-grained selective patch-level distillation to
+adaptively balance plasticity and stability. Some task-agnostic patches can be
+used to preserve the decision boundary of the old task. While some patches
+containing the important foreground are favorable for learning the new task.
+ Moreover, we employ a task-agnostic mechanism to generate more realistic
+prototypes of old tasks with the current task sample for reducing classifier
+bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100,
+TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method.
+Code is available at https://github.com/scok30/vit-cil.
+
+
+
+ comment: to appear at AAAI 2024
+
+
+
+
+
+
+ ☆ Cross-Modal Reasoning with Event Correlation for Video Question
+ Answering
+
+
+
+
+
+
+
+
+ Chengxiang Yin, Zhengping Che, Kun Wu, Zhiyuan Xu, Qinru Qiu, Jian Tang
+
+
+ Video Question Answering (VideoQA) is a very attractive and challenging
+research direction aiming to understand complex semantics of heterogeneous data
+from two domains, i.e., the spatio-temporal video content and the word sequence
+in question. Although various attention mechanisms have been utilized to manage
+contextualized representations by modeling intra- and inter-modal relationships
+of the two modalities, one limitation of the predominant VideoQA methods is the
+lack of reasoning with event correlation, that is, sensing and analyzing
+relationships among abundant and informative events contained in the video. In
+this paper, we introduce the dense caption modality as a new auxiliary and
+distill event-correlated information from it to infer the correct answer. To
+this end, we propose a novel end-to-end trainable model, Event-Correlated Graph
+Neural Networks (EC-GNNs), to perform cross-modal reasoning over information
+from the three modalities (i.e., caption, video, and question). Besides the
+exploitation of a brand new modality, we employ cross-modal reasoning modules
+for explicitly modeling inter-modal relationships and aggregating relevant
+information across different modalities, and we propose a question-guided
+self-adaptive multi-modal fusion module to collect the question-oriented and
+event-correlated evidence through multi-step reasoning. We evaluate our model
+on two widely-used benchmark datasets and conduct an ablation study to justify
+the effectiveness of each proposed component.
+
+
+
+
+
+
+
+ ☆ AdvST: Revisiting Data Augmentations for Single Domain Generalization AAAI 2024
+
+
+ Single domain generalization (SDG) aims to train a robust model against
+unknown target domain shifts using data from a single source domain. Data
+augmentation has been proven an effective approach to SDG. However, the utility
+of standard augmentations, such as translate, or invert, has not been fully
+exploited in SDG; practically, these augmentations are used as a part of a data
+preprocessing procedure. Although it is intuitive to use many such
+augmentations to boost the robustness of a model to out-of-distribution domain
+shifts, we lack a principled approach to harvest the benefit brought from
+multiple these augmentations. Here, we conceptualize standard data
+augmentations with learnable parameters as semantics transformations that can
+manipulate certain semantics of a sample, such as the geometry or color of an
+image. Then, we propose Adversarial learning with Semantics Transformations
+(AdvST) that augments the source domain data with semantics transformations and
+learns a robust model with the augmented data. We theoretically show that AdvST
+essentially optimizes a distributionally robust optimization objective defined
+on a set of semantics distributions induced by the parameters of semantics
+transformations. We demonstrate that AdvST can produce samples that expand the
+coverage on target domain data. Compared with the state-of-the-art methods,
+AdvST, despite being a simple method, is surprisingly competitive and achieves
+the best average SDG performance on the Digits, PACS, and DomainNet datasets.
+Our code is available at https://github.com/gtzheng/AdvST.
+
+
+
+
+
+
+
+
+ Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran
+
+
+ We propose a novel VQA dataset, based on picture stories designed for
+educating young children, that aims to facilitate comprehensive evaluation and
+characterization of vision-language models on comprehension tasks. Unlike
+current VQA datasets that often focus on fact-based memorization and simple
+reasoning tasks without principled scientific grounding, we collect data
+containing tasks reflecting different levels of comprehension and underlying
+cognitive processes, as laid out in Bloom's Taxonomy, a classic framework
+widely adopted in education research. The proposed BloomVQA dataset can be
+mapped to a hierarchical graph-based representation of visual stories, enabling
+automatic data augmentation and novel measures characterizing model consistency
+across the underlying taxonomy. We demonstrate graded evaluation and
+reliability analysis based on our proposed consistency metrics on
+state-of-the-art vision-language models. Our results suggest that, while
+current models achieve the most gain on low-level comprehension tasks, they
+generally fall short on high-level tasks requiring more advanced comprehension
+and cognitive skills, as 38.0% drop in VQA accuracy is observed comparing
+lowest and highest level tasks. Furthermore, current models show consistency
+patterns misaligned with human comprehension in various scenarios, suggesting
+emergent structures of model behaviors.
+
+
+
+
+
+
+
+ ☆ How Good Are Deep Generative Models for Solving Inverse Problems?
+
+
+
+
+
+
+
+
+ Shichong Peng, Alireza Moazeni, Ke Li
+
+
+ Deep generative models, such as diffusion models, GANs, and IMLE, have shown
+impressive capability in tackling inverse problems. However, the validity of
+model-generated solutions w.r.t. the forward problem and the reliability of
+associated uncertainty estimates remain understudied. This study evaluates
+recent diffusion-based, GAN-based, and IMLE-based methods on three inverse
+problems, i.e., $16\times$ super-resolution, colourization, and image
+decompression. We assess the validity of these models' outputs as solutions to
+the inverse problems and conduct a thorough analysis of the reliability of the
+models' estimates of uncertainty over the solution. Overall, we find that the
+IMLE-based CHIMLE method outperforms other methods in terms of producing valid
+solutions and reliable uncertainty estimates.
+
+
+
+
+
+
+
+ ☆ Trajectory Approximation of Video Based on Phase Correlation for Forward
+ Facing Camera
+
+
+ In this paper, we introduce an innovative approach for extracting
+trajectories from a camera sensor in GPS-denied environments, leveraging visual
+odometry. The system takes video footage captured by a forward-facing camera
+mounted on a vehicle as input, with the output being a chain code representing
+the camera's trajectory. The proposed methodology involves several key steps.
+Firstly, we employ phase correlation between consecutive frames of the video to
+extract essential information. Subsequently, we introduce a novel chain code
+method termed "dynamic chain code," which is based on the x-shift values
+derived from the phase correlation. The third step involves determining
+directional changes (forward, left, right) by establishing thresholds and
+extracting the corresponding chain code. This extracted code is then stored in
+a buffer for further processing. Notably, our system outperforms traditional
+methods reliant on spatial features, exhibiting greater speed and robustness in
+noisy environments. Importantly, our approach operates without external camera
+calibration information. Moreover, by incorporating visual odometry, our system
+enhances its accuracy in estimating camera motion, providing a more
+comprehensive understanding of trajectory dynamics. Finally, the system
+culminates in the visualization of the normalized camera motion trajectory.
+
+
+
+
+
+
+
+ ☆ Embedded Shape Matching in Photogrammetry Data for Modeling Making
+ Knowledge
+
+
+ In three-dimensional models obtained by photogrammetry of existing
+structures, all of the shapes that the eye can select cannot always find their
+equivalents in the geometric components of the model. However, the matching of
+meaningful parts and assemblages with the records acquired with rapid and
+detailed documentation methods will provide an advantage for the creation of
+information models of existing structures. While aiming to produce answers to
+this problem and in order to overcome the difficulties of pattern recognition
+in three-dimensional models, we used two-dimensional samples obtained by
+projection. Processing techniques such as ambient occlusion, curvature and
+normal maps are commonly used in modern computer graphics applications that
+enable the representation of three-dimensional surface properties in
+two-dimensional data sets. The method we propose is based on the recognition of
+patterns through these mappings instead of the usual light-based visualization.
+The first stage of the application is photogrammetric capture of a few examples
+of Zeugma mosaics and three-dimensional digital modeling of a set of Seljuk era
+brick walls based on knowledge obtained through architectural history
+literature. The second stage covers the creation of digital models byprocessing
+the surface representation obtained from this data using Alice Vision,
+OpenCV-Python, and Autodesk Maya to include information on aspects of the
+making of the walls. What is envisioned for the next stages is that the mapping
+data contributes and supports the knowledge for rule-based design and making
+processesof cultural heritage.
+
+
+
+ comment: 9 pages, in Turkish language. 6 figures. In: MSTAS 2019 - (XIII.
+ Computational Design in Architecture National Symposium) pp. 313-326.,
+ Kocaeli, Turkey (2019)
+
+ We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that
+integrates learning-based sparse visual odometry for low-latency camera
+tracking and a neural radiance scene representation for sophisticated dense
+reconstruction and novel view synthesis. Our system initializes camera poses
+using sparse visual odometry and obtains view-dependent dense geometry priors
+from a monocular depth prediction network. We harmonize the scale of poses and
+dense geometry, treating them as supervisory cues to train a neural implicit
+scene representation. NeRF-VO demonstrates exceptional performance in both
+photometric and geometric fidelity of the scene representation by jointly
+optimizing a sliding window of keyframed poses and the underlying dense
+geometry, which is accomplished through training the radiance field with volume
+rendering. We surpass state-of-the-art methods in pose estimation accuracy,
+novel view synthesis fidelity, and dense reconstruction quality across a
+variety of synthetic and real-world datasets, while achieving a higher camera
+tracking frequency and consuming less GPU memory.
+
+
+
+ comment: 10 tables, 4 figures
+
+
+
+
+
+
+ ☆ Neural feels with neural fields: Visuo-tactile perception for in-hand
+ manipulation
+
+
+
+
+
+
+
+
+ Sudharshan Suresh, Haozhi Qi, Tingfan Wu, Taosha Fan, Luis Pineda, Mike Lambeta, Jitendra Malik, Mrinal Kalakrishnan, Roberto Calandra, Michael Kaess, Joseph Ortiz, Mustafa Mukadam
+
+
+ To achieve human-level dexterity, robots must infer spatial awareness from
+multimodal sensing to reason over contact interactions. During in-hand
+manipulation of novel objects, such spatial awareness involves estimating the
+object's pose and shape. The status quo for in-hand perception primarily
+employs vision, and restricts to tracking a priori known objects. Moreover,
+visual occlusion of objects in-hand is imminent during manipulation, preventing
+current systems to push beyond tasks without occlusion. We combine vision and
+touch sensing on a multi-fingered hand to estimate an object's pose and shape
+during in-hand manipulation. Our method, NeuralFeels, encodes object geometry
+by learning a neural field online and jointly tracks it by optimizing a pose
+graph problem. We study multimodal in-hand perception in simulation and the
+real-world, interacting with different objects via a proprioception-driven
+policy. Our experiments show final reconstruction F-scores of $81$% and average
+pose drifts of $4.7\,\text{mm}$, further reduced to $2.3\,\text{mm}$ with known
+CAD models. Additionally, we observe that under heavy visual occlusion we can
+achieve up to $94$% improvements in tracking compared to vision-only methods.
+Our results demonstrate that touch, at the very least, refines and, at the very
+best, disambiguates visual estimates during in-hand manipulation. We release
+our evaluation dataset of 70 experiments, FeelSight, as a step towards
+benchmarking in this domain. Our neural representation driven by multimodal
+sensing can serve as a perception backbone towards advancing robot dexterity.
+Videos can be found on our project website
+https://suddhu.github.io/neural-feels/
+
+
+ Detecting lane lines from sensors is becoming an increasingly significant
+part of autonomous driving systems. However, less development has been made on
+high-definition lane-level mapping based on aerial images, which could
+automatically build and update offline maps for auto-driving systems. To this
+end, our work focuses on extracting fine-level detailed lane lines together
+with their topological structures. This task is challenging since it requires
+large amounts of data covering different lane types, terrain and regions. In
+this paper, we introduce for the first time a large-scale aerial image dataset
+built for lane detection, with high-quality polyline lane annotations on
+high-resolution images of around 80 kilometers of road. Moreover, we developed
+a baseline deep learning lane detection method from aerial images, called
+AerialLaneNet, consisting of two stages. The first stage is to produce
+coarse-grained results at point level, and the second stage exploits the
+coarse-grained results and feature to perform the vertex-matching task,
+producing fine-grained lanes with topology. The experiments show our approach
+achieves significant improvement compared with the state-of-the-art methods on
+our new dataset. Our code and new dataset are available at
+https://github.com/Jiawei-Yao0812/AerialLaneNet.
+
+
+ Geometric transformations have been widely used to augment the size of
+training images. Existing methods often assume a unimodal distribution of the
+underlying transformations between images, which limits their power when data
+with multimodal distributions occur. In this paper, we propose a novel model,
+Multimodal Geometric Augmentation (MGAug), that for the first time generates
+augmenting transformations in a multimodal latent space of geometric
+deformations. To achieve this, we first develop a deep network that embeds the
+learning of latent geometric spaces of diffeomorphic transformations (a.k.a.
+diffeomorphisms) in a variational autoencoder (VAE). A mixture of multivariate
+Gaussians is formulated in the tangent space of diffeomorphisms and serves as a
+prior to approximate the hidden distribution of image transformations. We then
+augment the original training dataset by deforming images using randomly
+sampled transformations from the learned multimodal latent space of VAE. To
+validate the efficiency of our model, we jointly learn the augmentation
+strategy with two distinct domain-specific tasks: multi-class classification on
+2D synthetic datasets and segmentation on real 3D brain magnetic resonance
+images (MRIs). We also compare MGAug with state-of-the-art transformation-based
+image augmentation algorithms. Experimental results show that our proposed
+approach outperforms all baselines by significantly improved prediction
+accuracy. Our code is publicly available at
+https://github.com/tonmoy-hossain/MGAug.
+
+
+
+
+
+
+
+ ☆ Texture Matching GAN for CT Image Enhancement
+
+
+
+
+
+
+
+
+ Madhuri Nagare, Gregery T. Buzzard, Charles A. Bouman
+
+
+ Deep neural networks (DNN) are commonly used to denoise and sharpen X-ray
+computed tomography (CT) images with the goal of reducing patient X-ray dosage
+while maintaining reconstruction quality. However, naive application of
+DNN-based methods can result in image texture that is undesirable in clinical
+applications. Alternatively, generative adversarial network (GAN) based methods
+can produce appropriate texture, but naive application of GANs can introduce
+inaccurate or even unreal image detail. In this paper, we propose a texture
+matching generative adversarial network (TMGAN) that enhances CT images while
+generating an image texture that can be matched to a target texture. We use
+parallel generators to separate anatomical features from the generated texture,
+which allows the GAN to be trained to match the desired texture without
+directly affecting the underlying CT image. We demonstrate that TMGAN generates
+enhanced image quality while also producing image texture that is desirable for
+clinical application.
+
+
+
+ comment: Submitted to IEEE Transactions on Medical Imaging
+
+
+
+
+
+
+ ☆ EPNet: An Efficient Pyramid Network for Enhanced Single-Image
+ Super-Resolution with Reduced Computational Requirements
+
+
+ Single-image super-resolution (SISR) has seen significant advancements
+through the integration of deep learning. However, the substantial
+computational and memory requirements of existing methods often limit their
+practical application. This paper introduces a new Efficient Pyramid Network
+(EPNet) that harmoniously merges an Edge Split Pyramid Module (ESPM) with a
+Panoramic Feature Extraction Module (PFEM) to overcome the limitations of
+existing methods, particularly in terms of computational efficiency. The ESPM
+applies a pyramid-based channel separation strategy, boosting feature
+extraction while maintaining computational efficiency. The PFEM, a novel fusion
+of CNN and Transformer structures, enables the concurrent extraction of local
+and global features, thereby providing a panoramic view of the image landscape.
+Our architecture integrates the PFEM in a manner that facilitates the
+streamlined exchange of feature information and allows for the further
+refinement of image texture details. Experimental results indicate that our
+model outperforms existing state-of-the-art methods in image resolution
+quality, while considerably decreasing computational and memory costs. This
+research contributes to the ongoing evolution of efficient and practical SISR
+methodologies, bearing broader implications for the field of computer vision.
+
+
+
+
+
+
+
+
+ David Pujol-Perich, Albert Clapés, Sergio Escalera
+
+
+ Temporal Action Localization (TAL) is a complex task that poses relevant
+challenges, particularly when attempting to generalize on new -- unseen --
+domains in real-world applications. These scenarios, despite realistic, are
+often neglected in the literature, exposing these solutions to important
+performance degradation. In this work, we tackle this issue by introducing, for
+the first time, an approach for Unsupervised Domain Adaptation (UDA) in sparse
+TAL, which we refer to as Semantic Adversarial unsupervised Domain Adaptation
+(SADA). Our contribution is threefold: (1) we pioneer the development of a
+domain adaptation model that operates on realistic sparse action detection
+benchmarks; (2) we tackle the limitations of global-distribution alignment
+techniques by introducing a novel adversarial loss that is sensitive to local
+class distributions, ensuring finer-grained adaptation; and (3) we present a
+novel experimental setup, based on EpicKitchens100, that evaluates multiple
+types of domain shifts in a comprehensive manner. Our experimental results
+indicate that SADA improves the adaptation across domains when compared to
+fully supervised state-of-the-art and alternative UDA methods, attaining a
+relative performance boost of up to 14%.
+
+
+
+
+
+
+
+ ♻ ☆ Integrating Human Vision Perception in Vision Transformers for
+ Classifying Waste Items
+
+
+ In this paper, we propose an novel methodology aimed at simulating the
+learning phenomenon of nystagmus through the application of differential
+blurring on datasets. Nystagmus is a biological phenomenon that influences
+human vision throughout life, notably by diminishing head shake from infancy to
+adulthood. Leveraging this concept, we address the issue of waste
+classification, a pressing global concern. The proposed framework comprises two
+modules, with the second module closely resembling the original Vision
+Transformer, a state-of-the-art model model in classification tasks. The
+primary motivation behind our approach is to enhance the model's precision and
+adaptability, mirroring the real-world conditions that the human visual system
+undergoes. This novel methodology surpasses the standard Vision Transformer
+model in waste classification tasks, exhibiting an improvement with a margin of
+2%. This improvement underscores the potential of our methodology in improving
+model precision by drawing inspiration from human vision perception. Further
+research in the proposed methodology could yield greater performance results,
+and can be extrapolated to other global issues.
+
+
+ We propose a PnP algorithm for a camera constrained to two-dimensional
+movement (applicable, for instance, to many wheeled robotics platforms).
+Leveraging this assumption allows performance improvements over 3D PnP
+algorithms due to the reduction in search space dimensionality. It also reduces
+the incidence of ambiguous pose estimates (as, in most cases, the spurious
+solutions fall outside the plane of movement). Our algorithm finds an
+approximate solution using geometric criteria and refines its prediction
+iteratively. We compare this algorithm to existing 3D PnP algorithms in terms
+of accuracy, performance, and robustness to noise.
+
+
+
+ comment: 4 pages, 3 figures. Improved testing figures from version 1
+
+
+
+
+
+
+
+ Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
+
+
+ The recent popularity of text-to-image diffusion models (DM) can largely be
+attributed to the intuitive interface they provide to users. The intended
+generation can be expressed in natural language, with the model producing
+faithful interpretations of text prompts. However, expressing complex or
+nuanced ideas in text alone can be difficult. To ease image generation, we
+propose MultiFusion that allows one to express complex and nuanced concepts
+with arbitrarily interleaved inputs of multiple modalities and languages.
+MutliFusion leverages pre-trained models and aligns them for integration into a
+cohesive system, thereby avoiding the need for extensive training from scratch.
+Our experimental results demonstrate the efficient transfer of capabilities
+from individual modules to the downstream model. Specifically, the fusion of
+all independent components allows the image generation module to utilize
+multilingual, interleaved multimodal inputs despite being trained solely on
+monomodal data in a single language.
+
+
+
+ comment: Proceedings of Advances in Neural Information Processing Systems:
+ Annual Conference on Neural Information Processing Systems (NeurIPS)
+
+
+
+
+
+
+
+ Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason
+
+
+ We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for
+evaluating language-guided agents navigating in a persistent environment over
+time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the
+agent's memory at the beginning of every episode, testing the ability to
+perform cold-start navigation with no prior information. However, deployed
+robots occupy the same environment for long periods of time. The IVLN paradigm
+addresses this disparity by training and evaluating VLN agents that maintain
+memory across tours of scenes that consist of up to 100 ordered
+instruction-following Room-to-Room (R2R) episodes, each defined by an
+individual language instruction and a target path. We present discrete and
+continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours
+each in 80 indoor scenes. We find that extending the implicit memory of
+high-performing transformer VLN agents is not sufficient for IVLN, but agents
+that build maps can benefit from environment persistence, motivating a renewed
+focus on map-building agents in VLN.
+
+
+
+ comment: Accepted by CVPR 2023
+
+
+
+
+
+
+ ♻ ☆ Re-Evaluating LiDAR Scene Flow for Autonomous Driving WACV 2024
+
+
+
+
+
+
+
+
+ Nathaniel Chodosh, Deva Ramanan, Simon Lucey
+
+
+ Popular benchmarks for self-supervised LiDAR scene flow (stereoKITTI, and
+FlyingThings3D) have unrealistic rates of dynamic motion, unrealistic
+correspondences, and unrealistic sampling patterns. As a result, progress on
+these benchmarks is misleading and may cause researchers to focus on the wrong
+problems. We evaluate a suite of top methods on a suite of real-world datasets
+(Argoverse 2.0, Waymo, and NuScenes) and report several conclusions. First, we
+find that performance on stereoKITTI is negatively correlated with performance
+on real-world data. Second, we find that one of this task's key components --
+removing the dominant ego-motion -- is better solved by classic ICP than any
+tested method. Finally, we show that despite the emphasis placed on learning,
+most performance gains are caused by pre- and post-processing steps:
+piecewise-rigid refinement and ground removal. We demonstrate this through a
+baseline method that combines these processing steps with a learning-free
+test-time flow optimization. This baseline outperforms every evaluated method.
+
+
+
+ comment: WACV 2024
+
+
+
+
+
+
+ ♻ ☆ In Search of Projectively Equivariant Networks
+
+
+
+
+
+
+
+
+ Georg Bökman, Axel Flinth, Fredrik Kahl
+
+
+ Equivariance of linear neural network layers is well studied. In this work,
+we relax the equivariance condition to only be true in a projective sense. We
+propose a way to construct a projectively equivariant neural network through
+building a standard equivariant network where the linear group representations
+acting on each intermediate feature space are "multiplicatively modified lifts"
+of projective group representations. By theoretically studying the relation of
+projectively and linearly equivariant linear layers, we show that our approach
+is the most general possible when building a network out of linear layers. The
+theory is showcased in two simple experiments.
+
+
+
+ comment: v3: Another significant rewrite. Accepted for publication in TMLR.
+ v2: Significant rewrite. The title has been changed: "neural network" ->
+ "network". More general description of projectively equivariant linear
+ layers, with new proposed architectures, and a completely new accompanying
+ experiment section, as a result
+
+
+
+
+
+
+
+ Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
+
+
+ Multimedia generation approaches occupy a prominent place in artificial
+intelligence research. Text-to-image models achieved high-quality results over
+the last few years. However, video synthesis methods recently started to
+develop. This paper presents a new two-stage latent diffusion text-to-video
+generation architecture based on the text-to-image diffusion model. The first
+stage concerns keyframes synthesis to figure the storyline of a video, while
+the second one is devoted to interpolation frames generation to make movements
+of the scene and objects smooth. We compare several temporal conditioning
+approaches for keyframes generation. The results show the advantage of using
+separate temporal blocks over temporal layers in terms of metrics reflecting
+video generation quality aspects and human preference. The design of our
+interpolation model significantly reduces computational costs compared to other
+masked frame interpolation approaches. Furthermore, we evaluate different
+configurations of MoVQ-based video decoding scheme to improve consistency and
+achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our
+pipeline with existing solutions and achieve top-2 scores overall and top-1
+among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
+https://ai-forever.github.io/kandinsky-video/
+
+
+ Semantic segmentation of remote sensing imagery plays a pivotal role in
+extracting precise information for diverse down-stream applications. Recent
+development of the Segment Anything Model (SAM), an advanced general-purpose
+segmentation model, has revolutionized this field, presenting new avenues for
+accurate and efficient segmentation. However, SAM is limited to generating
+segmentation results without class information. Consequently, the utilization
+of such a powerful general vision model for semantic segmentation in remote
+sensing images has become a focal point of research. In this paper, we present
+a streamlined framework aimed at leveraging the raw output of SAM by exploiting
+two novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary
+(SGB). More specifically, we propose a novel object loss and further introduce
+a boundary loss as augmentative components to aid in model optimization in a
+general semantic segmentation framework. Taking into account the content
+characteristics of SGO, we introduce the concept of object consistency to
+leverage segmented regions lacking semantic information. By imposing
+constraints on the consistency of predicted values within objects, the object
+loss aims to enhance semantic segmentation performance. Furthermore, the
+boundary loss capitalizes on the distinctive features of SGB by directing the
+model's attention to the boundary information of the object. Experimental
+results on two well-known datasets, namely ISPRS Vaihingen and LoveDA Urban,
+demonstrate the effectiveness of our proposed method. The source code for this
+work will be accessible at https://github.com/sstary/SSRS.
+
+
+
+
+
+
+
+
+ Kai Liu, Sheng Jin, Zhihang Fu, Ze Chen, Rongxin Jiang, Jieping Ye
+
+
+ Without manually annotated identities, unsupervised multi-object trackers are
+inferior to learning reliable feature embeddings. It causes the
+similarity-based inter-frame association stage also be error-prone, where an
+uncertainty problem arises. The frame-by-frame accumulated uncertainty prevents
+trackers from learning the consistent feature embedding against time variation.
+To avoid this uncertainty problem, recent self-supervised techniques are
+adopted, whereas they failed to capture temporal relations. The interframe
+uncertainty still exists. In fact, this paper argues that though the
+uncertainty problem is inevitable, it is possible to leverage the uncertainty
+itself to improve the learned consistency in turn. Specifically, an
+uncertainty-based metric is developed to verify and rectify the risky
+associations. The resulting accurate pseudo-tracklets boost learning the
+feature consistency. And accurate tracklets can incorporate temporal
+information into spatial transformation. This paper proposes a tracklet-guided
+augmentation strategy to simulate tracklets' motion, which adopts a
+hierarchical uncertainty-based sampling mechanism for hard sample mining. The
+ultimate unsupervised MOT framework, namely U2MOT, is proven effective on
+MOT-Challenges and VisDrone-MOT benchmark. U2MOT achieves a SOTA performance
+among the published supervised and unsupervised trackers.
+
+
+
+ comment: Accepted by International Conference on Computer Vision (ICCV) 2023.
+ Code is available at https://github.com/alibaba/u2mot/
+
+
+
+
+
+
+ ♻ ☆ GaussianEditor: Swift and Controllable 3D Editing with Gaussian
+ Splatting
+
+
+
+
+
+
+
+
+ Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, Guosheng Lin
+
+
+ 3D editing plays a crucial role in many areas such as gaming and virtual
+reality. Traditional 3D editing methods, which rely on representations like
+meshes and point clouds, often fall short in realistically depicting complex
+scenes. On the other hand, methods based on implicit 3D representations, like
+Neural Radiance Field (NeRF), render complex scenes effectively but suffer from
+slow processing speeds and limited control over specific scene areas. In
+response to these challenges, our paper presents GaussianEditor, an innovative
+and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D
+representation. GaussianEditor enhances precision and control in editing
+through our proposed Gaussian semantic tracing, which traces the editing target
+throughout the training process. Additionally, we propose Hierarchical Gaussian
+splatting (HGS) to achieve stabilized and fine results under stochastic
+generative guidance from 2D diffusion models. We also develop editing
+strategies for efficient object removal and integration, a challenging task for
+existing methods. Our comprehensive experiments demonstrate GaussianEditor's
+superior control, efficacy, and rapid performance, marking a significant
+advancement in 3D editing. Project Page:
+https://buaacyw.github.io/gaussian-editor/
+
+
+ Purpose: Manual annotations for training deep learning (DL) models in
+auto-segmentation are time-intensive. This study introduces a hybrid
+representation-enhanced sampling strategy that integrates both density and
+diversity criteria within an uncertainty-based Bayesian active learning (BAL)
+framework to reduce annotation efforts by selecting the most informative
+training samples. Methods: The experiments are performed on two lower extremity
+(LE) datasets of MRI and CT images, focusing on the segmentation of the femur,
+pelvis, sacrum, quadriceps femoris, hamstrings, adductors, sartorius, and
+iliopsoas, utilizing a U-net-based BAL framework. Our method selects uncertain
+samples with high density and diversity for manual revision, optimizing for
+maximal similarity to unlabeled instances and minimal similarity to existing
+training data. We assess the accuracy and efficiency using Dice and a proposed
+metric called reduced annotation cost (RAC), respectively. We further evaluate
+the impact of various acquisition rules on BAL performance and design an
+ablation study for effectiveness estimation. Results: In MRI and CT datasets,
+our method was superior or comparable to existing ones, achieving a 0.8\% Dice
+and 1.0\% RAC increase in CT (statistically significant), and a 0.8\% Dice and
+1.1\% RAC increase in MRI (not statistically significant) in volume-wise
+acquisition. Our ablation study indicates that combining density and diversity
+criteria enhances the efficiency of BAL in musculoskeletal segmentation
+compared to using either criterion alone. Conclusion: Our sampling method is
+proven efficient in reducing annotation costs in image segmentation tasks. The
+combination of the proposed method and our BAL framework provides a
+semi-automatic way for efficient annotation of medical image datasets.
+
+
+
+ comment: 15 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ 3D Object Detection from Images for Autonomous Driving: A Survey
+
+
+ 3D object detection from images, one of the fundamental and challenging
+problems in autonomous driving, has received increasing attention from both
+industry and academia in recent years. Benefiting from the rapid development of
+deep learning technologies, image-based 3D detection has achieved remarkable
+progress. Particularly, more than 200 works have studied this problem from 2015
+to 2021, encompassing a broad spectrum of theories, algorithms, and
+applications. However, to date no recent survey exists to collect and organize
+this knowledge. In this paper, we fill this gap in the literature and provide
+the first comprehensive survey of this novel and continuously growing research
+field, summarizing the most commonly used pipelines for image-based 3D
+detection and deeply analyzing each of their components. Additionally, we also
+propose two new taxonomies to organize the state-of-the-art methods into
+different categories, with the intent of providing a more systematic review of
+existing methods and facilitating fair comparisons with future works. In
+retrospect of what has been achieved so far, we also analyze the current
+challenges in the field and discuss future directions for image-based 3D
+detection research.
+
+
+
+ comment: Accepted by T-PAMI
+
+
+
+
+
+
+ ♻ ☆ SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene
+ Graph Generation AAAI
+
+
+ In this paper, we propose a novel model called SGFormer, Semantic Graph
+TransFormer for point cloud-based 3D scene graph generation. The task aims to
+parse a point cloud-based scene into a semantic structural graph, with the core
+challenge of modeling the complex global structure. Existing methods based on
+graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and
+can only propagate information from limited neighboring nodes. In contrast,
+SGFormer uses Transformer layers as the base building block to allow global
+information passing, with two types of newly-designed layers tailored for the
+3D scene graph generation task. Specifically, we introduce the graph embedding
+layer to best utilize the global information in graph edges while maintaining
+comparable computation costs. Furthermore, we propose the semantic injection
+layer to leverage linguistic knowledge from large-scale language model (i.e.,
+ChatGPT), to enhance objects' visual features. We benchmark our SGFormer on the
+established 3DSSG dataset and achieve a 40.94% absolute improvement in
+relationship prediction's R@50 and an 88.36% boost on the subset with complex
+scenes over the state-of-the-art. Our analyses further show SGFormer's
+superiority in the long-tail and zero-shot scenarios. Our source code is
+available at https://github.com/Andy20178/SGFormer.
+
+
+
+ comment: To be published in Thirty-Eighth AAAI Conference on Artificial
+ Intelligence
+
+ Weakly-supervised temporal action localization aims to locate action regions
+and identify action categories in untrimmed videos simultaneously by taking
+only video-level labels as the supervision. Pseudo label generation is a
+promising strategy to solve the challenging problem, but the current methods
+ignore the natural temporal structure of the video that can provide rich
+information to assist such a generation process. In this paper, we propose a
+novel weakly-supervised temporal action localization method by inferring
+salient snippet-feature. First, we design a saliency inference module that
+exploits the variation relationship between temporal neighbor snippets to
+discover salient snippet-features, which can reflect the significant dynamic
+change in the video. Secondly, we introduce a boundary refinement module that
+enhances salient snippet-features through the information interaction unit.
+Then, a discrimination enhancement module is introduced to enhance the
+discriminative nature of snippet-features. Finally, we adopt the refined
+snippet-features to produce high-fidelity pseudo labels, which could be used to
+supervise the training of the action localization network. Extensive
+experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet
+v1.3, demonstrate our proposed method achieves significant improvements
+compared to the state-of-the-art methods.
+
+
+
+
+
+
+
+ ♻ ☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+
+ Ahmed Ghorbel, Wassim Hamidouche, Luce Morin
+
+
+ Recently, the performance of neural image compression (NIC) has steadily
+improved thanks to the last line of study, reaching or outperforming
+state-of-the-art conventional codecs. Despite significant progress, current NIC
+methods still rely on ConvNet-based entropy coding, limited in modeling
+long-range dependencies due to their local connectivity and the increasing
+number of architectural biases and priors, resulting in complex underperforming
+models with high decoding latency. Motivated by the efficiency investigation of
+the Tranformer-based transform coding framework, namely SwinT-ChARM, we propose
+to enhance the latter, as first, with a more straightforward yet effective
+Tranformer-based channel-wise auto-regressive prior model, resulting in an
+absolute image compression transformer (ICT). Through the proposed ICT, we can
+capture both global and local contexts from the latent representations and
+better parameterize the distribution of the quantized latents. Further, we
+leverage a learnable scaling module with a sandwich ConvNeXt-based
+pre-/post-processor to accurately extract more compact latent codes while
+reconstructing higher-quality images. Extensive experimental results on
+benchmark datasets showed that the proposed framework significantly improves
+the trade-off between coding efficiency and decoder complexity over the
+versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec
+SwinT-ChARM. Moreover, we provide model scaling studies to verify the
+computational efficiency of our approach and conduct several objective and
+subjective analyses to bring to the fore the performance gap between the
+adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.
+
+
+
+
+
+
+
+
+ Jacopo Bonato, Francesco Pelosin, Luigi Sabetta, Alessandro Nicolosi
+
+
+ The recent surge of pervasive devices that generate dynamic data streams has
+underscored the necessity for learning systems to adapt continually to data
+distributional shifts. To tackle this challenge, the research community has put
+forth a spectrum of methodologies, including the demanding pursuit of
+class-incremental learning without replay data. In this study, we present MIND,
+a parameter isolation method that aims to significantly enhance the performance
+of replay-free solutions and achieve state-of-the-art results on several widely
+studied datasets. Our approach introduces two main contributions: two
+alternative distillation procedures that significantly improve the efficiency
+of MIND increasing the accumulated knowledge of each sub-network, and the
+optimization of the BachNorm layers across tasks inside the sub-networks.
+Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free
+Class-Incremental learning (with an increment in classification accuracy of
+approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx.
++40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each
+contribution to demonstrate its impact on performance improvement. Our results
+showcase the superior performance of MIND indicating its potential for
+addressing the challenges posed by Class-incremental and Domain-Incremental
+learning in resource-constrained environments.
+
+
+
+ comment: Accepted at the 38th AAAI Conference on Artificial Intelligence
+
+
+
+
+
+
+
+ Alexis Goujon, Sebastian Neumayer, Michael Unser
+
+
+ We propose to learn non-convex regularizers with a prescribed upper bound on
+their weak-convexity modulus. Such regularizers give rise to variational
+denoisers that minimize a convex energy. They rely on few parameters (less than
+15,000) and offer a signal-processing interpretation as they mimic handcrafted
+sparsity-promoting regularizers. Through numerical experiments, we show that
+such denoisers outperform convex-regularization methods as well as the popular
+BM3D denoiser. Additionally, the learned regularizer can be deployed to solve
+inverse problems with iterative schemes that provably converge. For both CT and
+MRI reconstruction, the regularizer generalizes well and offers an excellent
+tradeoff between performance, number of parameters, guarantees, and
+interpretability when compared to other data-driven approaches.
+
+
+
+
+
+
+
+ ♻ ☆ RS-Corrector: Correcting the Racial Stereotypes in Latent Diffusion
+ Models
+
+
+ Recent text-conditioned image generation models have demonstrated an
+exceptional capacity to produce diverse and creative imagery with high visual
+quality. However, when pre-trained on billion-sized datasets randomly collected
+from the Internet, where potential biased human preferences exist, these models
+tend to produce images with common and recurring stereotypes, particularly for
+certain racial groups. In this paper, we conduct an initial analysis of the
+publicly available Stable Diffusion model and its derivatives, highlighting the
+presence of racial stereotypes. These models often generate distorted or biased
+images for certain racial groups, emphasizing stereotypical characteristics. To
+address these issues, we propose a framework called "RS-Corrector", designed to
+establish an anti-stereotypical preference in the latent space and update the
+latent code for refined generated results. The correction process occurs during
+the inference stage without requiring fine-tuning of the original model.
+Extensive empirical evaluations demonstrate that the introduced \themodel
+effectively corrects the racial stereotypes of the well-trained Stable
+Diffusion model while leaving the original model unchanged.
+
+
+
+ comment: 16 pages, 15 figures, conference
+
+
+
+
+
+
+ ♻ ☆ Data Roaming and Quality Assessment for Composed Image Retrieval AAAI 2024
+
+
+ The task of Composed Image Retrieval (CoIR) involves queries that combine
+image and text modalities, allowing users to express their intent more
+effectively. However, current CoIR datasets are orders of magnitude smaller
+compared to other vision and language (V&L) datasets. Additionally, some of
+these datasets have noticeable issues, such as queries containing redundant
+modalities. To address these shortcomings, we introduce the Large Scale
+Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times
+larger than existing ones. Pre-training on our LaSCo, shows a noteworthy
+improvement in performance, even in zero-shot. Furthermore, we propose a new
+approach for analyzing CoIR datasets and methods, which detects modality
+redundancy or necessity, in queries. We also introduce a new CoIR baseline, the
+Cross-Attention driven Shift Encoder (CASE). This baseline allows for early
+fusion of modalities using a cross-attention module and employs an additional
+auxiliary task during training. Our experiments demonstrate that this new
+baseline outperforms the current state-of-the-art methods on established
+benchmarks like FashionIQ and CIRR.
+
+
+
+ comment: Camera Ready version for AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data
+ Setting WACV 2024
+
+
+ Deep learning models are known to suffer from the problem of bias, and
+researchers have been exploring methods to address this issue. However, most of
+these methods require prior knowledge of the bias and are not always practical.
+In this paper, we focus on a more practical setting with no prior information
+about the bias. Generally, in this setting, there are a large number of
+bias-aligned samples that cause the model to produce biased predictions and a
+few bias-conflicting samples that do not conform to the bias. If the training
+data is limited, the influence of the bias-aligned samples may become even
+stronger on the model predictions, and we experimentally demonstrate that
+existing debiasing techniques suffer severely in such cases. In this paper, we
+examine the effects of unknown bias in small dataset regimes and present a
+novel approach to mitigate this issue. The proposed approach directly addresses
+the issue of the extremely low occurrence of bias-conflicting samples in
+limited data settings through the synthesis of hybrid samples that can be used
+to reduce the effect of bias. We perform extensive experiments on several
+benchmark datasets and experimentally demonstrate the effectiveness of our
+proposed approach in addressing any unknown bias in the presence of limited
+data. Specifically, our approach outperforms the vanilla, LfF, LDD, and DebiAN
+debiasing methods by absolute margins of 10.39%, 9.08%, 8.07%, and 9.67% when
+only 10% of the Corrupted CIFAR-10 Type 1 dataset is available with a
+bias-conflicting sample ratio of 0.05.
+
+
+ In rapidly-evolving domains such as autonomous driving, the use of multiple
+sensors with different modalities is crucial to ensure high operational
+precision and stability. To correctly exploit the provided information by each
+sensor in a single common frame, it is essential for these sensors to be
+accurately calibrated. In this paper, we leverage the ability of Neural
+Radiance Fields (NeRF) to represent different sensors modalities in a common
+volumetric representation to achieve robust and accurate spatio-temporal sensor
+calibration. By designing a partitioning approach based on the visible part of
+the scene for each sensor, we formulate the calibration problem using only the
+overlapping areas. This strategy results in a more robust and accurate
+calibration that is less prone to failure. We demonstrate that our approach
+works on outdoor urban scenes by validating it on multiple established driving
+datasets. Results show that our method is able to get better accuracy and
+robustness compared to existing methods.
+
+
+
+ comment: Paper + Supplementary, under review. Project page:
+ https://qherau.github.io/SOAC/
+
+ Real-world image de-weathering aims at removing various undesirable
+weather-related artifacts. Owing to the impossibility of capturing image pairs
+concurrently, existing real-world de-weathering datasets often exhibit
+inconsistent illumination, position, and textures between the ground-truth
+images and the input degraded images, resulting in imperfect supervision. Such
+non-ideal supervision negatively affects the training process of learning-based
+de-weathering methods. In this work, we attempt to address the problem with a
+unified solution for various inconsistencies. Specifically, inspired by
+information bottleneck theory, we first develop a Consistent Label Constructor
+(CLC) to generate a pseudo-label as consistent as possible with the input
+degraded image while removing most weather-related degradations. In particular,
+multiple adjacent frames of the current input are also fed into CLC to enhance
+the pseudo-label. Then we combine the original imperfect labels and
+pseudo-labels to jointly supervise the de-weathering model by the proposed
+Information Allocation Strategy (IAS). During testing, only the de-weathering
+model is used for inference. Experiments on two real-world de-weathering
+datasets show that our method helps existing de-weathering models achieve
+better performance. Codes are available at
+https://github.com/1180300419/imperfect-deweathering.
+
+
+ Reconstructing 3D objects from extremely sparse views is a long-standing and
+challenging problem. While recent techniques employ image diffusion models for
+generating plausible images at novel viewpoints or for distilling pre-trained
+diffusion priors into 3D representations using score distillation sampling
+(SDS), these methods often struggle to simultaneously achieve high-quality,
+consistent, and detailed results for both novel-view synthesis (NVS) and
+geometry. In this work, we present Sparse3D, a novel 3D reconstruction method
+tailored for sparse view inputs. Our approach distills robust priors from a
+multiview-consistent diffusion model to refine a neural radiance field.
+Specifically, we employ a controller that harnesses epipolar features from
+input views, guiding a pre-trained diffusion model, such as Stable Diffusion,
+to produce novel-view images that maintain 3D consistency with the input. By
+tapping into 2D priors from powerful image diffusion models, our integrated
+model consistently delivers high-quality results, even when faced with
+open-world objects. To address the blurriness introduced by conventional SDS,
+we introduce the category-score distillation sampling (C-SDS) to enhance
+detail. We conduct experiments on CO3DV2 which is a multi-view dataset of
+real-world objects. Both quantitative and qualitative evaluations demonstrate
+that our approach outperforms previous state-of-the-art works on the metrics
+regarding NVS and geometry reconstruction.
+
+
+
+
+
+
+
+ ♻ ☆ Rich Action-semantic Consistent Knowledge for Early Action Prediction
+
+
+
+
+
+
+
+
+ Xiaoli Liu, Jianqin Yin, Di Guo, Huaping Liu
+
+
+ Early action prediction (EAP) aims to recognize human actions from a part of
+action execution in ongoing videos, which is an important task for many
+practical applications. Most prior works treat partial or full videos as a
+whole, ignoring rich action knowledge hidden in videos, i.e., semantic
+consistencies among different partial videos. In contrast, we partition
+original partial or full videos to form a new series of partial videos and mine
+the Action-Semantic Consistent Knowledge (ASCK) among these new partial videos
+evolving in arbitrary progress levels. Moreover, a novel Rich Action-semantic
+Consistent Knowledge network (RACK) under the teacher-student framework is
+proposed for EAP. Firstly, we use a two-stream pre-trained model to extract
+features of videos. Secondly, we treat the RGB or flow features of the partial
+videos as nodes and their action semantic consistencies as edges. Next, we
+build a bi-directional semantic graph for the teacher network and a
+single-directional semantic graph for the student network to model rich ASCK
+among partial videos. The MSE and MMD losses are incorporated as our
+distillation loss to enrich the ASCK of partial videos from the teacher to the
+student network. Finally, we obtain the final prediction by summering the
+logits of different subnetworks and applying a softmax layer. Extensive
+experiments and ablative studies have been conducted, demonstrating the
+effectiveness of modeling rich ASCK for EAP. With the proposed RACK, we have
+achieved state-of-the-art performance on three benchmarks. The code is
+available at https://github.com/lily2lab/RACK.git.
+
+
+
+ comment: Accepted by IEEE TIP,15pages
+
+
+
+
+
+
+ ♻ ☆ CoIE: Chain-of-Instruct Editing for Multi-Attribute Face Manipulation
+
+
+ Current text-to-image editing models often encounter challenges with smoothly
+manipulating multiple attributes using a single instruction. Taking inspiration
+from the Chain-of-Thought prompting technique utilized in language models, we
+present an innovative concept known as Chain-of-Instruct Editing (CoIE), which
+enhances the capabilities of these models through step-by-step editing using a
+series of instructions. In particular, in the context of face manipulation, we
+leverage the contextual learning abilities of a pretrained Large Language Model
+(LLM), such as GPT-4, to generate a sequence of instructions from the original
+input, utilizing a purpose-designed 1-shot template. To further improve the
+precision of each editing step, we conduct fine-tuning on the editing models
+using our self-constructed instruction-guided face editing dataset,
+Instruct-CelebA. And additionally, we incorporate a super-resolution module to
+mitigate the adverse effects of editability and quality degradation.
+Experimental results across various challenging cases confirm the significant
+boost in multi-attribute facial image manipulation using chain-of-instruct
+editing. This is evident in enhanced editing success rates, measured by CLIPSim
+and Coverage metrics, improved by 17.86% and 85.45% respectively, and
+heightened controllability indicated by Preserve L1 and Quality metrics,
+improved by 11.58% and 4.93% respectively.
+
+
+ This study introduces an efficient and effective method, MeDM, that utilizes
+pre-trained image Diffusion Models for video-to-video translation with
+consistent temporal flow. The proposed framework can render videos from scene
+position information, such as a normal G-buffer, or perform text-guided editing
+on videos captured in real-world scenarios. We employ explicit optical flows to
+construct a practical coding that enforces physical constraints on generated
+frames and mediates independent frame-wise scores. By leveraging this coding,
+maintaining temporal consistency in the generated videos can be framed as an
+optimization problem with a closed-form solution. To ensure compatibility with
+Stable Diffusion, we also suggest a workaround for modifying observation-space
+scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning
+or test-time optimization of the Diffusion Models. Through extensive
+qualitative, quantitative, and subjective experiments on various benchmarks,
+the study demonstrates the effectiveness and superiority of the proposed
+approach. Our project page can be found at https://medm2023.github.io
+
+
+
+ comment: Accepted as a conference paper in AAAI 2024. Project page:
+ https://medm2023.github.io
+
+
+
+
+
+
+ ♻ ☆ Scalable Geometric Fracture Assembly via Co-creation Space among
+ Assemblers AAAI2024
+
+
+ Geometric fracture assembly presents a challenging practical task in
+archaeology and 3D computer vision. Previous methods have focused solely on
+assembling fragments based on semantic information, which has limited the
+quantity of objects that can be effectively assembled. Therefore, there is a
+need to develop a scalable framework for geometric fracture assembly without
+relying on semantic information. To improve the effectiveness of assembling
+geometric fractures without semantic information, we propose a co-creation
+space comprising several assemblers capable of gradually and unambiguously
+assembling fractures. Additionally, we introduce a novel loss function, i.e.,
+the geometric-based collision loss, to address collision issues during the
+fracture assembly process and enhance the results. Our framework exhibits
+better performance on both PartNet and Breaking Bad datasets compared to
+existing state-of-the-art frameworks. Extensive experiments and quantitative
+comparisons demonstrate the effectiveness of our proposed framework, which
+features linear computational complexity, enhanced abstraction, and improved
+generalization. Our code is publicly available at
+https://github.com/Ruiyuan-Zhang/CCS.
+
+
+ Image captioning models are known to perpetuate and amplify harmful societal
+bias in the training set. In this work, we aim to mitigate such gender bias in
+image captioning models. While prior work has addressed this problem by forcing
+models to focus on people to reduce gender misclassification, it conversely
+generates gender-stereotypical words at the expense of predicting the correct
+gender. From this observation, we hypothesize that there are two types of
+gender bias affecting image captioning models: 1) bias that exploits context to
+predict gender, and 2) bias in the probability of generating certain (often
+stereotypical) words because of gender. To mitigate both types of gender
+biases, we propose a framework, called LIBRA, that learns from synthetically
+biased samples to decrease both types of biases, correcting gender
+misclassification and changing gender-stereotypical words to more neutral ones.
+Code is available at https://github.com/rebnej/LIBRA.
+
+
+
+ comment: CVPR 2023
+
+
+
+
+
+
+ ♻ ☆ RED-PSM: Regularization by Denoising of Partially Separable Models for
+ Dynamic Imaging
+
+
+
+
+
+
+
+
+ Berk Iskender, Marc L. Klasky, Yoram Bresler
+
+
+ Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at
+each time instant using its undersampled measurements. In particular, in the
+case of dynamic tomography, only a single projection at a single view angle may
+be available at a time, making the problem severely ill-posed. In this work, we
+propose an approach, RED-PSM, which combines for the first time two powerful
+techniques to address this challenging imaging problem. The first, are
+partially separable models, which have been used to efficiently introduce a
+low-rank prior for the spatio-temporal object. The second is the recent
+\textit{Regularization by Denoising (RED)}, which provides a flexible framework
+to exploit the impressive performance of state-of-the-art image denoising
+algorithms, for various inverse problems. We propose a partially separable
+objective with RED and a computationally efficient and scalable optimization
+scheme with variable splitting and ADMM. Theoretical analysis proves the
+convergence of our objective to a value corresponding to a stationary point
+satisfying the first-order optimality conditions. Convergence is accelerated by
+a particular projection-domain-based initialization. We demonstrate the
+performance and computational improvements of our proposed RED-PSM with a
+learned image denoiser by comparing it to a recent deep-prior-based method
+known as TD-DIP. Although the main focus is on dynamic tomography, we also show
+performance advantages of RED-PSM in a cardiac dynamic MRI setting.
+
+
+
+
+
+
+
+ ♻ ☆ AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and
+ Reconstruction with Canonical Score Distillation
+
+
+
+
+
+
+
+
+ Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He
+
+
+ Text-to-3D model adaptations have advanced static 3D model quality, but
+sequential 3D model generation, particularly for animatable objects with large
+motions, is still scarce. Our work proposes AnimatableDreamer, a text-to-4D
+generation framework capable of generating diverse categories of non-rigid
+objects while adhering to the object motions extracted from a monocular video.
+At its core, AnimatableDreamer is equipped with our novel optimization design
+dubbed Canonical Score Distillation (CSD), which simplifies the generation
+dimension from 4D to 3D by denoising over different frames in the time-varying
+camera spaces while conducting the distillation process in a unique canonical
+space shared per video. Concretely, CSD ensures that score gradients
+back-propagate to the canonical space through differentiable warping, hence
+guaranteeing the time-consistent generation and maintaining morphological
+plausibility across different poses. By lifting the 3D generator to 4D with
+warping functions, AnimatableDreamer offers a novel perspective on non-rigid 3D
+model generation and reconstruction. Besides, with inductive knowledge from a
+multi-view consistent diffusion model, CSD regularizes reconstruction from
+novel views, thus cyclically enhancing the generation process. Extensive
+experiments demonstrate the capability of our method in generating
+high-flexibility text-guided 3D models from the monocular video, while also
+showing improved reconstruction performance over typical non-rigid
+reconstruction methods. Project page https://AnimatableDreamer.github.io.
+
+
+
+
+
+
+
+
+ Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique
+
+
+ In this paper, we investigate the vulnerability of MDE to adversarial
+patches. We propose a novel \underline{S}tealthy \underline{A}dversarial
+\underline{A}ttacks on \underline{M}DE (SAAM) that compromises MDE by either
+corrupting the estimated distance or causing an object to seamlessly blend into
+its surroundings. Our experiments, demonstrate that the designed stealthy patch
+successfully causes a DNN-based MDE to misestimate the depth of objects. In
+fact, our proposed adversarial patch achieves a significant 60\% depth error
+with 99\% ratio of the affected region. Importantly, despite its adversarial
+nature, the patch maintains a naturalistic appearance, making it inconspicuous
+to human observers. We believe that this work sheds light on the threat of
+adversarial attacks in the context of MDE on edge devices. We hope it raises
+awareness within the community about the potential real-life harm of such
+attacks and encourages further research into developing more robust and
+adaptive defense mechanisms.
+
+
+
+
+
+
+
+ ♻ ☆ Rethinking the Up-Sampling Operations in CNN-based Generative Network
+ for Generalizable Deepfake Detection
+
+
+ Recently, the proliferation of highly realistic synthetic images, facilitated
+through a variety of GANs and Diffusions, has significantly heightened the
+susceptibility to misuse. While the primary focus of deepfake detection has
+traditionally centered on the design of detection algorithms, an investigative
+inquiry into the generator architectures has remained conspicuously absent in
+recent years. This paper contributes to this lacuna by rethinking the
+architectures of CNN-based generators, thereby establishing a generalized
+representation of synthetic artifacts. Our findings illuminate that the
+up-sampling operator can, beyond frequency-based artifacts, produce generalized
+forgery artifacts. In particular, the local interdependence among image pixels
+caused by upsampling operators is significantly demonstrated in synthetic
+images generated by GAN or diffusion. Building upon this observation, we
+introduce the concept of Neighboring Pixel Relationships(NPR) as a means to
+capture and characterize the generalized structural artifacts stemming from
+up-sampling operations. A comprehensive analysis is conducted on an open-world
+dataset, comprising samples generated by \tft{28 distinct generative models}.
+This analysis culminates in the establishment of a novel state-of-the-art
+performance, showcasing a remarkable \tft{11.6\%} improvement over existing
+methods. The code is available at
+https://github.com/chuangchuangtan/NPR-DeepfakeDetection.
+
+
+
+ comment: 10 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ A Survey of Reasoning with Foundation Models: Concepts, Methodologies,
+ and Outlook
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+
+
+
+
+
+ ♻ ☆ 3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion
+
+
+
+
+
+
+
+
+ Yu-Jhe Li, Tao Xu, Ji Hou, Bichen Wu, Xiaoliang Dai, Albert Pumarola, Peizhao Zhang, Peter Vajda, Kris Kitani
+
+
+ We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs
+(NeRFs that generate 3D objects given input latent code). Recent works such as
+DreamFusion and Magic3D have shown great success in generating 3D content using
+NeRFs and text prompts, but the current approach of optimizing a NeRF for every
+text prompt is 1) extremely time-consuming and 2) often leads to low-resolution
+outputs. To address these challenges, we propose a novel method named
+3D-CLFusion which leverages the pre-trained latent-based NeRFs and performs
+fast 3D content creation in less than a minute. In particular, we introduce a
+latent diffusion prior network for learning the w latent from the input CLIP
+text/image embeddings. This pipeline allows us to produce the w latent without
+further optimization during inference and the pre-trained NeRF is able to
+perform multi-view high-resolution 3D synthesis based on the latent. We note
+that the novelty of our model lies in that we introduce contrastive learning
+during training the diffusion prior which enables the generation of the valid
+view-invariant latent code. We demonstrate through experiments the
+effectiveness of our proposed view-invariant diffusion process for fast
+text-to-3D creation, e.g., 100 times faster than DreamFusion. We note that our
+model is able to serve as the role of a plug-and-play tool for text-to-3D with
+pre-trained NeRFs.
+
+
+
+ comment: 15 pages
+
+
+
+
+
+
+ ♻ ☆ Masked and Permuted Implicit Context Learning for Scene Text Recognition
+
+
+ Scene Text Recognition (STR) is difficult because of the variations in text
+styles, shapes, and backgrounds. Though the integration of linguistic
+information enhances models' performance, existing methods based on either
+permuted language modeling (PLM) or masked language modeling (MLM) have their
+pitfalls. PLM's autoregressive decoding lacks foresight into subsequent
+characters, while MLM overlooks inter-character dependencies. Addressing these
+problems, we propose a masked and permuted implicit context learning network
+for STR, which unifies PLM and MLM within a single decoder, inheriting the
+advantages of both approaches. We utilize the training procedure of PLM, and to
+integrate MLM, we incorporate word length information into the decoding process
+and replace the undetermined characters with mask tokens. Besides, perturbation
+training is employed to train a more robust model against potential length
+prediction errors. Our empirical evaluations demonstrate the performance of our
+model. It not only achieves superior performance on the common benchmarks but
+also achieves a substantial improvement of $9.1\%$ on the more challenging
+Union14M-Benchmark.
+
+
+
+
+
+
+
+
+ Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-yi Lee, Jyh-Shing Roger Jang
+
+
+ Audio-visual synchronization aims to determine whether the mouth movements
+and speech in the video are synchronized. VocaLiST reaches state-of-the-art
+performance by incorporating multimodal Transformers to model audio-visual
+interact information. However, it requires high computing resources, making it
+impractical for real-world applications. This paper proposed an MTDVocaLiST
+model, which is trained by our proposed multimodal Transformer distillation
+(MTD) loss. MTD loss enables MTDVocaLiST model to deeply mimic the
+cross-attention distribution and value-relation in the Transformer of VocaLiST.
+Additionally, we harness uncertainty weighting to fully exploit the interaction
+information across all layers. Our proposed method is effective in two aspects:
+From the distillation method perspective, MTD loss outperforms other strong
+distillation baselines. From the distilled model's performance perspective: 1)
+MTDVocaLiST outperforms similar-size SOTA models, SyncNet, and Perfect Match
+models by 15.65% and 3.35%; 2) MTDVocaLiST reduces the model size of VocaLiST
+by 83.52%, yet still maintaining similar performance.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Devignet: High-Resolution Vignetting Removal via a Dual Aggregated
+ Fusion Transformer With Adaptive Channel Expansion AAAI
+
+
+ Vignetting commonly occurs as a degradation in images resulting from factors
+such as lens design, improper lens hood usage, and limitations in camera
+sensors. This degradation affects image details, color accuracy, and presents
+challenges in computational photography. Existing vignetting removal algorithms
+predominantly rely on ideal physics assumptions and hand-crafted parameters,
+resulting in the ineffective removal of irregular vignetting and suboptimal
+results. Moreover, the substantial lack of real-world vignetting datasets
+hinders the objective and comprehensive evaluation of vignetting removal. To
+address these challenges, we present Vigset, a pioneering dataset for
+vignetting removal. Vigset includes 983 pairs of both vignetting and
+vignetting-free high-resolution ($5340\times3697$) real-world images under
+various conditions. In addition, We introduce DeVigNet, a novel frequency-aware
+Transformer architecture designed for vignetting removal. Through the Laplacian
+Pyramid decomposition, we propose the Dual Aggregated Fusion Transformer to
+handle global features and remove vignetting in the low-frequency domain.
+Additionally, we propose the Adaptive Channel Expansion Module to enhance
+details in the high-frequency domain. The experiments demonstrate that the
+proposed model outperforms existing state-of-the-art methods. The code, models,
+and dataset are available at \url{https://github.com/CXH-Research/DeVigNet}.
+
+
+
+ comment: Accepted by AAAI Conference on Artificial Intelligence 2024 (AAAI
+ 2024)
+
+
+
+
+
+
+ ♻ ☆ Personalization as a Shortcut for Few-Shot Backdoor Attack against
+ Text-to-Image Diffusion Models AAAI 2024
+
+
+
+
+
+
+
+
+ Yihao Huang, Felix Juefei-Xu, Qing Guo, Jie Zhang, Yutong Wu, Ming Hu, Tianlin Li, Geguang Pu, Yang Liu
+
+
+ Although recent personalization methods have democratized high-resolution
+image synthesis by enabling swift concept acquisition with minimal examples and
+lightweight computation, they also present an exploitable avenue for high
+accessible backdoor attacks. This paper investigates a critical and unexplored
+aspect of text-to-image (T2I) diffusion models - their potential vulnerability
+to backdoor attacks via personalization. Our study focuses on a zero-day
+backdoor vulnerability prevalent in two families of personalization methods,
+epitomized by Textual Inversion and DreamBooth.Compared to traditional backdoor
+attacks, our proposed method can facilitate more precise, efficient, and easily
+accessible attacks with a lower barrier to entry. We provide a comprehensive
+review of personalization in T2I diffusion models, highlighting the operation
+and exploitation potential of this backdoor vulnerability. To be specific, by
+studying the prompt processing of Textual Inversion and DreamBooth, we have
+devised dedicated backdoor attacks according to the different ways of dealing
+with unseen tokens and analyzed the influence of triggers and concept images on
+the attack effect. Through comprehensive empirical study, we endorse the
+utilization of the nouveau-token backdoor attack due to its impressive
+effectiveness, stealthiness, and integrity, markedly outperforming the
+legacy-token backdoor attack.
+
+
+ Reconstructing a dynamic human with loose clothing is an important but
+difficult task. To address this challenge, we propose a method named DLCA-Recon
+to create human avatars from monocular videos. The distance from loose clothing
+to the underlying body rapidly changes in every frame when the human freely
+moves and acts. Previous methods lack effective geometric initialization and
+constraints for guiding the optimization of deformation to explain this
+dramatic change, resulting in the discontinuous and incomplete reconstruction
+surface. To model the deformation more accurately, we propose to initialize an
+estimated 3D clothed human in the canonical space, as it is easier for
+deformation fields to learn from the clothed human than from SMPL. With both
+representations of explicit mesh and implicit SDF, we utilize the physical
+connection information between consecutive frames and propose a dynamic
+deformation field (DDF) to optimize deformation fields. DDF accounts for
+contributive forces on loose clothing to enhance the interpretability of
+deformations and effectively capture the free movement of loose clothing.
+Moreover, we propagate SMPL skinning weights to each individual and refine pose
+and skinning weights during the optimization to improve skinning
+transformation. Based on more reasonable initialization and DDF, we can
+simulate real-world physics more accurately. Extensive experiments on public
+and our own datasets validate that our method can produce superior results for
+humans with loose clothing compared to the SOTA methods.
+
+
+
+
+
+
+
+ ♻ ☆ BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
+
+
+ The recently emerging text-to-motion advances have spired numerous attempts
+for convenient and interactive human motion generation. Yet, existing methods
+are largely limited to generating body motions only without considering the
+rich two-hand motions, let alone handling various conditions like body dynamics
+or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal
+dataset for two-hand motion generation. Our dataset includes accurate motion
+tracking for the human body and hands and provides pair-wised finger-level hand
+annotations and body descriptions. We further provide a strong baseline method,
+BOTH2Hands, for the novel task: generating vivid two-hand motions from both
+implicit body dynamics and explicit text prompts. We first warm up two parallel
+body-to-hand and text-to-hand diffusion models and then utilize the
+cross-attention transformer for motion blending. Extensive experiments and
+cross-validations demonstrate the effectiveness of our approach and dataset for
+generating convincing two-hand motions from the hybrid body-and-textual
+conditions. Our dataset and code will be disseminated to the community for
+future research.
+
+
+
+
+
+
+
+ ♻ ☆ MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention
+
+
+ Efficiently capturing multi-scale information and building long-range
+dependencies among pixels are essential for medical image segmentation because
+of the various sizes and shapes of the lesion regions or organs. In this paper,
+we present Multi-scale Cross-axis Attention (MCA) to solve the above
+challenging issues based on the efficient axial attention. Instead of simply
+connecting axial attention along the horizontal and vertical directions
+sequentially, we propose to calculate dual cross attentions between two
+parallel axial attentions to capture global information better. To process the
+significant variations of lesion regions or organs in individual sizes and
+shapes, we also use multiple convolutions of strip-shape kernels with different
+kernel sizes in each axial attention path to improve the efficiency of the
+proposed MCA in encoding spatial information. We build the proposed MCA upon
+the MSCAN backbone, yielding our network, termed MCANet. Our MCANet with only
+4M+ parameters performs even better than most previous works with heavy
+backbones (e.g., Swin Transformer) on four challenging tasks, including skin
+lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation,
+and polyp segmentation. Code is available at
+https://github.com/haoshao-nku/medical_seg.
+
+
+
+
+
+
+
+ ♻ ☆ Temporal Conditioning Spiking Latent Variable Models of the Neural
+ Response to Natural Visual Scenes NeurIPS 2023
+
+
+ Developing computational models of neural response is crucial for
+understanding sensory processing and neural computations. Current
+state-of-the-art neural network methods use temporal filters to handle temporal
+dependencies, resulting in an unrealistic and inflexible processing paradigm.
+Meanwhile, these methods target trial-averaged firing rates and fail to capture
+important features in spike trains. This work presents the temporal
+conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural
+response to natural visual stimuli. We use spiking neurons to produce spike
+outputs that directly match the recorded trains. This approach helps to avoid
+losing information embedded in the original spike trains. We exclude the
+temporal dimension from the model parameter space and introduce a temporal
+conditioning operation to allow the model to adaptively explore and exploit
+temporal dependencies in stimuli sequences in a {\it natural paradigm}. We show
+that TeCoS-LVM models can produce more realistic spike activities and
+accurately fit spike statistics than powerful alternatives. Additionally,
+learned TeCoS-LVM models can generalize well to longer time scales. Overall,
+while remaining computationally tractable, our model effectively captures key
+features of neural coding systems. It thus provides a useful tool for building
+accurate predictive computational accounts for various sensory perception
+circuits.
+
+
+ Federated learning with noisy labels (F-LNL) aims at seeking an optimal
+server model via collaborative distributed learning by aggregating multiple
+client models trained with local noisy or clean samples. On the basis of a
+federated learning framework, recent advances primarily adopt label noise
+filtering to separate clean samples from noisy ones on each client, thereby
+mitigating the negative impact of label noise. However, these prior methods do
+not learn noise filters by exploiting knowledge across all clients, leading to
+sub-optimal and inferior noise filtering performance and thus damaging training
+stability. In this paper, we present FedDiv to tackle the challenges of F-LNL.
+Specifically, we propose a global noise filter called Federated Noise Filter
+for effectively identifying samples with noisy labels on every client, thereby
+raising stability during local training sessions. Without sacrificing data
+privacy, this is achieved by modeling the global distribution of label noise
+across all clients. Then, in an effort to make the global model achieve higher
+performance, we introduce a Predictive Consistency based Sampler to identify
+more credible local data for local model training, thus preventing noise
+memorization and further boosting the training stability. Extensive experiments
+on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that \texttt{FedDiv}
+achieves superior performance over state-of-the-art F-LNL methods under
+different label noise settings for both IID and non-IID data partitions. Source
+code is publicly available at https://github.com/lijichang/FLNL-FedDiv.
+
+
+
+ comment: To appear in AAAI-2024; correct minor typos
+
+ In recent years, the task of learned point cloud compression has gained
+prominence. An important type of point cloud, the spinning LiDAR point cloud,
+is generated by spinning LiDAR on vehicles. This process results in numerous
+circular shapes and azimuthal angle invariance features within the point
+clouds. However, these two features have been largely overlooked by previous
+methodologies. In this paper, we introduce a model-agnostic method called
+Spherical-Coordinate-based learned Point cloud compression (SCP), designed to
+leverage the aforementioned features fully. Additionally, we propose a
+multi-level Octree for SCP to mitigate the reconstruction error for distant
+areas within the Spherical-coordinate-based Octree. SCP exhibits excellent
+universality, making it applicable to various learned point cloud compression
+techniques. Experimental results demonstrate that SCP surpasses previous
+state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
+
+
+ Events describe happenings in our world that are of importance. Naturally,
+understanding events mentioned in multimedia content and how they are related
+forms an important way of comprehending our world. Existing literature can
+infer if events across textual and visual (video) domains are identical (via
+grounding) and thus, on the same semantic level. However, grounding fails to
+capture the intricate cross-event relations that exist due to the same events
+being referred to on many semantic levels. For example, in Figure 1, the
+abstract event of "war" manifests at a lower semantic level through subevents
+"tanks firing" (in video) and airplane "shot" (in text), leading to a
+hierarchical, multimodal relationship between the events.
+ In this paper, we propose the task of extracting event hierarchies from
+multimodal (video and text) data to capture how the same event manifests itself
+in different modalities at different semantic levels. This reveals the
+structure of events and is critical to understanding them. To support research
+on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve)
+dataset. Unlike prior video-language datasets, MultiHiEve is composed of news
+video-article pairs, which makes it rich in event hierarchies. We densely
+annotate a part of the dataset to construct the test benchmark. We show the
+limitations of state-of-the-art unimodal and multimodal baselines on this task.
+Further, we address these limitations via a new weakly supervised model,
+leveraging only unannotated video-article pairs from MultiHiEve. We perform a
+thorough evaluation of our proposed method which demonstrates improved
+performance on this task and highlight opportunities for future research.
+
+
+
+
+
+
+
+
+ Chaojian Li, Bichen Wu, Peter Vajda, Yingyan, Lin
+
+
+ Neural Radiance Field (NeRF) has emerged as a leading technique for novel
+view synthesis, owing to its impressive photorealistic reconstruction and
+rendering capability. Nevertheless, achieving real-time NeRF rendering in
+large-scale scenes has presented challenges, often leading to the adoption of
+either intricate baked mesh representations with a substantial number of
+triangles or resource-intensive ray marching in baked representations. We
+challenge these conventions, observing that high-quality geometry, represented
+by meshes with substantial triangles, is not necessary for achieving
+photorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF
+representation that includes a low-quality mesh, a view-dependent displacement
+map, and a compressed NeRF model. This design effectively harnesses the
+capabilities of existing graphics hardware, thus enabling real-time NeRF
+rendering on edge devices. Leveraging a highly-optimized WebGL-based rendering
+framework, our proposed MixRT attains real-time rendering speeds on edge
+devices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),
+better rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360
+datasets), and a smaller storage size (less than 80% compared to
+state-of-the-art methods).
+
+
+
+ comment: Accepted by 3DV'24. Project Page: https://licj15.github.io/MixRT/
+
+
+
+
+
+
+ ♻ ☆ ReShader: View-Dependent Highlights for Single Image View-Synthesis SIGGRAPH
+
+
+ In recent years, novel view synthesis from a single image has seen
+significant progress thanks to the rapid advancements in 3D scene
+representation and image inpainting techniques. While the current approaches
+are able to synthesize geometrically consistent novel views, they often do not
+handle the view-dependent effects properly. Specifically, the highlights in
+their synthesized images usually appear to be glued to the surfaces, making the
+novel views unrealistic. To address this major problem, we make a key
+observation that the process of synthesizing novel views requires changing the
+shading of the pixels based on the novel camera, and moving them to appropriate
+locations. Therefore, we propose to split the view synthesis process into two
+independent tasks of pixel reshading and relocation. During the reshading
+process, we take the single image as the input and adjust its shading based on
+the novel camera. This reshaded image is then used as the input to an existing
+view synthesis method to relocate the pixels and produce the final novel view
+image. We propose to use a neural network to perform reshading and generate a
+large set of synthetic input-reshaded pairs to train our network. We
+demonstrate that our approach produces plausible novel view images with
+realistic moving highlights on a variety of real world scenes.
+
+
+
+ comment: SIGGRAPH Asia 2023. Project page at
+ https://people.engr.tamu.edu/nimak/Papers/SIGAsia2023_Reshader/index.html and
+ video at https://www.youtube.com/watch?v=XW-tl48D3Ok
+
+
+
+
+
+
+ ♻ ☆ CiT-Net: Convolutional Neural Networks Hand in Hand with Vision
+ Transformers for Medical Image Segmentation
+
+
+
+
+
+
+
+
+ Tao Lei, Rui Sun, Xuan Wang, Yingbo Wang, Xi He, Asoke Nandi
+
+
+ The hybrid architecture of convolutional neural networks (CNNs) and
+Transformer are very popular for medical image segmentation. However, it
+suffers from two challenges. First, although a CNNs branch can capture the
+local image features using vanilla convolution, it cannot achieve adaptive
+feature learning. Second, although a Transformer branch can capture the global
+features, it ignores the channel and cross-dimensional self-attention,
+resulting in a low segmentation accuracy on complex-content images. To address
+these challenges, we propose a novel hybrid architecture of convolutional
+neural networks hand in hand with vision Transformers (CiT-Net) for medical
+image segmentation. Our network has two advantages. First, we design a dynamic
+deformable convolution and apply it to the CNNs branch, which overcomes the
+weak feature extraction ability due to fixed-size convolution kernels and the
+stiff design of sharing kernel parameters among different inputs. Second, we
+design a shifted-window adaptive complementary attention module and a compact
+convolutional projection. We apply them to the Transformer branch to learn the
+cross-dimensional long-term dependency for medical images. Experimental results
+show that our CiT-Net provides better medical image segmentation results than
+popular SOTA methods. Besides, our CiT-Net requires lower parameters and less
+computational costs and does not rely on pre-training. The code is publicly
+available at https://github.com/SR0920/CiT-Net.
+
+
+ The hybrid architecture of convolution neural networks (CNN) and Transformer
+has been the most popular method for medical image segmentation. However, the
+existing networks based on the hybrid architecture suffer from two problems.
+First, although the CNN branch can capture image local features by using
+convolution operation, the vanilla convolution is unable to achieve adaptive
+extraction of image features. Second, although the Transformer branch can model
+the global information of images, the conventional self-attention only focuses
+on the spatial self-attention of images and ignores the channel and
+cross-dimensional self-attention leading to low segmentation accuracy for
+medical images with complex backgrounds. To solve these problems, we propose
+vision Transformer embrace convolutional neural networks for medical image
+segmentation (TEC-Net). Our network has two advantages. First, dynamic
+deformable convolution (DDConv) is designed in the CNN branch, which not only
+overcomes the difficulty of adaptive feature extraction using fixed-size
+convolution kernels, but also solves the defect that different inputs share the
+same convolution kernel parameters, effectively improving the feature
+expression ability of CNN branch. Second, in the Transformer branch, a
+(shifted)-window adaptive complementary attention module ((S)W-ACAM) and
+compact convolutional projection are designed to enable the network to fully
+learn the cross-dimensional long-range dependency of medical images with few
+parameters and calculations. Experimental results show that the proposed
+TEC-Net provides better medical image segmentation results than SOTA methods
+including CNN and Transformer networks. In addition, our TEC-Net requires fewer
+parameters and computational costs and does not rely on pre-training. The code
+is publicly available at https://github.com/SR0920/TEC-Net.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2306.03373
+
+
+
+
+
+
+ ♻ ☆ Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse
+ Problems
+
+
+ Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial
+technologies in the field of medical imaging. Score-based models have proven to
+be effective in addressing different inverse problems encountered in CT and
+MRI, such as sparse-view CT and fast MRI reconstruction. However, these models
+face challenges in achieving accurate three dimensional (3D) volumetric
+reconstruction. The existing score-based models primarily focus on
+reconstructing two dimensional (2D) data distribution, leading to
+inconsistencies between adjacent slices in the reconstructed 3D volumetric
+images. To overcome this limitation, we propose a novel two-and-a-half order
+score-based model (TOSM). During the training phase, our TOSM learns data
+distributions in 2D space, which reduces the complexity of training compared to
+directly working on 3D volumes. However, in the reconstruction phase, the TOSM
+updates the data distribution in 3D space, utilizing complementary scores along
+three directions (sagittal, coronal, and transaxial) to achieve a more precise
+reconstruction. The development of TOSM is built on robust theoretical
+principles, ensuring its reliability and efficacy. Through extensive
+experimentation on large-scale sparse-view CT and fast MRI datasets, our method
+demonstrates remarkable advancements and attains state-of-the-art results in
+solving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively
+addresses the inter-slice inconsistency issue, resulting in high-quality 3D
+volumetric reconstruction.
+
+
+
+ comment: 10 pages, 13 figures
+
+
+
+
+
+
+ ♻ ☆ Label-Efficient Deep Learning in Medical Image Analysis: Challenges and
+ Future Directions
+
+
+ Deep learning has seen rapid growth in recent years and achieved
+state-of-the-art performance in a wide range of applications. However, training
+models typically requires expensive and time-consuming collection of large
+quantities of labeled data. This is particularly true within the scope of
+medical imaging analysis (MIA), where data are limited and labels are expensive
+to be acquired. Thus, label-efficient deep learning methods are developed to
+make comprehensive use of the labeled data as well as the abundance of
+unlabeled and weak-labeled data. In this survey, we extensively investigated
+over 300 recent papers to provide a comprehensive overview of recent progress
+on label-efficient learning strategies in MIA. We first present the background
+of label-efficient learning and categorize the approaches into different
+schemes. Next, we examine the current state-of-the-art methods in detail
+through each scheme. Specifically, we provide an in-depth investigation,
+covering not only canonical semi-supervised, self-supervised, and
+multi-instance learning schemes, but also recently emerged active and
+annotation-efficient learning strategies. Moreover, as a comprehensive
+contribution to the field, this survey not only elucidates the commonalities
+and unique features of the surveyed methods but also presents a detailed
+analysis of the current challenges in the field and suggests potential avenues
+for future research.
+
+
+
+
+
+
+
+
+ Jiachen Zhou, Peizhuo Lv, Yibing Lan, Guozhu Meng, Kai Chen, Hualong Ma
+
+
+ Dataset sanitization is a widely adopted proactive defense against
+poisoning-based backdoor attacks, aimed at filtering out and removing poisoned
+samples from training datasets. However, existing methods have shown limited
+efficacy in countering the ever-evolving trigger functions, and often leading
+to considerable degradation of benign accuracy. In this paper, we propose
+DataElixir, a novel sanitization approach tailored to purify poisoned datasets.
+We leverage diffusion models to eliminate trigger features and restore benign
+features, thereby turning the poisoned samples into benign ones. Specifically,
+with multiple iterations of the forward and reverse process, we extract
+intermediary images and their predicted labels for each sample in the original
+dataset. Then, we identify anomalous samples in terms of the presence of label
+transition of the intermediary images, detect the target label by quantifying
+distribution discrepancy, select their purified images considering pixel and
+feature distance, and determine their ground-truth labels by training a benign
+model. Experiments conducted on 9 popular attacks demonstrates that DataElixir
+effectively mitigates various complex attacks while exerting minimal impact on
+benign accuracy, surpassing the performance of baseline defense methods.
+
+
+
+
+
+
+
+
+ Wenhao Wang, Yifan Sun, Wei Li, Yi Yang
+
+
+ This paper explores a hierarchical prompting mechanism for the hierarchical
+image classification (HIC) task. Different from prior HIC methods, our
+hierarchical prompting is the first to explicitly inject ancestor-class
+information as a tokenized hint that benefits the descendant-class
+discrimination. We think it well imitates human visual recognition, i.e.,
+humans may use the ancestor class as a prompt to draw focus on the subtle
+differences among descendant classes. We model this prompting mechanism into a
+Transformer with Hierarchical Prompting (TransHP). TransHP consists of three
+steps: 1) learning a set of prompt tokens to represent the coarse (ancestor)
+classes, 2) on-the-fly predicting the coarse class of the input image at an
+intermediate block, and 3) injecting the prompt token of the predicted coarse
+class into the intermediate feature. Though the parameters of TransHP maintain
+the same for all input images, the injected coarse-class prompt conditions
+(modifies) the subsequent feature extraction and encourages a dynamic focus on
+relatively subtle differences among the descendant classes. Extensive
+experiments show that TransHP improves image classification on accuracy (e.g.,
+improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data
+efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and
+model explainability. Moreover, TransHP also performs favorably against prior
+HIC methods, showing that TransHP well exploits the hierarchical information.
+The code is available at: https://github.com/WangWenhao0716/TransHP.
+
+
+
+ comment: Accepted to NeurIPS 2023; Released code
+
+
+
+
+
+
+ ♻ ☆ M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios
+
+
+
+
+
+
+
+
+ Ning Liao, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian
+
+
+ In realistic open-set scenarios where labels of a part of testing data are
+totally unknown, when vision-language (VL) prompt learning methods encounter
+inputs related to unknown classes (i.e., not seen during training), they always
+predict them as one of the training classes. The exhibited label bias causes
+difficulty in open set recognition (OSR), in which an image should be correctly
+predicted as one of the known classes or the unknown one. To achieve this goal,
+we propose a vision-language prompt tuning method with mitigated label bias
+(M-Tuning). It introduces open words from the WordNet to extend the range of
+words forming the prompt texts from only closed-set label words to more, and
+thus prompts are tuned in a simulated open-set scenario. Besides, inspired by
+the observation that classifying directly on large datasets causes a much
+higher false positive rate than on small datasets, we propose a Combinatorial
+Tuning and Testing (CTT) strategy for improving performance. CTT decomposes
+M-Tuning on large datasets as multiple independent group-wise tuning on fewer
+classes, then makes accurate and comprehensive predictions by selecting the
+optimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the
+literature, especially for prompt methods, we contribute new baselines for fair
+comparisons. Our method achieves the best performance on datasets with various
+scales, and extensive ablation studies also validate its effectiveness.
+
+
+
+
+
+
+
+ ♻ ☆ Consensus, dissensus and synergy between clinicians and specialist
+ foundation models in radiology report generation
+
+
+
+
+
+
+
+
+ Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Yossi Matias, Joelle Barral, Ali Eslami, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
+
+
+ Radiology reports are an instrumental part of modern medicine, informing key
+clinical decisions such as diagnosis and treatment. The worldwide shortage of
+radiologists, however, restricts access to expert care and imposes heavy
+workloads, contributing to avoidable errors and delays in report delivery.
+While recent progress in automated report generation with vision-language
+models offer clear potential in ameliorating the situation, the path to
+real-world adoption has been stymied by the challenge of evaluating the
+clinical quality of AI-generated reports. In this study, we build a
+state-of-the-art report generation system for chest radiographs,
+$\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation
+model on radiology data. To evaluate the quality of the AI-generated reports, a
+group of 16 certified radiologists provide detailed evaluations of AI-generated
+and human written reports for chest X-rays from an intensive care setting in
+the United States and an inpatient setting in India. At least one radiologist
+(out of two per case) preferred the AI report to the ground truth report in
+over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated
+reports that contain errors, the most frequently cited reasons were related to
+the location and finding, whereas for human written reports, most mistakes were
+related to severity and finding. This disparity suggested potential
+complementarity between our AI system and human experts, prompting us to
+develop an assistive scenario in which Flamingo-CXR generates a first-draft
+report, which is subsequently revised by a clinician. This is the first
+demonstration of clinician-AI collaboration for report writing, and the
+resultant reports are assessed to be equivalent or preferred by at least one
+radiologist to reports written by experts alone in 80$\%$ of in-patient cases
+and 60$\%$ of intensive care cases.
+
+
+
+
+
+
+
+
+ Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
+
+
+ In recent years, the deployment of large-scale pre-trained models in
+audio-visual downstream tasks has yielded remarkable outcomes. However, these
+models, primarily trained on single-modality unconstrained datasets, still
+encounter challenges in feature extraction for multi-modal tasks, leading to
+suboptimal performance. This limitation arises due to the introduction of
+irrelevant modality-specific information during encoding, which adversely
+affects the performance of downstream tasks. To address this challenge, this
+paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention
+mechanism. This mechanism leverages audio and visual modalities as soft prompts
+to dynamically adjust the parameters of pre-trained models based on the current
+multi-modal input features. Specifically, the DG-SCT module incorporates
+trainable cross-modal interaction layers into pre-trained audio-visual
+encoders, allowing adaptive extraction of crucial information from the current
+modality across spatial, channel, and temporal dimensions, while preserving the
+frozen parameters of large-scale pre-trained models. Experimental evaluations
+demonstrate that our proposed model achieves state-of-the-art results across
+multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our
+model exhibits promising performance in challenging few-shot and zero-shot
+scenarios. The source code and pre-trained models are available at
+https://github.com/haoyi-duan/DG-SCT.
+
+
+
+ comment: Accepted to NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling
+
+
+
+
+
+
+
+
+ Wei-Yin Ko, Daniel D'souza, Karina Nguyen, Randall Balestriero, Sara Hooker
+
+
+ Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way
+to improve top-line metrics and to outperform a larger single model. In this
+work, we go beyond top-line metrics and instead explore the impact of
+ensembling on subgroup performances. Surprisingly, we observe that even with a
+simple homogeneous ensemble -- all the individual DNNs share the same training
+set, architecture, and design choices -- the minority group performance
+disproportionately improves with the number of models compared to the majority
+group, i.e. fairness naturally emerges from ensembling. Even more surprising,
+we find that this gain keeps occurring even when a large number of models is
+considered, e.g. $20$, despite the fact that the average performance of the
+ensemble plateaus with fewer models. Our work establishes that simple DNN
+ensembles can be a powerful tool for alleviating disparate impact from DNN
+classifiers, thus curbing algorithmic harm. We also explore why this is the
+case. We find that even in homogeneous ensembles, varying the sources of
+stochasticity through parameter initialization, mini-batch sampling, and
+data-augmentation realizations, results in different fairness outcomes.
+
+
+
+
+
+
+
+ ♻ ☆ How to Efficiently Annotate Images for Best-Performing Deep Learning
+ Based Segmentation Models: An Empirical Study with Weak and Noisy Annotations
+ and Segment Anything Model
+
+
+
+
+
+
+
+
+ Yixin Zhang, Shen Zhao, Hanxue Gu, Maciej A. Mazurowski
+
+
+ Deep neural networks (DNNs) have been deployed for many image segmentation
+tasks and achieved outstanding performance. However, preparing a dataset for
+training segmentation DNNs is laborious and costly since typically pixel-level
+annotations are provided for each object of interest. To alleviate this issue,
+one can provide only weak labels such as bounding boxes or scribbles, or less
+accurate (noisy) annotations of the objects. These are significantly faster to
+generate and thus result in more annotated images given the same time budget.
+However, the reduction in quality might negatively affect the segmentation
+performance of the resulting model. In this study, we perform a thorough
+cost-effectiveness evaluation of several weak and noisy labels. We considered
+11 variants of annotation strategies and 4 datasets. We conclude that the
+common practice of accurately outlining the objects of interest is virtually
+never the optimal approach when the annotation time is limited, even if notable
+annotation time is available (10s of hours). Annotation approaches that stood
+out in such scenarios were (1) contour-based annotation with rough continuous
+traces, (2) polygon-based annotation with few vertices, and (3) box annotations
+combined with the Segment Anything Model (SAM). In situations where unlimited
+annotation time was available, precise annotations still lead to the highest
+segmentation model performance.
+
+
+
+
+
+
+
+ ♻ ☆ AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual
+ Masked Autoencoder ICTAI
+
+
+ Learning high-quality video representation has shown significant applications
+in computer vision and remains challenging. Previous work based on mask
+autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of
+learning representations in images and videos through reconstruction strategy
+in the visual modality. However, these models exhibit inherent limitations,
+particularly in scenarios where extracting features solely from the visual
+modality proves challenging, such as when dealing with low-resolution and
+blurry original videos. Based on this, we propose AV-MaskEnhancer for learning
+high-quality video representation by combining visual and audio information.
+Our approach addresses the challenge by demonstrating the complementary nature
+of audio and video features in cross-modality content. Moreover, our result of
+the video classification task on the UCF101 dataset outperforms the existing
+work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a
+top-5 accuracy of 99.9%.
+
+
+
+ comment: 2023 IEEE 35th International Conference on Tools with Artificial
+ Intelligence (ICTAI)
+
+
+
+
+
+
+ ♻ ☆ Confidence Contours: Uncertainty-Aware Annotation for Medical Semantic
+ Segmentation
+
+
+ Medical image segmentation modeling is a high-stakes task where understanding
+of uncertainty is crucial for addressing visual ambiguity. Prior work has
+developed segmentation models utilizing probabilistic or generative mechanisms
+to infer uncertainty from labels where annotators draw a singular boundary.
+However, as these annotations cannot represent an individual annotator's
+uncertainty, models trained on them produce uncertainty maps that are difficult
+to interpret. We propose a novel segmentation representation, Confidence
+Contours, which uses high- and low-confidence ``contours'' to capture
+uncertainty directly, and develop a novel annotation system for collecting
+contours. We conduct an evaluation on the Lung Image Dataset Consortium (LIDC)
+and a synthetic dataset. From an annotation study with 30 participants, results
+show that Confidence Contours provide high representative capacity without
+considerably higher annotator effort. We also find that general-purpose
+segmentation models can learn Confidence Contours at the same performance level
+as standard singular annotations. Finally, from interviews with 5 medical
+experts, we find that Confidence Contour maps are more interpretable than
+Bayesian maps due to representation of structural uncertainty.
+
+
+
+
+
+
+
+ ♻ ☆ Adversarial Purification with the Manifold Hypothesis AAAI 2024
+
+
+
+
+
+
+
+
+ Zhaoyuan Yang, Zhiwei Xu, Jing Zhang, Richard Hartley, Peter Tu
+
+
+ In this work, we formulate a novel framework for adversarial robustness using
+the manifold hypothesis. This framework provides sufficient conditions for
+defending against adversarial examples. We develop an adversarial purification
+method with this framework. Our method combines manifold learning with
+variational inference to provide adversarial robustness without the need for
+expensive adversarial training. Experimentally, our approach can provide
+adversarial robustness even if attackers are aware of the existence of the
+defense. In addition, our method can also serve as a test-time defense
+mechanism for variational autoencoders.
+
+
+
+ comment: Extended version of paper accepted at AAAI 2024 with supplementary
+ materials
+
+
+
+
+
+
+ ♻ ☆ KitBit: A New AI Model for Solving Intelligence Tests and Numerical
+ Series
+
+
+
+
+
+
+
+
+ Víctor Corsino, José Manuel Gilpérez, Luis Herrera
+
+
+ The resolution of intelligence tests, in particular numerical sequences, has
+been of great interest in the evaluation of AI systems. We present a new
+computational model called KitBit that uses a reduced set of algorithms and
+their combinations to build a predictive model that finds the underlying
+pattern in numerical sequences, such as those included in IQ tests and others
+of much greater complexity. We present the fundamentals of the model and its
+application in different cases. First, the system is tested on a set of number
+series used in IQ tests collected from various sources. Next, our model is
+successfully applied on the sequences used to evaluate the models reported in
+the literature. In both cases, the system is capable of solving these types of
+problems in less than a second using standard computing power. Finally,
+KitBit's algorithms have been applied for the first time to the complete set of
+entire sequences of the well-known OEIS database. We find a pattern in the form
+of a list of algorithms and predict the following terms in the largest number
+of series to date. These results demonstrate the potential of KitBit to solve
+complex problems that could be represented numerically.
+
+
+
+ comment: 11 pages
+
+
+
+
+
+
+ ♻ ☆ Skeletal Video Anomaly Detection using Deep Learning: Survey, Challenges
+ and Future Directions
+
+
+
+
+
+
+
+
+ Pratik K. Mishra, Alex Mihailidis, Shehroz S. Khan
+
+
+ The existing methods for video anomaly detection mostly utilize videos
+containing identifiable facial and appearance-based features. The use of videos
+with identifiable faces raises privacy concerns, especially when used in a
+hospital or community-based setting. Appearance-based features can also be
+sensitive to pixel-based noise, straining the anomaly detection methods to
+model the changes in the background and making it difficult to focus on the
+actions of humans in the foreground. Structural information in the form of
+skeletons describing the human motion in the videos is privacy-protecting and
+can overcome some of the problems posed by appearance-based features. In this
+paper, we present a survey of privacy-protecting deep learning anomaly
+detection methods using skeletons extracted from videos. We present a novel
+taxonomy of algorithms based on the various learning approaches. We conclude
+that skeleton-based approaches for anomaly detection can be a plausible
+privacy-protecting alternative for video anomaly detection. Lastly, we identify
+major open research questions and provide guidelines to address them.
+
+
+
+
+
+
+
+ ♻ ☆ Basis Scaling and Double Pruning for Efficient Inference in
+ Network-Based Transfer Learning
+
+
+
+
+
+
+
+
+ Ken C. L. Wong, Satyananda Kashyap, Mehdi Moradi
+
+
+ Network-based transfer learning allows the reuse of deep learning features
+with limited data, but the resulting models can be unnecessarily large.
+Although network pruning can improve inference efficiency, existing algorithms
+usually require fine-tuning that may not be suitable for small datasets. In
+this paper, using the singular value decomposition, we decompose a
+convolutional layer into two layers: a convolutional layer with the orthonormal
+basis vectors as the filters, and a "BasisScalingConv" layer which is
+responsible for rescaling the features and transforming them back to the
+original space. As the filters in each decomposed layer are linearly
+independent, when using the proposed basis scaling factors with the Taylor
+approximation of importance, pruning can be more effective and fine-tuning
+individual weights is unnecessary. Furthermore, as the numbers of input and
+output channels of the original convolutional layer remain unchanged after
+basis pruning, it is applicable to virtually all architectures and can be
+combined with existing pruning algorithms for double pruning to further
+increase the pruning capability. When transferring knowledge from ImageNet
+pre-trained models to different target domains, with less than 1% reduction in
+classification accuracies, we can achieve pruning ratios up to 74.6% for
+CIFAR-10 and 98.9% for MNIST in model parameters.
+
+
+
+ comment: This paper was accepted by Pattern Recognition Letters
+
+
+
+
+
+
+
+
+
+ Information Retrieval 18
+
+
+
+
+
+ ☆ dIR -- Discrete Information Retrieval: Conversational Search over
+ Unstructured (and Structured) Data with Large Language Models
+
+
+
+
+
+
+
+
+ Pablo M. Rodriguez Bertorello, Jean Rodmond Junior Laguerre
+
+
+ Data is stored in both structured and unstructured form. Querying both, to
+power natural language conversations, is a challenge. This paper introduces
+dIR, Discrete Information Retrieval, providing a unified interface to query
+both free text and structured knowledge. Specifically, a Large Language Model
+(LLM) transforms text into expressive representation. After the text is
+extracted into columnar form, it can then be queried via a text-to-SQL Semantic
+Parser, with an LLM converting natural language into SQL. Where desired, such
+conversation may be effected by a multi-step reasoning conversational agent. We
+validate our approach via a proprietary question/answer data set, concluding
+that dIR makes a whole new class of queries on free text possible when compared
+to traditionally fine-tuned dense-embedding-model-based Information Retrieval
+(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR
+can succeed where no other method stands a chance.
+
+
+
+ comment: 8 pages, 5 figures, Association for Computational Linguistics
+
+
+
+
+
+
+ ☆ BSL: Understanding and Improving Softmax Loss for Recommendation
+
+
+ Loss functions steer the optimization direction of recommendation models and
+are critical to model performance, but have received relatively little
+attention in recent recommendation research. Among various losses, we find
+Softmax loss (SL) stands out for not only achieving remarkable accuracy but
+also better robustness and fairness. Nevertheless, the current literature lacks
+a comprehensive explanation for the efficacy of SL. Toward addressing this
+research gap, we conduct theoretical analyses on SL and uncover three insights:
+1) Optimizing SL is equivalent to performing Distributionally Robust
+Optimization (DRO) on the negative data, thereby learning against perturbations
+on the negative distribution and yielding robustness to noisy negatives. 2)
+Comparing with other loss functions, SL implicitly penalizes the prediction
+variance, resulting in a smaller gap between predicted values and and thus
+producing fairer results. Building on these insights, we further propose a
+novel loss function Bilateral SoftMax Loss (BSL) that extends the advantage of
+SL to both positive and negative sides. BSL augments SL by applying the same
+Log-Expectation-Exp structure to positive examples as is used for negatives,
+making the model robust to the noisy positives as well. Remarkably, BSL is
+simple and easy-to-implement -- requiring just one additional line of code
+compared to SL. Experiments on four real-world datasets and three
+representative backbones demonstrate the effectiveness of our proposal. The
+code is available at https://github.com/junkangwu/BSL
+
+
+
+
+
+
+
+ ☆ Parallel Ranking of Ads and Creatives in Real-Time Advertising Systems AAAI2024
+
+
+
+
+
+
+
+
+ Zhiguang Yang, Lu Wang, Chun Gan, Liufang Sang, Haoran Wang, Wenlong Chen, Jie He, Changping Peng, Zhangang Lin, Jingping Shao
+
+
+ "Creativity is the heart and soul of advertising services". Effective
+creatives can create a win-win scenario: advertisers can reach target users and
+achieve marketing objectives more effectively, users can more quickly find
+products of interest, and platforms can generate more advertising revenue. With
+the advent of AI-Generated Content, advertisers now can produce vast amounts of
+creative content at a minimal cost. The current challenge lies in how
+advertising systems can select the most pertinent creative in real-time for
+each user personally. Existing methods typically perform serial ranking of ads
+or creatives, limiting the creative module in terms of both effectiveness and
+efficiency. In this paper, we propose for the first time a novel architecture
+for online parallel estimation of ads and creatives ranking, as well as the
+corresponding offline joint optimization model. The online architecture enables
+sophisticated personalized creative modeling while reducing overall latency.
+The offline joint model for CTR estimation allows mutual awareness and
+collaborative optimization between ads and creatives. Additionally, we optimize
+the offline evaluation metrics for the implicit feedback sorting task involved
+in ad creative ranking. We conduct extensive experiments to compare ours with
+two state-of-the-art approaches. The results demonstrate the effectiveness of
+our approach in both offline evaluations and real-world advertising platforms
+online in terms of response time, CTR, and CPM.
+
+
+
+ comment: 9 pages, 4 figures, AAAI2024
+
+
+
+
+
+
+ ☆ Fine-tuning Large Language Models for Adaptive Machine Translation
+
+
+
+
+
+
+
+
+ Yasmin Moslem, Rejwanul Haque, Andy Way
+
+
+ This paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose
+large language model (LLM), for adaptive machine translation (MT). The
+fine-tuning process involves utilising a combination of zero-shot and one-shot
+translation prompts within the medical domain. The primary objective is to
+enhance real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt
+translations to the required domain at inference time. The results,
+particularly for Spanish-to-English MT, showcase the efficacy of the fine-tuned
+model, demonstrating quality improvements in both zero-shot and one-shot
+translation scenarios, surpassing Mistral 7B's baseline performance. Notably,
+the fine-tuned Mistral outperforms ChatGPT "gpt-3.5-turbo" in zero-shot
+translation while achieving comparable one-shot translation quality. Moreover,
+the zero-shot translation of the fine-tuned Mistral matches NLLB 3.3B's
+performance, and its one-shot translation quality surpasses that of NLLB 3.3B.
+These findings emphasise the significance of fine-tuning efficient LLMs like
+Mistral 7B to yield high-quality zero-shot translations comparable to
+task-oriented models like NLLB 3.3B. Additionally, the adaptive gains achieved
+in one-shot translation are comparable to those of commercial LLMs such as
+ChatGPT. Our experiments demonstrate that, with a relatively small dataset of
+20,000 segments that incorporate a mix of zero-shot and one-shot prompts,
+fine-tuning significantly enhances Mistral's in-context learning ability,
+especially for real-time adaptive MT.
+
+
+
+
+
+
+
+ ☆ Lookahead: An Inference Acceleration Framework for Large Language Model
+ with Lossless Generation Accuracy
+
+
+ As Large Language Models (LLMs) have made significant advancements across
+various tasks, such as question answering, translation, text summarization, and
+dialogue systems, the need for accuracy in information becomes crucial,
+especially for serious financial products serving billions of users like
+Alipay. To address this, Alipay has developed a Retrieval-Augmented Generation
+(RAG) system that grounds LLMs on the most accurate and up-to-date information.
+However, for a real-world product serving millions of users, the inference
+speed of LLMs becomes a critical factor compared to a mere experimental model.
+ Hence, this paper presents a generic framework for accelerating the inference
+process, resulting in a substantial increase in speed and cost reduction for
+our RAG system, with lossless generation accuracy. In the traditional inference
+process, each token is generated sequentially by the LLM, leading to a time
+consumption proportional to the number of generated tokens. To enhance this
+process, our framework, named \textit{lookahead}, introduces a
+\textit{multi-branch} strategy. Instead of generating a single token at a time,
+we propose a \textit{Trie-based Retrieval} (TR) process that enables the
+generation of multiple branches simultaneously, each of which is a sequence of
+tokens. Subsequently, for each branch, a \textit{Verification and Accept} (VA)
+process is performed to identify the longest correct sub-sequence as the final
+output. Our strategy offers two distinct advantages: (1) it guarantees absolute
+correctness of the output, avoiding any approximation algorithms, and (2) the
+worst-case performance of our approach is equivalent to the conventional
+process. We conduct extensive experiments to demonstrate the significant
+improvements achieved by applying our inference acceleration framework.
+
+
+
+ comment: 10 pages, 6 figures
+
+
+
+
+
+
+ ☆ Categorical, Ratio, and Professorial Data: The Case for Reciprocal Rank
+
+
+ Search engine results pages are usually abstracted as binary relevance
+vectors and hence are categorical data, meaning that only a limited set of
+operations is permitted, most notably tabulation of occurrence frequencies,
+with determination of medians and averages not possible. To compare retrieval
+systems it is thus usual to make use of a categorical-to-numeric effectiveness
+mapping. A previous paper has argued that any desired categorical-to-numeric
+mapping may be used, provided only that there is an argued connection between
+each category of SERP and the score that is assigned to that category by the
+mapping. Further, once that plausible connection has been established, then the
+mapped values can be treated as real-valued observations on a ratio scale,
+allowing the computation of averages. This article is written in support of
+that point of view, and to respond to ongoing claims that SERP scores may only
+be averaged if very restrictive conditions are imposed on the effectiveness
+mapping.
+
+
+
+
+
+
+
+ ☆ Accuracy vs Memory Advantage in the Quantum Simulation of Stochastic
+ Processes
+
+
+ Many inference scenarios rely on extracting relevant information from known
+data in order to make future predictions. When the underlying stochastic
+process satisfies certain assumptions, there is a direct mapping between its
+exact classical and quantum simulators, with the latter asymptotically using
+less memory. Here we focus on studying whether such quantum advantage persists
+when those assumptions are not satisfied, and the model is doomed to have
+imperfect accuracy. By studying the trade-off between accuracy and memory
+requirements, we show that quantum models can reach the same accuracy with less
+memory, or alternatively, better accuracy with the same memory. Finally, we
+discuss the implications of this result for learning tasks.
+
+
+
+
+
+
+
+ ☆ Zero-1-to-3: Domain-level Zero-shot Cognitive Diagnosis via One Batch of
+ Early-bird Students towards Three Diagnostic Objectives AAAI2024
+
+
+ Cognitive diagnosis seeks to estimate the cognitive states of students by
+exploring their logged practice quiz data. It plays a pivotal role in
+personalized learning guidance within intelligent education systems. In this
+paper, we focus on an important, practical, yet often underexplored task:
+domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the
+absence of student practice logs in newly launched domains. Recent cross-domain
+diagnostic models have been demonstrated to be a promising strategy for DZCD.
+These methods primarily focus on how to transfer student states across domains.
+However, they might inadvertently incorporate non-transferable information into
+student representations, thereby limiting the efficacy of knowledge transfer.
+To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive
+diagnosis framework via one batch of early-bird students towards three
+diagnostic objectives. Our approach initiates with pre-training a diagnosis
+model with dual regularizers, which decouples student states into domain-shared
+and domain-specific parts. The shared cognitive signals can be transferred to
+the target domain, enriching the cognitive priors for the new domain, which
+ensures the cognitive state propagation objective. Subsequently, we devise a
+strategy to generate simulated practice logs for cold-start students through
+analyzing the behavioral patterns from early-bird students, fulfilling the
+domain-adaption goal. Consequently, we refine the cognitive states of
+cold-start students as diagnostic outcomes via virtual data, aligning with the
+diagnosis-oriented goal. Finally, extensive experiments on six real-world
+datasets highlight the efficacy of our model for DZCD and its practical
+application in question recommendation.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ VADIS -- a VAriable Detection, Interlinking and Summarization system ECIR 2024
+
+
+
+
+
+
+
+
+ Yavuz Selim Kartal, Muhammad Ahsan Shahid, Sotaro Takeshita, Tornike Tsereteli, Andrea Zielinski, Benjamin Zapilko, Philipp Mayr
+
+
+ The VADIS system addresses the demand of providing enhanced information
+access in the domain of the social sciences. This is achieved by allowing users
+to search and use survey variables in context of their underlying research data
+and scholarly publications which have been interlinked with each other.
+
+
+
+ comment: It is 4 pages and 2 figures. This paper has recently been accepted by
+ ECIR 2024 Demo Track and this version is the camera-ready version of the
+ paper
+
+ Session-based recommendation, which aims to predict the next item of users'
+interest as per an existing sequence interaction of items, has attracted
+growing applications of Contrastive Learning (CL) with improved user and item
+representations. However, these contrastive objectives: (1) serve a similar
+role as the cross-entropy loss while ignoring the item representation space
+optimisation; and (2) commonly require complicated modelling, including complex
+positive/negative sample constructions and extra data augmentation. In this
+work, we introduce Self-Contrastive Learning (SCL), which simplifies the
+application of CL and enhances the performance of state-of-the-art CL-based
+recommendation techniques. Specifically, SCL is formulated as an objective
+function that directly promotes a uniform distribution among item
+representations and efficiently replaces all the existing contrastive objective
+components of state-of-the-art models. Unlike previous works, SCL eliminates
+the need for any positive/negative sample construction or data augmentation,
+leading to enhanced interpretability of the item representation space and
+facilitating its extensibility to existing recommender systems. Through
+experiments on three benchmark datasets, we demonstrate that SCL consistently
+improves the performance of state-of-the-art models with statistical
+significance. Notably, our experiments show that SCL improves the performance
+of two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and
+11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks.
+Additionally, our analysis elucidates the improvement in terms of alignment and
+uniformity of representations, as well as the effectiveness of SCL with a low
+computational cost.
+
+
+
+ comment: ECIR 2024 (Full Paper) Camera-ready Version. Code is available at
+ https://github.com/ZhengxiangShi/SelfContrastiveLearningRecSys
+
+
+
+
+
+
+ ♻ ☆ A Unified Framework for Multi-Domain CTR Prediction via Large Language
+ Models
+
+
+ Click-Through Rate (CTR) prediction is a crucial task in online
+recommendation platforms as it involves estimating the probability of user
+engagement with advertisements or items by clicking on them. Given the
+availability of various services like online shopping, ride-sharing, food
+delivery, and professional services on commercial platforms, recommendation
+systems in these platforms are required to make CTR predictions across multiple
+domains rather than just a single domain. However, multi-domain click-through
+rate (MDCTR) prediction remains a challenging task in online recommendation due
+to the complex mutual influence between domains. Traditional MDCTR models
+typically encode domains as discrete identifiers, ignoring rich semantic
+information underlying. Consequently, they can hardly generalize to new
+domains. Besides, existing models can be easily dominated by some specific
+domains, which results in significant performance drops in the other domains
+(\ie the ``seesaw phenomenon``). In this paper, we propose a novel solution
+Uni-CTR to address the above challenges. Uni-CTR leverages a backbone Large
+Language Model (LLM) to learn layer-wise semantic representations that capture
+commonalities between domains. Uni-CTR also uses several domain-specific
+networks to capture the characteristics of each domain. Note that we design a
+masked loss strategy so that these domain-specific networks are decoupled from
+backbone LLM. This allows domain-specific networks to remain unchanged when
+incorporating new or removing domains, thereby enhancing the flexibility and
+scalability of the system significantly. Experimental results on three public
+datasets show that Uni-CTR outperforms the state-of-the-art (SOTA) MDCTR models
+significantly. Furthermore, Uni-CTR demonstrates remarkable effectiveness in
+zero-shot prediction. We have applied Uni-CTR in industrial scenarios,
+confirming its efficiency.
+
+
+ Ensuring fairness in Recommendation Systems (RSs) across demographic groups
+is critical due to the increased integration of RSs in applications such as
+personalized healthcare, finance, and e-commerce. Graph-based RSs play a
+crucial role in capturing intricate higher-order interactions among entities.
+However, integrating these graph models into the Federated Learning (FL)
+paradigm with fairness constraints poses formidable challenges as this requires
+access to the entire interaction graph and sensitive user information (such as
+gender, age, etc.) at the central server. This paper addresses the pervasive
+issue of inherent bias within RSs for different demographic groups without
+compromising the privacy of sensitive user attributes in FL environment with
+the graph-based model. To address the group bias, we propose F2PGNN (Fair
+Federated Personalized Graph Neural Network), a novel framework that leverages
+the power of Personalized Graph Neural Network (GNN) coupled with fairness
+considerations. Additionally, we use differential privacy techniques to fortify
+privacy protection. Experimental evaluation on three publicly available
+datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47%
+- 99% compared to the state-of-the-art while preserving privacy and maintaining
+the utility. The results validate the significance of our framework in
+achieving equitable and personalized recommendations using GNN within the FL
+landscape.
+
+
+
+ comment: To appear as a full paper in AAAI 2024
+
+ Cross-Domain Sequential Recommendation (CDSR) methods aim to tackle the data
+sparsity and cold-start problems present in Single-Domain Sequential
+Recommendation (SDSR). Existing CDSR works design their elaborate structures
+relying on overlapping users to propagate the cross-domain information.
+However, current CDSR methods make closed-world assumptions, assuming fully
+overlapping users across multiple domains and that the data distribution
+remains unchanged from the training environment to the test environment. As a
+result, these methods typically result in lower performance on online
+real-world platforms due to the data distribution shifts. To address these
+challenges under open-world assumptions, we design an \textbf{A}daptive
+\textbf{M}ulti-\textbf{I}nterest \textbf{D}ebiasing framework for cross-domain
+sequential recommendation (\textbf{AMID}), which consists of a multi-interest
+information module (\textbf{MIM}) and a doubly robust estimator (\textbf{DRE}).
+Our framework is adaptive for open-world environments and can improve the model
+of most off-the-shelf single-domain sequential backbone models for CDSR. Our
+MIM establishes interest groups that consider both overlapping and
+non-overlapping users, allowing us to effectively explore user intent and
+explicit interest. To alleviate biases across multiple domains, we developed
+the DRE for the CDSR methods. We also provide a theoretical analysis that
+demonstrates the superiority of our proposed estimator in terms of bias and
+tail bound, compared to the IPS estimator used in previous work.
+
+
+
+
+
+
+
+ ♻ ☆ A novel diffusion recommendation algorithm based on multi-scale cnn and
+ residual lstm
+
+
+ Sequential recommendation aims to infer user preferences from historical
+interaction sequences and predict the next item that users may be interested in
+the future. The current mainstream design approach is to represent items as
+fixed vectors, capturing the underlying relationships between items and user
+preferences based on the order of interactions. However, relying on a single
+fixed-item embedding may weaken the modeling capability of the system, and the
+global dynamics and local saliency exhibited by user preferences need to be
+distinguished. To address these issues, this paper proposes a novel diffusion
+recommendation algorithm based on multi-scale cnn and residual lstm (AREAL). We
+introduce diffusion models into the recommend system, representing items as
+probability distributions instead of fixed vectors. This approach enables
+adaptive reflection of multiple aspects of the items and generates item
+distributions in a denoising manner. We use multi-scale cnn and residual lstm
+methods to extract the local and global dependency features of user history
+interactions, and use attention mechanism to distinguish weights as the guide
+features of reverse diffusion recovery. The effectiveness of the proposed
+method is validated through experiments conducted on two real-world datasets.
+Specifically, AREAL obtains improvements over the best baselines by 2.63% and
+4.25% in terms of HR@20 and 5.05% and 3.94% in terms of NDCG@20 on all
+datasets.
+
+
+
+ comment: This paper needs to be further modified, including the ablation
+ experiment, model framework and other information in Chapter 5. There are
+ some inaccuracies in the presentation of this paper. Two datasets are used
+ instead of three, and there are many inaccuracies in the presentation, which
+ need to be further corrected
+
+
+
+
+
+
+ ♻ ☆ GraphPro: Graph Pre-training and Prompt Learning for Recommendation
+
+
+ GNN-based recommenders have excelled in modeling intricate user-item
+interactions through multi-hop message passing. However, existing methods often
+overlook the dynamic nature of evolving user-item interactions, which impedes
+the adaption to changing user preferences and distribution shifts in newly
+arriving data. Thus, their scalability and performances in real-world dynamic
+environments are limited. In this study, we propose GraphPro, a framework that
+incorporates parameter-efficient and dynamic graph pre-training with prompt
+learning. This novel combination empowers GNNs to effectively capture both
+long-term user preferences and short-term behavior dynamics, enabling the
+delivery of accurate and timely recommendations. Our GraphPro framework
+addresses the challenge of evolving user preferences by seamlessly integrating
+a temporal prompt mechanism and a graph-structural prompt learning mechanism
+into the pre-trained GNN model. The temporal prompt mechanism encodes time
+information on user-item interaction, allowing the model to naturally capture
+temporal context, while the graph-structural prompt learning mechanism enables
+the transfer of pre-trained knowledge to adapt to behavior dynamics without the
+need for continuous incremental training. We further bring in a dynamic
+evaluation setting for recommendation to mimic real-world dynamic scenarios and
+bridge the offline-online gap to a better level. Our extensive experiments
+including a large-scale industrial deployment showcases the lightweight plug-in
+scalability of our GraphPro when integrated with various state-of-the-art
+recommenders, emphasizing the advantages of GraphPro in terms of effectiveness,
+robustness and efficiency.
+
+
+
+
+
+
+
+
+ Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-yi Lee, Jyh-Shing Roger Jang
+
+
+ Audio-visual synchronization aims to determine whether the mouth movements
+and speech in the video are synchronized. VocaLiST reaches state-of-the-art
+performance by incorporating multimodal Transformers to model audio-visual
+interact information. However, it requires high computing resources, making it
+impractical for real-world applications. This paper proposed an MTDVocaLiST
+model, which is trained by our proposed multimodal Transformer distillation
+(MTD) loss. MTD loss enables MTDVocaLiST model to deeply mimic the
+cross-attention distribution and value-relation in the Transformer of VocaLiST.
+Additionally, we harness uncertainty weighting to fully exploit the interaction
+information across all layers. Our proposed method is effective in two aspects:
+From the distillation method perspective, MTD loss outperforms other strong
+distillation baselines. From the distilled model's performance perspective: 1)
+MTDVocaLiST outperforms similar-size SOTA models, SyncNet, and Perfect Match
+models by 15.65% and 3.35%; 2) MTDVocaLiST reduces the model size of VocaLiST
+by 83.52%, yet still maintaining similar performance.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Budgeted Embedding Table For Recommender Systems WSDM 2024
+
+
+ At the heart of contemporary recommender systems (RSs) are latent factor
+models that provide quality recommendation experience to users. These models
+use embedding vectors, which are typically of a uniform and fixed size, to
+represent users and items. As the number of users and items continues to grow,
+this design becomes inefficient and hard to scale. Recent lightweight embedding
+methods have enabled different users and items to have diverse embedding sizes,
+but are commonly subject to two major drawbacks. Firstly, they limit the
+embedding size search to optimizing a heuristic balancing the recommendation
+quality and the memory complexity, where the trade-off coefficient needs to be
+manually tuned for every memory budget requested. The implicitly enforced
+memory complexity term can even fail to cap the parameter usage, making the
+resultant embedding table fail to meet the memory budget strictly. Secondly,
+most solutions, especially reinforcement learning based ones derive and
+optimize the embedding size for each each user/item on an instance-by-instance
+basis, which impedes the search efficiency. In this paper, we propose Budgeted
+Embedding Table (BET), a novel method that generates table-level actions (i.e.,
+embedding sizes for all users and items) that is guaranteed to meet
+pre-specified memory budgets. Furthermore, by leveraging a set-based action
+formulation and engaging set representation learning, we present an innovative
+action search strategy powered by an action fitness predictor that efficiently
+evaluates each table-level action. Experiments have shown state-of-the-art
+performance on two real-world datasets when BET is paired with three popular
+recommender models under different memory budgets.
+
+
+
+ comment: Accepted by WSDM 2024
+
+
+
+
+
+
+ ♻ ☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 169
+
+
+
+
+
+ ☆ dIR -- Discrete Information Retrieval: Conversational Search over
+ Unstructured (and Structured) Data with Large Language Models
+
+
+
+
+
+
+
+
+ Pablo M. Rodriguez Bertorello, Jean Rodmond Junior Laguerre
+
+
+ Data is stored in both structured and unstructured form. Querying both, to
+power natural language conversations, is a challenge. This paper introduces
+dIR, Discrete Information Retrieval, providing a unified interface to query
+both free text and structured knowledge. Specifically, a Large Language Model
+(LLM) transforms text into expressive representation. After the text is
+extracted into columnar form, it can then be queried via a text-to-SQL Semantic
+Parser, with an LLM converting natural language into SQL. Where desired, such
+conversation may be effected by a multi-step reasoning conversational agent. We
+validate our approach via a proprietary question/answer data set, concluding
+that dIR makes a whole new class of queries on free text possible when compared
+to traditionally fine-tuned dense-embedding-model-based Information Retrieval
+(IR) and SQL-based Knowledge Bases (KB). For sufficiently complex queries, dIR
+can succeed where no other method stands a chance.
+
+
+
+ comment: 8 pages, 5 figures, Association for Computational Linguistics
+
+
+
+
+
+
+ ☆ A note on regularised NTK dynamics with an application to PAC-Bayesian
+ training
+
+
+ We establish explicit dynamics for neural networks whose training objective
+has a regularising term that constrains the parameters to remain close to their
+initial value. This keeps the network in a lazy training regime, where the
+dynamics can be linearised around the initialisation. The standard neural
+tangent kernel (NTK) governs the evolution during the training in the
+infinite-width limit, although the regularisation yields an additional term
+appears in the differential equation describing the dynamics. This setting
+provides an appropriate framework to study the evolution of wide networks
+trained to optimise generalisation objectives such as PAC-Bayes bounds, and
+hence potentially contribute to a deeper theoretical understanding of such
+networks.
+
+
+
+
+
+
+
+ ☆ Conditional Image Generation with Pretrained Generative Model
+
+
+ In recent years, diffusion models have gained popularity for their ability to
+generate higher-quality images in comparison to GAN models. However, like any
+other large generative models, these models require a huge amount of data,
+computational resources, and meticulous tuning for successful training. This
+poses a significant challenge, rendering it infeasible for most individuals. As
+a result, the research community has devised methods to leverage pre-trained
+unconditional diffusion models with additional guidance for the purpose of
+conditional image generative. These methods enable conditional image
+generations on diverse inputs and, most importantly, circumvent the need for
+training the diffusion model. In this paper, our objective is to reduce the
+time-required and computational overhead introduced by the addition of guidance
+in diffusion models -- while maintaining comparable image quality. We propose a
+set of methods based on our empirical analysis, demonstrating a reduction in
+computation time by approximately threefold.
+
+
+
+
+
+
+
+ ☆ The role of data embedding in equivariant quantum convolutional neural
+ networks
+
+
+ Geometric deep learning refers to the scenario in which the symmetries of a
+dataset are used to constrain the parameter space of a neural network and thus,
+improve their trainability and generalization. Recently this idea has been
+incorporated into the field of quantum machine learning, which has given rise
+to equivariant quantum neural networks (EQNNs). In this work, we investigate
+the role of classical-to-quantum embedding on the performance of equivariant
+quantum convolutional neural networks (EQCNNs) for the classification of
+images. We discuss the connection between the data embedding method and the
+resulting representation of a symmetry group and analyze how changing
+representation affects the expressibility of an EQCNN. We numerically compare
+the classification accuracy of EQCNNs with three different basis-permuted
+amplitude embeddings to the one obtained from a non-equivariant quantum
+convolutional neural network (QCNN). Our results show that all the EQCNNs
+achieve higher classification accuracy than the non-equivariant QCNN for small
+numbers of training iterations, while for large iterations this improvement
+crucially depends on the used embedding. It is expected that the results of
+this work can be useful to the community for a better understanding of the
+importance of data embedding choice in the context of geometric quantum machine
+learning.
+
+
+
+ comment: 9 pages, 7 figures
+
+
+
+
+
+
+ ☆ Enhancing Neural Training via a Correlated Dynamics Model
+
+
+
+
+
+
+
+
+ Jonathan Brokman, Roy Betser, Rotem Turjeman, Tom Berkov, Ido Cohen, Guy Gilboa
+
+
+ As neural networks grow in scale, their training becomes both computationally
+demanding and rich in dynamics. Amidst the flourishing interest in these
+training dynamics, we present a novel observation: Parameters during training
+exhibit intrinsic correlations over time. Capitalizing on this, we introduce
+Correlation Mode Decomposition (CMD). This algorithm clusters the parameter
+space into groups, termed modes, that display synchronized behavior across
+epochs. This enables CMD to efficiently represent the training dynamics of
+complex networks, like ResNets and Transformers, using only a few modes.
+Moreover, test set generalization is enhanced. We introduce an efficient CMD
+variant, designed to run concurrently with training. Our experiments indicate
+that CMD surpasses the state-of-the-art method for compactly modeled dynamics
+on image classification. Our modeling can improve training efficiency and lower
+communication overhead, as shown by our preliminary experiments in the context
+of federated learning.
+
+
+
+
+
+
+
+
+ Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov
+
+
+ Diffusion models have gained traction as powerful algorithms for synthesizing
+high-quality images. Central to these algorithms is the diffusion process,
+which maps data to noise according to equations inspired by thermodynamics and
+can significantly impact performance. A widely held assumption is that the ELBO
+objective of a diffusion model is invariant to the noise process (Kingma et
+al.,2021). In this work, we dispel this assumption -- we propose multivariate
+learned adaptive noise (MuLAN), a learned diffusion process that applies
+Gaussian noise at different rates across an image. Our method consists of three
+components -- a multivariate noise schedule, instance-conditional diffusion,
+and auxiliary variables -- which ensure that the learning objective is no
+longer invariant to the choice of the noise schedule as in previous works. Our
+work is grounded in Bayesian inference and casts the learned diffusion process
+as an approximate variational posterior that yields a tighter lower bound on
+marginal likelihood. Empirically, MuLAN sets a new state-of-the-art in density
+estimation on CIFAR-10 and ImageNet compared to classical diffusion. Code is
+available at https://github.com/s-sahoo/MuLAN
+
+
+
+
+
+
+
+ ☆ Position Paper: Bridging the Gap Between Machine Learning and
+ Sensitivity Analysis
+
+
+
+
+
+
+
+
+ Christian A. Scholbeck, Julia Moosbauer, Giuseppe Casalicchio, Hoshin Gupta, Bernd Bischl, Christian Heumann
+
+
+ We argue that interpretations of machine learning (ML) models or the
+model-building process can bee seen as a form of sensitivity analysis (SA), a
+general methodology used to explain complex systems in many fields such as
+environmental modeling, engineering, or economics. We address both researchers
+and practitioners, calling attention to the benefits of a unified SA-based view
+of explanations in ML and the necessity to fully credit related work. We bridge
+the gap between both fields by formally describing how (a) the ML process is a
+system suitable for SA, (b) how existing ML interpretation methods relate to
+this perspective, and (c) how other SA techniques could be applied to ML.
+
+
+
+
+
+
+
+ ☆ FiFAR: A Fraud Detection Dataset for Learning to Defer
+
+
+
+
+
+
+
+
+ Jean V. Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
+
+
+ Public dataset limitations have significantly hindered the development and
+benchmarking of learning to defer (L2D) algorithms, which aim to optimally
+combine human and AI capabilities in hybrid decision-making systems. In such
+systems, human availability and domain-specific concerns introduce
+difficulties, while obtaining human predictions for training and evaluation is
+costly. Financial fraud detection is a high-stakes setting where algorithms and
+human experts often work in tandem; however, there are no publicly available
+datasets for L2D concerning this important application of human-AI teaming. To
+fill this gap in L2D research, we introduce the Financial Fraud Alert Review
+Dataset (FiFAR), a synthetic bank account fraud detection dataset, containing
+the predictions of a team of 50 highly complex and varied synthetic fraud
+analysts, with varied bias and feature dependence. We also provide a realistic
+definition of human work capacity constraints, an aspect of L2D systems that is
+often overlooked, allowing for extensive testing of assignment systems under
+real-world conditions. We use our dataset to develop a capacity-aware L2D
+method and rejection learning approach under realistic data availability
+conditions, and benchmark these baselines under an array of 300 distinct
+testing scenarios. We believe that this dataset will serve as a pivotal
+instrument in facilitating a systematic, rigorous, reproducible, and
+transparent evaluation and comparison of L2D methods, thereby fostering the
+development of more synergistic human-AI collaboration in decision-making
+systems. The public dataset and detailed synthetic expert information are
+available at: https://github.com/feedzai/fifar-dataset
+
+
+
+ comment: The public dataset and detailed synthetic expert information are
+ available at: https://github.com/feedzai/fifar-dataset
+
+
+
+
+
+
+ ☆ A 3D super-resolution of wind fields via physics-informed pixel-wise
+ self-attention generative adversarial network NeurIPS 2023
+
+
+
+
+
+
+
+
+ Takuya Kurihana, Kyongmin Yeo, Daniela Szwarcman, Bruce Elmegreen, Karthik Mukkavilli, Johannes Schmude, Levente Klein
+
+
+ To mitigate global warming, greenhouse gas sources need to be resolved at a
+high spatial resolution and monitored in time to ensure the reduction and
+ultimately elimination of the pollution source. However, the complexity of
+computation in resolving high-resolution wind fields left the simulations
+impractical to test different time lengths and model configurations. This study
+presents a preliminary development of a physics-informed super-resolution (SR)
+generative adversarial network (GAN) that super-resolves the three-dimensional
+(3D) low-resolution wind fields by upscaling x9 times. We develop a pixel-wise
+self-attention (PWA) module that learns 3D weather dynamics via a
+self-attention computation followed by a 2D convolution. We also employ a loss
+term that regularizes the self-attention map during pretraining, capturing the
+vertical convection process from input wind data. The new PWA SR-GAN shows the
+high-fidelity super-resolved 3D wind data, learns a wind structure at the
+high-frequency domain, and reduces the computational cost of a high-resolution
+wind simulation by x89.7 times.
+
+
+
+
+
+
+
+
+ Hendrik Poulsen Nautrup, Hans J. Briegel
+
+
+ Measurement-based quantum computation (MBQC) is a paradigm for quantum
+computation where computation is driven by local measurements on a suitably
+entangled resource state. In this work we show that MBQC is related to a model
+of quantum computation based on Clifford quantum cellular automata (CQCA).
+Specifically, we show that certain MBQCs can be directly constructed from CQCAs
+which yields a simple and intuitive circuit model representation of MBQC in
+terms of quantum computation based on CQCA. We apply this description to
+construct various MBQC-based Ans\"atze for parameterized quantum circuits,
+demonstrating that the different Ans\"atze may lead to significantly different
+performances on different learning tasks. In this way, MBQC yields a family of
+Hardware-efficient Ans\"atze that may be adapted to specific problem settings
+and is particularly well suited for architectures with translationally
+invariant gates such as neutral atoms.
+
+
+
+ comment: 16 pages, 12 figures
+
+
+
+
+
+
+ ☆ Learning Fair Policies for Multi-stage Selection Problems from
+ Observational Data AAAI
+
+
+ We consider the problem of learning fair policies for multi-stage selection
+problems from observational data. This problem arises in several high-stakes
+domains such as company hiring, loan approval, or bail decisions where outcomes
+(e.g., career success, loan repayment, recidivism) are only observed for those
+selected. We propose a multi-stage framework that can be augmented with various
+fairness constraints, such as demographic parity or equal opportunity. This
+problem is a highly intractable infinite chance-constrained program involving
+the unknown joint distribution of covariates and outcomes. Motivated by the
+potential impact of selection decisions on people's lives and livelihoods, we
+propose to focus on interpretable linear selection rules. Leveraging tools from
+causal inference and sample average approximation, we obtain an asymptotically
+consistent solution to this selection problem by solving a mixed binary conic
+optimization problem, which can be solved using standard off-the-shelf solvers.
+We conduct extensive computational experiments on a variety of datasets adapted
+from the UCI repository on which we show that our proposed approaches can
+achieve an 11.6% improvement in precision and a 38% reduction in the measure of
+unfairness compared to the existing selection policy.
+
+
+
+
+
+
+
+ ☆ Gappy local conformal auto-encoders for heterogeneous data fusion: in
+ praise of rigidity
+
+
+
+
+
+
+
+
+ Erez Peterfreund, Iryna Burak, Ofir Lindenbaum, Jim Gimlett, Felix Dietrich, Ronald R. Coifman, Ioannis G. Kevrekidis
+
+
+ Fusing measurements from multiple, heterogeneous, partial sources, observing
+a common object or process, poses challenges due to the increasing availability
+of numbers and types of sensors. In this work we propose, implement and
+validate an end-to-end computational pipeline in the form of a
+multiple-auto-encoder neural network architecture for this task. The inputs to
+the pipeline are several sets of partial observations, and the result is a
+globally consistent latent space, harmonizing (rigidifying, fusing) all
+measurements. The key enabler is the availability of multiple slightly
+perturbed measurements of each instance:, local measurement, "bursts", that
+allows us to estimate the local distortion induced by each instrument. We
+demonstrate the approach in a sequence of examples, starting with simple
+two-dimensional data sets and proceeding to a Wi-Fi localization problem and to
+the solution of a "dynamical puzzle" arising in spatio-temporal observations of
+the solutions of Partial Differential Equations.
+
+
+ Stochastic differential equations (SDEs) have been widely used to model real
+world random phenomena. Existing works mainly focus on the case where the time
+series is modeled by a single SDE, which might be restrictive for modeling time
+series with distributional shift. In this work, we propose a change point
+detection algorithm for time series modeled as neural SDEs. Given a time series
+dataset, the proposed method jointly learns the unknown change points and the
+parameters of distinct neural SDE models corresponding to each change point.
+Specifically, the SDEs are learned under the framework of generative
+adversarial networks (GANs) and the change points are detected based on the
+output of the GAN discriminator in a forward pass. At each step of the proposed
+algorithm, the change points and the SDE model parameters are updated in an
+alternating fashion. Numerical results on both synthetic and real datasets are
+provided to validate the performance of our algorithm in comparison to
+classical change point detection benchmarks, standard GAN-based neural SDEs,
+and other state-of-the-art deep generative models for time series data.
+
+
+
+
+
+
+
+ ☆ Underwater Acoustic Signal Recognition Based on Salient Features
+
+
+ With the rapid advancement of technology, the recognition of underwater
+acoustic signals in complex environments has become increasingly crucial.
+Currently, mainstream underwater acoustic signal recognition relies primarily
+on time-frequency analysis to extract spectral features, finding widespread
+applications in the field. However, existing recognition methods heavily depend
+on expert systems, facing limitations such as restricted knowledge bases and
+challenges in handling complex relationships. These limitations stem from the
+complexity and maintenance difficulties associated with rules or inference
+engines. Recognizing the potential advantages of deep learning in handling
+intricate relationships, this paper proposes a method utilizing neural networks
+for underwater acoustic signal recognition. The proposed approach involves
+continual learning of features extracted from spectra for the classification of
+underwater acoustic signals. Deep learning models can automatically learn
+abstract features from data and continually adjust weights during training to
+enhance classification performance.
+
+
+
+
+
+
+
+ ☆ Augment on Manifold: Mixup Regularization with UMAP
+
+
+ Data augmentation techniques play an important role in enhancing the
+performance of deep learning models. Despite their proven benefits in computer
+vision tasks, their application in the other domains remains limited. This
+paper proposes a Mixup regularization scheme, referred to as UMAP Mixup,
+designed for "on-manifold" automated data augmentation for deep learning
+predictive models. The proposed approach ensures that the Mixup operations
+result in synthesized samples that lie on the data manifold of the features and
+labels by utilizing a dimensionality reduction technique known as uniform
+manifold approximation and projection. Evaluations across diverse regression
+tasks show that UMAP Mixup is competitive with or outperforms other Mixup
+variants, show promise for its potential as an effective tool for enhancing the
+generalization performance of deep learning models.
+
+
+ Graph neural networks (GNNs) have demonstrated promising performance across
+various chemistry-related tasks. However, conventional graphs only model the
+pairwise connectivity in molecules, failing to adequately represent
+higher-order connections like multi-center bonds and conjugated structures. To
+tackle this challenge, we introduce molecular hypergraphs and propose Molecular
+Hypergraph Neural Networks (MHNN) to predict the optoelectronic properties of
+organic semiconductors, where hyperedges represent conjugated structures. A
+general algorithm is designed for irregular high-order connections, which can
+efficiently operate on molecular hypergraphs with hyperedges of various orders.
+The results show that MHNN outperforms all baseline models on most tasks of
+OPV, OCELOTv1 and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D
+geometric information, surpassing the baseline model that utilizes atom
+positions. Moreover, MHNN achieves better performance than pretrained GNNs
+under limited training data, underscoring its excellent data efficiency. This
+work provides a new strategy for more general molecular representations and
+property prediction tasks related to high-order connections.
+
+
+
+
+
+
+
+ ☆ Scaling Compute Is Not All You Need for Adversarial Robustness
+
+
+ The last six years have witnessed significant progress in adversarially
+robust deep learning. As evidenced by the CIFAR-10 dataset category in
+RobustBench benchmark, the accuracy under $\ell_\infty$ adversarial
+perturbations improved from 44\% in \citet{Madry2018Towards} to 71\% in
+\citet{peng2023robust}. Although impressive, existing state-of-the-art is still
+far from satisfactory. It is further observed that best-performing models are
+often very large models adversarially trained by industrial labs with
+significant computational budgets. In this paper, we aim to understand: ``how
+much longer can computing power drive adversarial robustness advances?" To
+answer this question, we derive \emph{scaling laws for adversarial robustness}
+which can be extrapolated in the future to provide an estimate of how much cost
+we would need to pay to reach a desired level of robustness. We show that
+increasing the FLOPs needed for adversarial training does not bring as much
+advantage as it does for standard training in terms of performance
+improvements. Moreover, we find that some of the top-performing techniques are
+difficult to exactly reproduce, suggesting that they are not robust enough for
+minor changes in the training setup. Our analysis also uncovers potentially
+worthwhile directions to pursue in future research. Finally, we make our
+benchmarking framework (built on top of \texttt{timm}~\citep{rw2019timm})
+publicly available to facilitate future analysis in efficient robust deep
+learning.
+
+
+ To address the needs of modeling uncertainty in sensitive machine learning
+applications, the setup of distributionally robust optimization (DRO) seeks
+good performance uniformly across a variety of tasks. The recent
+multi-distribution learning (MDL) framework tackles this objective in a dynamic
+interaction with the environment, where the learner has sampling access to each
+target distribution. Drawing inspiration from the field of pure-exploration
+multi-armed bandits, we provide distribution-dependent guarantees in the MDL
+regime, that scale with suboptimality gaps and result in superior dependence on
+the sample size when compared to the existing distribution-independent
+analyses. We investigate two non-adaptive strategies, uniform and non-uniform
+exploration, and present non-asymptotic regret bounds using novel tools from
+empirical process theory. Furthermore, we devise an adaptive optimistic
+algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring
+the contrast between uniform and optimistic allocation in the multi-armed
+bandit literature.
+
+
+ The rampant occurrence of cybersecurity breaches imposes substantial
+limitations on the progress of network infrastructures, leading to compromised
+data, financial losses, potential harm to individuals, and disruptions in
+essential services. The current security landscape demands the urgent
+development of a holistic security assessment solution that encompasses
+vulnerability analysis and investigates the potential exploitation of these
+vulnerabilities as attack paths. In this paper, we propose Prometheus, an
+advanced system designed to provide a detailed analysis of the security posture
+of computing infrastructures. Using user-provided information, such as device
+details and software versions, Prometheus performs a comprehensive security
+assessment. This assessment includes identifying associated vulnerabilities and
+constructing potential attack graphs that adversaries can exploit. Furthermore,
+Prometheus evaluates the exploitability of these attack paths and quantifies
+the overall security posture through a scoring mechanism. The system takes a
+holistic approach by analyzing security layers encompassing hardware, system,
+network, and cryptography. Furthermore, Prometheus delves into the
+interconnections between these layers, exploring how vulnerabilities in one
+layer can be leveraged to exploit vulnerabilities in others. In this paper, we
+present the end-to-end pipeline implemented in Prometheus, showcasing the
+systematic approach adopted for conducting this thorough security analysis.
+
+
+ The transferability of adversarial examples is of central importance to
+transfer-based black-box adversarial attacks. Previous works for generating
+transferable adversarial examples focus on attacking \emph{given} pretrained
+surrogate models while the connections between surrogate models and adversarial
+trasferability have been overlooked. In this paper, we propose {\em Lipschitz
+Regularized Surrogate} (LRS) for transfer-based black-box attacks, a novel
+approach that transforms surrogate models towards favorable adversarial
+transferability. Using such transformed surrogate models, any existing
+transfer-based black-box attack can run without any change, yet achieving much
+better performance. Specifically, we impose Lipschitz regularization on the
+loss landscape of surrogate models to enable a smoother and more controlled
+optimization process for generating more transferable adversarial examples. In
+addition, this paper also sheds light on the connection between the inner
+properties of surrogate models and adversarial transferability, where three
+factors are identified: smaller local Lipschitz constant, smoother loss
+landscape, and stronger adversarial robustness. We evaluate our proposed LRS
+approach by attacking state-of-the-art standard deep neural networks and
+defense models. The results demonstrate significant improvement on the attack
+success rates and transferability. Our code is available at
+https://github.com/TrustAIoT/LRS.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ Pre-training of Molecular GNNs as Conditional Boltzmann Generator AAAI
+
+
+ Learning representations of molecular structures using deep learning is a
+fundamental problem in molecular property prediction tasks. Molecules
+inherently exist in the real world as three-dimensional structures;
+furthermore, they are not static but in continuous motion in the 3D Euclidean
+space, forming a potential energy surface. Therefore, it is desirable to
+generate multiple conformations in advance and extract molecular
+representations using a 4D-QSAR model that incorporates multiple conformations.
+However, this approach is impractical for drug and material discovery tasks
+because of the computational cost of obtaining multiple conformations. To
+address this issue, we propose a pre-training method for molecular GNNs using
+an existing dataset of molecular conformations to generate a latent vector
+universal to multiple conformations from a 2D molecular graph. Our method,
+called Boltzmann GNN, is formulated by maximizing the conditional marginal
+likelihood of a conditional generative model for conformations generation. We
+show that our model has a better prediction performance for molecular
+properties than existing pre-training methods using molecular graphs and
+three-dimensional molecular structures.
+
+
+
+ comment: 4 pages. Short paper submitted to AAAI workshop (AI2ASE) 2023
+
+
+
+
+
+
+ ☆ MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using
+ Differentiable Shading
+
+
+
+
+
+
+
+
+ Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amin Fadaeinejad, Rafael M. O. Cruz, Marc-Andre Carbonneau
+
+
+ Reconstructing an avatar from a portrait image has many applications in
+multimedia, but remains a challenging research problem. Extracting reflectance
+maps and geometry from one image is ill-posed: recovering geometry is a
+one-to-many mapping problem and reflectance and light are difficult to
+disentangle. Accurate geometry and reflectance can be captured under the
+controlled conditions of a light stage, but it is costly to acquire large
+datasets in this fashion. Moreover, training solely with this type of data
+leads to poor generalization with in-the-wild images. This motivates the
+introduction of MoSAR, a method for 3D avatar generation from monocular images.
+We propose a semi-supervised training scheme that improves generalization by
+learning from both light stage and in-the-wild datasets. This is achieved using
+a novel differentiable shading formulation. We show that our approach
+effectively disentangles the intrinsic face parameters, producing relightable
+avatars. As a result, MoSAR estimates a richer set of skin reflectance maps,
+and generates more realistic avatars than existing state-of-the-art methods. We
+also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public
+dataset providing intrisic face attributes at scale (diffuse, specular, ambient
+occlusion and translucency maps) for a total of 10k subjects. The project
+website and the dataset are available on the following link:
+https://ubisoftlaforge.github.io/character/mosar
+
+
+ Users in many domains use machine learning (ML) predictions to help them make
+decisions. Effective ML-based decision-making often requires explanations of ML
+models and their predictions. While there are many algorithms that explain
+models, generating explanations in a format that is comprehensible and useful
+to decision-makers is a nontrivial task that can require extensive development
+overhead. We developed Pyreal, a highly extensible system with a corresponding
+Python implementation for generating a variety of interpretable ML
+explanations. Pyreal converts data and explanations between the feature spaces
+expected by the model, relevant explanation algorithms, and human users,
+allowing users to generate interpretable explanations in a low-code manner. Our
+studies demonstrate that Pyreal generates more useful explanations than
+existing systems while remaining both easy-to-use and efficient.
+
+
+
+ comment: 12 pages, 10 figures, 4 tables
+
+
+
+
+
+
+ ☆ Continuous-time Graph Representation with Sequential Survival Process AAAI
+
+
+ Over the past two decades, there has been a tremendous increase in the growth
+of representation learning methods for graphs, with numerous applications
+across various fields, including bioinformatics, chemistry, and the social
+sciences. However, current dynamic network approaches focus on discrete-time
+networks or treat links in continuous-time networks as instantaneous events.
+Therefore, these approaches have limitations in capturing the persistence or
+absence of links that continuously emerge and disappear over time for
+particular durations. To address this, we propose a novel stochastic process
+relying on survival functions to model the durations of links and their
+absences over time. This forms a generic new likelihood specification
+explicitly accounting for intermittent edge-persistent networks, namely GraSSP:
+Graph Representation with Sequential Survival Process. We apply the developed
+framework to a recent continuous time dynamic latent distance model
+characterizing network dynamics in terms of a sequence of piecewise linear
+movements of nodes in latent space. We quantitatively assess the developed
+framework in various downstream tasks, such as link prediction and network
+completion, demonstrating that the developed modeling framework accounting for
+link persistence and absence well tracks the intrinsic trajectories of nodes in
+a latent space and captures the underlying characteristics of evolving network
+structure.
+
+
+
+ comment: Accepted to the 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI24), Vancouver, British Columbia, 2024
+
+
+
+
+
+
+ ☆ AutoXPCR: Automated Multi-Objective Model Selection for Time Series
+ Forecasting
+
+
+ Automated machine learning (AutoML) streamlines the creation of ML models.
+While most methods select the "best" model based on predictive quality, it's
+crucial to acknowledge other aspects, such as interpretability and resource
+consumption. This holds particular importance in the context of deep neural
+networks (DNNs), as these models are often perceived as computationally
+intensive black boxes. In the challenging domain of time series forecasting,
+DNNs achieve stunning results, but specialized approaches for automatically
+selecting models are scarce. In this paper, we propose AutoXPCR - a novel
+method for automated and explainable multi-objective model selection. Our
+approach leverages meta-learning to estimate any model's performance along PCR
+criteria, which encompass (P)redictive error, (C)omplexity, and (R)esource
+demand. Explainability is addressed on multiple levels, as our interactive
+framework can prioritize less complex models and provide by-product
+explanations of recommendations. We demonstrate practical feasibility by
+deploying AutoXPCR on over 1000 configurations across 114 data sets from
+various domains. Our method clearly outperforms other model selection
+approaches - on average, it only requires 20% of computation costs for
+recommending models with 90% of the best-possible quality.
+
+
+ In this study, we present a deep learning-based approach for time-series
+respiration data classification. The dataset contains regular breathing
+patterns as well as various forms of abnormal breathing, obtained through
+non-contact incoherent light-wave sensing (LWS) technology. Given the
+one-dimensional (1D) nature of the data, we employed a 1D convolutional neural
+network (1D-CNN) for classification purposes. Genetic algorithm was employed to
+optimize the 1D-CNN architecture to maximize classification accuracy.
+Addressing the computational complexity associated with training the 1D-CNN
+across multiple generations, we implemented transfer learning from a
+pre-trained model. This approach significantly reduced the computational time
+required for training, thereby enhancing the efficiency of the optimization
+process. This study contributes valuable insights into the potential
+applications of deep learning methodologies for enhancing respiratory anomaly
+detection through precise and efficient respiration classification.
+
+
+
+ comment: 7 pages, 8 figures, to be submitted to IEEE conference
+
+
+
+
+
+
+ ☆ Explainable artificial intelligence approaches for brain-computer
+ interfaces: a review and design space
+
+
+ This review paper provides an integrated perspective of Explainable
+Artificial Intelligence techniques applied to Brain-Computer Interfaces. BCIs
+use predictive models to interpret brain signals for various high-stake
+applications. However, achieving explainability in these complex models is
+challenging as it compromises accuracy. The field of XAI has emerged to address
+the need for explainability across various stakeholders, but there is a lack of
+an integrated perspective in XAI for BCI (XAI4BCI) literature. It is necessary
+to differentiate key concepts like explainability, interpretability, and
+understanding in this context and formulate a comprehensive framework. To
+understand the need of XAI for BCI, we pose six key research questions for a
+systematic review and meta-analysis, encompassing its purposes, applications,
+usability, and technical feasibility. We employ the PRISMA methodology --
+preferred reporting items for systematic reviews and meta-analyses to review
+(n=1246) and analyze (n=84) studies published in 2015 and onwards for key
+insights. The results highlight that current research primarily focuses on
+interpretability for developers and researchers, aiming to justify outcomes and
+enhance model performance. We discuss the unique approaches, advantages, and
+limitations of XAI4BCI from the literature. We draw insights from philosophy,
+psychology, and social sciences. We propose a design space for XAI4BCI,
+considering the evolving need to visualize and investigate predictive model
+outcomes customised for various stakeholders in the BCI development and
+deployment lifecycle. This paper is the first to focus solely on reviewing
+XAI4BCI research articles. This systematic review and meta-analysis findings
+with the proposed design space prompt important discussions on establishing
+standards for BCI explanations, highlighting current limitations, and guiding
+the future of XAI in BCI.
+
+
+
+
+
+
+
+
+ Weigang Lu, Ziyu Guan, Wei Zhao, Long Jin
+
+
+ Graph Neural Networks (GNNs) have become mainstream methods for solving the
+semi-supervised node classification problem. However, due to the uneven
+location distribution of labeled nodes in the graph, labeled nodes are only
+accessible to a small portion of unlabeled nodes, leading to the
+\emph{under-reaching} issue. In this study, we firstly reveal under-reaching by
+conducting an empirical investigation on various well-known graphs. Then, we
+demonstrate that under-reaching results in unsatisfactory distribution
+alignment between labeled and unlabeled nodes through systematic experimental
+analysis, significantly degrading GNNs' performance. To tackle under-reaching
+for GNNs, we propose an architecture-agnostic method dubbed NodeMixup. The
+fundamental idea is to (1) increase the reachability of labeled nodes by
+labeled-unlabeled pairs mixup, (2) leverage graph structures via fusing the
+neighbor connections of intra-class node pairs to improve performance gains of
+mixup, and (3) use neighbor label distribution similarity incorporating node
+degrees to determine sampling weights for node mixup. Extensive experiments
+demonstrate the efficacy of NodeMixup in assisting GNNs in handling
+under-reaching. The source code is available at
+\url{https://github.com/WeigangLu/NodeMixup}.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ A self-attention-based differentially private tabular GAN with high data
+ utility
+
+
+ Generative Adversarial Networks (GANs) have become a ubiquitous technology
+for data generation, with their prowess in image generation being
+well-established. However, their application in generating tabular data has
+been less than ideal. Furthermore, attempting to incorporate differential
+privacy technology into these frameworks has often resulted in a degradation of
+data utility. To tackle these challenges, this paper introduces DP-SACTGAN, a
+novel Conditional Generative Adversarial Network (CGAN) framework for
+differentially private tabular data generation, aiming to surmount these
+obstacles. Experimental findings demonstrate that DP-SACTGAN not only
+accurately models the distribution of the original data but also effectively
+satisfies the requirements of differential privacy.
+
+
+
+
+
+
+
+
+ Byung Hyun Lee, Min-hwan Oh, Se Young Chun
+
+
+ Task-free online continual learning (TF-CL) is a challenging problem where
+the model incrementally learns tasks without explicit task information.
+Although training with entire data from the past, present as well as future is
+considered as the gold standard, naive approaches in TF-CL with the current
+samples may be conflicted with learning with samples in the future, leading to
+catastrophic forgetting and poor plasticity. Thus, a proactive consideration of
+an unseen future sample in TF-CL becomes imperative. Motivated by this
+intuition, we propose a novel TF-CL framework considering future samples and
+show that injecting adversarial perturbations on both input data and
+decision-making is effective. Then, we propose a novel method named Doubly
+Perturbed Continual Learning (DPCL) to efficiently implement these input and
+decision-making perturbations. Specifically, for input perturbation, we propose
+an approximate perturbation method that injects noise into the input data as
+well as the feature vector and then interpolates the two perturbed samples. For
+decision-making process perturbation, we devise multiple stochastic
+classifiers. We also investigate a memory management scheme and learning rate
+scheduling reflecting our proposed double perturbations. We demonstrate that
+our proposed method outperforms the state-of-the-art baseline methods by large
+margins on various TF-CL benchmarks.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ No More Shortcuts: Realizing the Potential of Temporal Self-Supervision AAAI 2024
+
+
+
+
+
+
+
+
+ Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
+
+
+ Self-supervised approaches for video have shown impressive results in video
+understanding tasks. However, unlike early works that leverage temporal
+self-supervision, current state-of-the-art methods primarily rely on tasks from
+the image domain (e.g., contrastive learning) that do not explicitly promote
+the learning of temporal features. We identify two factors that limit existing
+temporal self-supervision: 1) tasks are too simple, resulting in saturated
+training performance, and 2) we uncover shortcuts based on local appearance
+statistics that hinder the learning of high-level features. To address these
+issues, we propose 1) a more challenging reformulation of temporal
+self-supervision as frame-level (rather than clip-level) recognition tasks and
+2) an effective augmentation strategy to mitigate shortcuts. Our model extends
+a representation of single video frames, pre-trained through contrastive
+learning, with a transformer that we train through temporal self-supervision.
+We demonstrate experimentally that our more challenging frame-level task
+formulations and the removal of shortcuts drastically improve the quality of
+features learned through temporal self-supervision. The generalization
+capability of our self-supervised video method is evidenced by its
+state-of-the-art performance in a wide range of high-level semantic tasks,
+including video retrieval, action classification, and video attribute
+recognition (such as object and scene identification), as well as low-level
+temporal correspondence tasks like video object segmentation and pose tracking.
+Additionally, we show that the video representations learned through our method
+exhibit increased robustness to the input perturbations.
+
+
+
+ comment: AAAI 2024 (Main Technical Track)
+
+
+
+
+
+
+ ☆ Benchmarking and Analyzing In-context Learning, Fine-tuning and
+ Supervised Learning for Biomedical Knowledge Curation: a focused study on
+ chemical entities of biological interest
+
+
+ Automated knowledge curation for biomedical ontologies is key to ensure that
+they remain comprehensive, high-quality and up-to-date. In the era of
+foundational language models, this study compares and analyzes three NLP
+paradigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and
+supervised learning (ML). Using the Chemical Entities of Biological Interest
+(ChEBI) database as a model ontology, three curation tasks were devised. For
+ICL, three prompting strategies were employed with GPT-4, GPT-3.5, BioGPT.
+PubmedBERT was chosen for the FT paradigm. For ML, six embedding models were
+utilized for training Random Forest and Long-Short Term Memory models. Five
+setups were designed to assess ML and FT model performance across different
+data availability scenarios.Datasets for curation tasks included: task 1
+(620,386), task 2 (611,430), and task 3 (617,381), maintaining a 50:50 positive
+versus negative ratio. For ICL models, GPT-4 achieved best accuracy scores of
+0.916, 0.766 and 0.874 for tasks 1-3 respectively. In a direct comparison, ML
+(trained on ~260,000 triples) outperformed ICL in accuracy across all tasks.
+(accuracy differences: +.11, +.22 and +.17). Fine-tuned PubmedBERT performed
+similarly to leading ML models in tasks 1 & 2 (F1 differences: -.014 and
++.002), but worse in task 3 (-.048). Simulations revealed performance declines
+in both ML and FT models with smaller and higher imbalanced training data.
+where ICL (particularly GPT-4) excelled in tasks 1 & 3. GPT-4 excelled in tasks
+1 and 3 with less than 6,000 triples, surpassing ML/FT. ICL underperformed
+ML/FT in task 2.ICL-augmented foundation models can be good assistants for
+knowledge curation with correct prompting, however, not making ML and FT
+paradigms obsolete. The latter two require task-specific data to beat ICL. In
+such cases, ML relies on small pretrained embeddings, minimizing computational
+demands.
+
+
+
+ comment: 26 pages, 5 figures, 14 tables
+
+
+
+
+
+
+ ☆ Collaborative Optimization of the Age of Information under Partial
+ Observability
+
+
+
+
+
+
+
+
+ Anam Tahir, Kai Cui, Bastian Alt, Amr Rizk, Heinz Koeppl
+
+
+ The significance of the freshness of sensor and control data at the receiver
+side, often referred to as Age of Information (AoI), is fundamentally
+constrained by contention for limited network resources. Evidently, network
+congestion is detrimental for AoI, where this congestion is partly self-induced
+by the sensor transmission process in addition to the contention from other
+transmitting sensors. In this work, we devise a decentralized AoI-minimizing
+transmission policy for a number of sensor agents sharing capacity-limited,
+non-FIFO duplex channels that introduce random delays in communication with a
+common receiver. By implementing the same policy, however with no explicit
+inter-agent communication, the agents minimize the expected AoI in this
+partially observable system. We cater to the partial observability due to
+random channel delays by designing a bootstrap particle filter that
+independently maintains a belief over the AoI of each agent. We also leverage
+mean-field control approximations and reinforcement learning to derive scalable
+and optimal solutions for minimizing the expected AoI collaboratively.
+
+
+
+
+
+
+
+ ☆ Sparse Mean Field Load Balancing in Large Localized Queueing Systems
+
+
+ Scalable load balancing algorithms are of great interest in cloud networks
+and data centers, necessitating the use of tractable techniques to compute
+optimal load balancing policies for good performance. However, most existing
+scalable techniques, especially asymptotically scaling methods based on mean
+field theory, have not been able to model large queueing networks with strong
+locality. Meanwhile, general multi-agent reinforcement learning techniques can
+be hard to scale and usually lack a theoretical foundation. In this work, we
+address this challenge by leveraging recent advances in sparse mean field
+theory to learn a near-optimal load balancing policy in sparsely connected
+queueing networks in a tractable manner, which may be preferable to global
+approaches in terms of communication overhead. Importantly, we obtain a general
+load balancing framework for a large class of sparse bounded-degree topologies.
+By formulating a novel mean field control problem in the context of graphs with
+bounded degree, we reduce the otherwise difficult multi-agent problem to a
+single-agent problem. Theoretically, the approach is justified by approximation
+guarantees. Empirically, the proposed methodology performs well on several
+realistic and scalable network topologies. Moreover, we compare it with a
+number of well-known load balancing heuristics and with existing scalable
+multi-agent reinforcement learning methods. Overall, we obtain a tractable
+approach for load balancing in highly localized networks.
+
+
+
+
+
+
+
+ ☆ From Past to Future: Rethinking Eligibility Traces AAAI
+
+
+
+
+
+
+
+
+ Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
+
+
+ In this paper, we introduce a fresh perspective on the challenges of credit
+assignment and policy evaluation. First, we delve into the nuances of
+eligibility traces and explore instances where their updates may result in
+unexpected credit assignment to preceding states. From this investigation
+emerges the concept of a novel value function, which we refer to as the
+\emph{bidirectional value function}. Unlike traditional state value functions,
+bidirectional value functions account for both future expected returns (rewards
+anticipated from the current state onward) and past expected returns
+(cumulative rewards from the episode's start to the present). We derive
+principled update equations to learn this value function and, through
+experimentation, demonstrate its efficacy in enhancing the process of policy
+evaluation. In particular, our results indicate that the proposed learning
+approach can, in certain challenging contexts, perform policy evaluation more
+rapidly than TD($\lambda$) -- a method that learns forward value functions,
+$v^\pi$, \emph{directly}. Overall, our findings present a new perspective on
+eligibility traces and potential advantages associated with the novel value
+function it inspires, especially for policy evaluation.
+
+
+
+ comment: Accepted in The 38th Annual AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+ ☆ Class Conditional Time Series Generation with Structured Noise Space GAN
+
+
+
+
+
+
+
+
+ Hamidreza Gholamrezaei, Alireza Koochali, Andreas Dengel, Sheraz Ahmed
+
+
+ This paper introduces Structured Noise Space GAN (SNS-GAN), a novel approach
+in the field of generative modeling specifically tailored for class-conditional
+generation in both image and time series data. It addresses the challenge of
+effectively integrating class labels into generative models without requiring
+structural modifications to the network. The SNS-GAN method embeds class
+conditions within the generator's noise space, simplifying the training process
+and enhancing model versatility. The model's efficacy is demonstrated through
+qualitative validations in the image domain and superior performance in time
+series generation compared to baseline models. This research opens new avenues
+for the application of GANs in various domains, including but not limited to
+time series and image data generation.
+
+
+ This study investigates the misclassification excess risk bound in the
+context of 1-bit matrix completion, a significant problem in machine learning
+involving the recovery of an unknown matrix from a limited subset of its
+entries. Matrix completion has garnered considerable attention in the last two
+decades due to its diverse applications across various fields. Unlike
+conventional approaches that deal with real-valued samples, 1-bit matrix
+completion is concerned with binary observations. While prior research has
+predominantly focused on the estimation error of proposed estimators, our study
+shifts attention to the prediction error. This paper offers theoretical
+analysis regarding the prediction errors of two previous works utilizing the
+logistic regression model: one employing a max-norm constrained minimization
+and the other employing nuclear-norm penalization. Significantly, our findings
+demonstrate that the latter achieves the minimax-optimal rate without the need
+for an additional logarithmic term. These novel results contribute to a deeper
+understanding of 1-bit matrix completion by shedding light on the predictive
+performance of specific methodologies.
+
+
+
+
+
+
+
+ ☆ Robust Loss Functions for Training Decision Trees with Noisy Labels AAAI
+
+
+ We consider training decision trees using noisily labeled data, focusing on
+loss functions that can lead to robust learning algorithms. Our contributions
+are threefold. First, we offer novel theoretical insights on the robustness of
+many existing loss functions in the context of decision tree learning. We show
+that some of the losses belong to a class of what we call conservative losses,
+and the conservative losses lead to an early stopping behavior during training
+and noise-tolerant predictions during testing. Second, we introduce a framework
+for constructing robust loss functions, called distribution losses. These
+losses apply percentile-based penalties based on an assumed margin
+distribution, and they naturally allow adapting to different noise rates via a
+robustness parameter. In particular, we introduce a new loss called the
+negative exponential loss, which leads to an efficient greedy
+impurity-reduction learning algorithm. Lastly, our experiments on multiple
+datasets and noise settings validate our theoretical insight and the
+effectiveness of our adaptive negative exponential loss.
+
+
+
+ comment: Accepted at AAAI Conference on Artificial Intelligence 2024
+
+
+
+
+
+
+ ☆ Stability of Graph Convolutional Neural Networks through the lens of
+ small perturbation analysis ICASSP 2024
+
+
+ In this work, we study the problem of stability of Graph Convolutional Neural
+Networks (GCNs) under random small perturbations in the underlying graph
+topology, i.e. under a limited number of insertions or deletions of edges. We
+derive a novel bound on the expected difference between the outputs of
+unperturbed and perturbed GCNs. The proposed bound explicitly depends on the
+magnitude of the perturbation of the eigenpairs of the Laplacian matrix, and
+the perturbation explicitly depends on which edges are inserted or deleted.
+Then, we provide a quantitative characterization of the effect of perturbing
+specific edges on the stability of the network. We leverage tools from small
+perturbation analysis to express the bounds in closed, albeit approximate,
+form, in order to enhance interpretability of the results, without the need to
+compute any perturbed shift operator. Finally, we numerically evaluate the
+effectiveness of the proposed bound.
+
+
+
+ comment: Accepted for publication in Proc. of 2024 IEEE International
+ Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
+
+
+
+
+
+
+ ☆ Energy-efficient Spiking Neural Network Equalization for IM/DD Systems
+ with Optimized Neural Encoding
+
+
+
+
+
+
+
+
+ Alexander von Bank, Eike-Manuel Edelmann, Laurent Schmalen
+
+
+ We propose an energy-efficient equalizer for IM/DD systems based on spiking
+neural networks. We optimize a neural spike encoding that boosts the
+equalizer's performance while decreasing energy consumption.
+
+
+
+ comment: Accepted for publication at OFC 2024
+
+
+
+
+
+
+ ☆ PGN: A perturbation generation network against deep reinforcement
+ learning
+
+
+
+
+
+
+
+
+ Xiangjuan Li, Feifan Li, Yang Li, Quan Pan
+
+
+ Deep reinforcement learning has advanced greatly and applied in many areas.
+In this paper, we explore the vulnerability of deep reinforcement learning by
+proposing a novel generative model for creating effective adversarial examples
+to attack the agent. Our proposed model can achieve both targeted attacks and
+untargeted attacks. Considering the specificity of deep reinforcement learning,
+we propose the action consistency ratio as a measure of stealthiness, and a new
+measurement index of effectiveness and stealthiness. Experiment results show
+that our method can ensure the effectiveness and stealthiness of attack
+compared with other algorithms. Moreover, our methods are considerably faster
+and thus can achieve rapid and efficient verification of the vulnerability of
+deep reinforcement learning.
+
+
+
+
+
+
+
+ ☆ A Minimal Control Family of Dynamical Syetem for Universal Approximation
+
+
+ The universal approximation property (UAP) of neural networks is a
+fundamental characteristic of deep learning. It is widely recognized that a
+composition of linear functions and non-linear functions, such as the rectified
+linear unit (ReLU) activation function, can approximate continuous functions on
+compact domains. In this paper, we extend this efficacy to the scenario of
+dynamical systems with controls. We prove that the control family
+$\mathcal{F}_1 = \mathcal{F}_0 \cup \{ \text{ReLU}(\cdot)\} $ is enough to
+generate flow maps that can uniformly approximate diffeomorphisms of
+$\mathbb{R}^d$ on any compact domain, where $\mathcal{F}_0 = \{x \mapsto Ax+b:
+A\in \mathbb{R}^{d\times d}, b \in \mathbb{R}^d\}$ is the set of linear maps
+and the dimension $d\ge2$. Since $\mathcal{F}_1$ contains only one nonlinear
+function and $\mathcal{F}_0$ does not hold the UAP, we call $\mathcal{F}_1$ a
+minimal control family for UAP. Based on this, some sufficient conditions, such
+as the affine invariance, on the control family are established and discussed.
+Our result reveals an underlying connection between the approximation power of
+neural networks and control systems.
+
+
+
+ comment: 19 pages
+
+
+
+
+
+
+ ☆ BSL: Understanding and Improving Softmax Loss for Recommendation
+
+
+ Loss functions steer the optimization direction of recommendation models and
+are critical to model performance, but have received relatively little
+attention in recent recommendation research. Among various losses, we find
+Softmax loss (SL) stands out for not only achieving remarkable accuracy but
+also better robustness and fairness. Nevertheless, the current literature lacks
+a comprehensive explanation for the efficacy of SL. Toward addressing this
+research gap, we conduct theoretical analyses on SL and uncover three insights:
+1) Optimizing SL is equivalent to performing Distributionally Robust
+Optimization (DRO) on the negative data, thereby learning against perturbations
+on the negative distribution and yielding robustness to noisy negatives. 2)
+Comparing with other loss functions, SL implicitly penalizes the prediction
+variance, resulting in a smaller gap between predicted values and and thus
+producing fairer results. Building on these insights, we further propose a
+novel loss function Bilateral SoftMax Loss (BSL) that extends the advantage of
+SL to both positive and negative sides. BSL augments SL by applying the same
+Log-Expectation-Exp structure to positive examples as is used for negatives,
+making the model robust to the noisy positives as well. Remarkably, BSL is
+simple and easy-to-implement -- requiring just one additional line of code
+compared to SL. Experiments on four real-world datasets and three
+representative backbones demonstrate the effectiveness of our proposal. The
+code is available at https://github.com/junkangwu/BSL
+
+
+
+
+
+
+
+ ☆ Testing the Segment Anything Model on radiology data
+
+
+
+
+
+
+
+
+ José Guilherme de Almeida, Nuno M. Rodrigues, Sara Silva, Nickolas Papanikolaou
+
+
+ Deep learning models trained with large amounts of data have become a recent
+and effective approach to predictive problem solving -- these have become known
+as "foundation models" as they can be used as fundamental tools for other
+applications. While the paramount examples of image classification (earlier)
+and large language models (more recently) led the way, the Segment Anything
+Model (SAM) was recently proposed and stands as the first foundation model for
+image segmentation, trained on over 10 million images and with recourse to over
+1 billion masks. However, the question remains -- what are the limits of this
+foundation? Given that magnetic resonance imaging (MRI) stands as an important
+method of diagnosis, we sought to understand whether SAM could be used for a
+few tasks of zero-shot segmentation using MRI data. Particularly, we wanted to
+know if selecting masks from the pool of SAM predictions could lead to good
+segmentations.
+ Here, we provide a critical assessment of the performance of SAM on magnetic
+resonance imaging data. We show that, while acceptable in a very limited set of
+cases, the overall trend implies that these models are insufficient for MRI
+segmentation across the whole volume, but can provide good segmentations in a
+few, specific slices. More importantly, we note that while foundation models
+trained on natural images are set to become key aspects of predictive
+modelling, they may prove ineffective when used on other imaging modalities.
+
+
+
+
+
+
+
+ ☆ Rule-Extraction Methods From Feedforward Neural Networks: A Systematic
+ Literature Review
+
+
+
+
+
+
+
+
+ Sara El Mekkaoui, Loubna Benabbou, Abdelaziz Berrado
+
+
+ Motivated by the interpretability question in ML models as a crucial element
+for the successful deployment of AI systems, this paper focuses on rule
+extraction as a means for neural networks interpretability. Through a
+systematic literature review, different approaches for extracting rules from
+feedforward neural networks, an important block in deep learning models, are
+identified and explored. The findings reveal a range of methods developed for
+over two decades, mostly suitable for shallow neural networks, with recent
+developments to meet deep learning models' challenges. Rules offer a
+transparent and intuitive means of explaining neural networks, making this
+study a comprehensive introduction for researchers interested in the field.
+While the study specifically addresses feedforward networks with supervised
+learning and crisp rules, future work can extend to other network types,
+machine learning methods, and fuzzy rule extraction.
+
+
+
+
+
+
+
+ ☆ Effect Size Estimation for Duration Recommendation in Online
+ Experiments: Leveraging Hierarchical Models and Objective Utility Approaches
+
+
+
+
+
+
+
+
+ Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song
+
+
+ The selection of the assumed effect size (AES) critically determines the
+duration of an experiment, and hence its accuracy and efficiency.
+Traditionally, experimenters determine AES based on domain knowledge. However,
+this method becomes impractical for online experimentation services managing
+numerous experiments, and a more automated approach is hence of great demand.
+We initiate the study of data-driven AES selection in for online
+experimentation services by introducing two solutions. The first employs a
+three-layer Gaussian Mixture Model considering the heteroskedasticity across
+experiments, and it seeks to estimate the true expected effect size among
+positive experiments. The second method, grounded in utility theory, aims to
+determine the optimal effect size by striking a balance between the
+experiment's cost and the precision of decision-making. Through comparisons
+with baseline methods using both simulated and real data, we showcase the
+superior performance of the proposed approaches.
+
+
+
+
+
+
+
+
+ Théo Vincent, Alberto Maria Metelli, Boris Belousov, Jan Peters, Marcello Restelli, Carlo D'Eramo
+
+
+ Approximate value iteration~(AVI) is a family of algorithms for reinforcement
+learning~(RL) that aims to obtain an approximation of the optimal value
+function. Generally, AVI algorithms implement an iterated procedure where each
+step consists of (i) an application of the Bellman operator and (ii) a
+projection step into a considered function space. Notoriously, the Bellman
+operator leverages transition samples, which strongly determine its behavior,
+as uninformative samples can result in negligible updates or long detours,
+whose detrimental effects are further exacerbated by the computationally
+intensive projection step. To address these issues, we propose a novel
+alternative approach based on learning an approximate version of the Bellman
+operator rather than estimating it through samples as in AVI approaches. This
+way, we are able to (i) generalize across transition samples and (ii) avoid the
+computationally intensive projection step. For this reason, we call our novel
+operator projected Bellman operator (PBO). We formulate an optimization problem
+to learn PBO for generic sequential decision-making problems, and we
+theoretically analyze its properties in two representative classes of RL
+problems. Furthermore, we theoretically study our approach under the lens of
+AVI and devise algorithmic implementations to learn PBO in offline and online
+settings by leveraging neural network parameterizations. Finally, we
+empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on
+several RL problems.
+
+
+
+ comment: Proceedings of the National Conference on Artificial Intelligence
+ (AAAI-24)
+
+
+
+
+
+
+ ☆ Federated Learning While Providing Model as a Service: Joint Training
+ and Inference Optimization
+
+
+ While providing machine learning model as a service to process users'
+inference requests, online applications can periodically upgrade the model
+utilizing newly collected data. Federated learning (FL) is beneficial for
+enabling the training of models across distributed clients while keeping the
+data locally. However, existing work has overlooked the coexistence of model
+training and inference under clients' limited resources. This paper focuses on
+the joint optimization of model training and inference to maximize inference
+performance at clients. Such an optimization faces several challenges. The
+first challenge is to characterize the clients' inference performance when
+clients may partially participate in FL. To resolve this challenge, we
+introduce a new notion of age of model (AoM) to quantify client-side model
+freshness, based on which we use FL's global model convergence error as an
+approximate measure of inference performance. The second challenge is the tight
+coupling among clients' decisions, including participation probability in FL,
+model download probability, and service rates. Toward the challenges, we
+propose an online problem approximation to reduce the problem complexity and
+optimize the resources to balance the needs of model training and inference.
+Experimental results demonstrate that the proposed algorithm improves the
+average inference accuracy by up to 12%.
+
+
+
+ comment: Accepted by IEEE International Conference on Computer Communications
+ (INFOCOM) 2024
+
+
+
+
+
+
+ ☆ SkyScript: A Large and Semantically Diverse Vision-Language Dataset for
+ Remote Sensing AAAI 2024
+
+
+ Remote sensing imagery, despite its broad applications in helping achieve
+Sustainable Development Goals and tackle climate change, has not yet benefited
+from the recent advancements of versatile, task-agnostic vision language models
+(VLMs). A key reason is that the large-scale, semantically diverse image-text
+dataset required for developing VLMs is still absent for remote sensing images.
+Unlike natural images, remote sensing images and their associated text
+descriptions cannot be efficiently collected from the public Internet at scale.
+In this work, we bridge this gap by using geo-coordinates to automatically
+connect open, unlabeled remote sensing images with rich semantics covered in
+OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language
+dataset for remote sensing images, comprising 2.6 million image-text pairs
+covering 29K distinct semantic tags. With continual pre-training on this
+dataset, we obtain a VLM that surpasses baseline models with a 6.2% average
+accuracy gain in zero-shot scene classification across seven benchmark
+datasets. It also demonstrates the ability of zero-shot transfer for
+fine-grained object attribute classification and cross-modal retrieval. We hope
+this dataset can support the advancement of VLMs for various multi-modal tasks
+in remote sensing, such as open-vocabulary classification, retrieval,
+captioning, and text-to-image synthesis.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Divergences induced by dual subtractive and divisive normalizations of
+ exponential families and their convex deformations
+
+
+ Exponential families are statistical models which are the workhorses in
+statistics, information theory, and machine learning. An exponential family can
+either be normalized subtractively by its cumulant function or equivalently
+normalized divisively by its partition function. Both subtractive and divisive
+normalizers are strictly convex and smooth functions inducing pairs of Bregman
+and Jensen divergences. It is well-known that skewed Bhattacharryya distances
+between probability densities of an exponential family amounts to skewed Jensen
+divergences induced by the cumulant function between their corresponding
+natural parameters, and in limit cases that the sided Kullback-Leibler
+divergences amount to reverse-sided Bregman divergences. In this note, we first
+show that the $\alpha$-divergences between unnormalized densities of an
+exponential family amounts scaled $\alpha$-skewed Jensen divergences induced by
+the partition function. We then show how comparative convexity with respect to
+a pair of quasi-arithmetic means allows to deform convex functions and define
+dually flat spaces with corresponding divergences when ordinary convexity is
+preserved.
+
+
+
+ comment: 16 pages, 2 figures
+
+
+
+
+
+
+ ☆ Causal Discovery under Identifiable Heteroscedastic Noise Model
+
+
+ Capturing the underlying structural causal relations represented by Directed
+Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines.
+Causal DAG learning via the continuous optimization framework has recently
+achieved promising performance in terms of both accuracy and efficiency.
+However, most methods make strong assumptions of homoscedastic noise, i.e.,
+exogenous noises have equal variances across variables, observations, or even
+both. The noises in real data usually violate both assumptions due to the
+biases introduced by different data collection processes. To address the issue
+of heteroscedastic noise, we introduce relaxed and implementable sufficient
+conditions, proving the identifiability of a general class of SEM subject to
+these conditions. Based on the identifiable general SEM, we propose a novel
+formulation for DAG learning that accounts for the variation in noise variance
+across variables and observations. We then propose an effective two-phase
+iterative DAG learning algorithm to address the increasing optimization
+difficulties and to learn a causal DAG from data with heteroscedastic variable
+noise under varying variance. We show significant empirical gains of the
+proposed approaches over state-of-the-art methods on both synthetic data and
+real data.
+
+
+
+
+
+
+
+
+ Hannah Blocher, Georg Schollmeyer, Malte Nalenz, Christoph Jansen
+
+
+ We propose a framework for descriptively analyzing sets of partial orders
+based on the concept of depth functions. Despite intensive studies in linear
+and metric spaces, there is very little discussion on depth functions for
+non-standard data types such as partial orders. We introduce an adaptation of
+the well-known simplicial depth to the set of all partial orders, the
+union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a
+comparison of machine learning algorithms based on multidimensional performance
+measures. Concretely, we provide two examples of classifier comparisons on
+samples of standard benchmark data sets. Our results demonstrate promisingly
+the wide variety of different analysis approaches based on ufg methods.
+Furthermore, the examples outline that our approach differs substantially from
+existing benchmarking approaches, and thus adds a new perspective to the vivid
+debate on classifier comparison.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2304.09872
+
+
+
+
+
+
+ ☆ FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image
+ Segmentation Against Heterogeneous Annotation Noise AAAI'24
+
+
+ Federated learning (FL) has emerged as a promising paradigm for training
+segmentation models on decentralized medical data, owing to its
+privacy-preserving property. However, existing research overlooks the prevalent
+annotation noise encountered in real-world medical datasets, which limits the
+performance ceilings of FL. In this paper, we, for the first time, identify and
+tackle this problem. For problem formulation, we propose a contour evolution
+for modeling non-independent and identically distributed (Non-IID) noise across
+pixels within each client and then extend it to the case of multi-source data
+to form a heterogeneous noise model (\textit{i.e.}, Non-IID annotation noise
+across clients). For robust learning from annotations with such two-level
+Non-IID noise, we emphasize the importance of data quality in model
+aggregation, allowing high-quality clients to have a greater impact on FL. To
+achieve this, we propose \textbf{Fed}erated learning with \textbf{A}nnotation
+qu\textbf{A}lity-aware \textbf{A}ggregat\textbf{I}on, named \textbf{FedA$^3$I},
+by introducing a quality factor based on client-wise noise estimation.
+Specifically, noise estimation at each client is accomplished through the
+Gaussian mixture model and then incorporated into model aggregation in a
+layer-wise manner to up-weight high-quality clients. Extensive experiments on
+two real-world medical image segmentation datasets demonstrate the superior
+performance of FedA$^3$I against the state-of-the-art approaches in dealing
+with cross-client annotation noise. The code is available at
+\color{blue}{https://github.com/wnn2000/FedAAAI}.
+
+
+
+ comment: Accepted at AAAI'24
+
+
+
+
+
+
+ ☆ Near-Optimal Resilient Aggregation Rules for Distributed Learning Using
+ 1-Center and 1-Mean Clustering with Outliers AAAI
+
+
+ Byzantine machine learning has garnered considerable attention in light of
+the unpredictable faults that can occur in large-scale distributed learning
+systems. The key to secure resilience against Byzantine machines in distributed
+learning is resilient aggregation mechanisms. Although abundant resilient
+aggregation rules have been proposed, they are designed in ad-hoc manners,
+imposing extra barriers on comparing, analyzing, and improving the rules across
+performance criteria. This paper studies near-optimal aggregation rules using
+clustering in the presence of outliers. Our outlier-robust clustering approach
+utilizes geometric properties of the update vectors provided by workers. Our
+analysis show that constant approximations to the 1-center and 1-mean
+clustering problems with outliers provide near-optimal resilient aggregators
+for metric-based criteria, which have been proven to be crucial in the
+homogeneous and heterogeneous cases respectively. In addition, we discuss two
+contradicting types of attacks under which no single aggregation rule is
+guaranteed to improve upon the naive average. Based on the discussion, we
+propose a two-phase resilient aggregation framework. We run experiments for
+image classification using a non-convex loss function. The proposed algorithms
+outperform previously known aggregation rules by a large margin with both
+homogeneous and heterogeneous data distributions among non-faulty workers. Code
+and appendix are available at https://github.com/jerry907/AAAI24-RASHB.
+
+
+
+ comment: 17 pages, 4 figures. Accepted by the 38th Annual AAAI Conference on
+ Artificial Intelligence (AAAI'24)
+
+ Sequential posted pricing auctions are popular because of their simplicity in
+practice and their tractability in theory. A usual assumption in their study is
+that the Bayesian prior distributions of the buyers are known to the seller,
+while in reality these priors can only be accessed from historical data. To
+overcome this assumption, we study sequential posted pricing in the bandit
+learning model, where the seller interacts with $n$ buyers over $T$ rounds: In
+each round the seller posts $n$ prices for the $n$ buyers and the first buyer
+with a valuation higher than the price takes the item. The only feedback that
+the seller receives in each round is the revenue.
+ Our main results obtain nearly-optimal regret bounds for single-item
+sequential posted pricing in the bandit learning model. In particular, we
+achieve an $\tilde{O}(\mathsf{poly}(n)\sqrt{T})$ regret for buyers with
+(Myerson's) regular distributions and an
+$\tilde{O}(\mathsf{poly}(n)T^{{2}/{3}})$ regret for buyers with general
+distributions, both of which are tight in the number of rounds $T$. Our result
+for regular distributions was previously not known even for the single-buyer
+setting and relies on a new half-concavity property of the revenue function in
+the value space. For $n$ sequential buyers, our technique is to run a
+generalized single-buyer algorithm for all the buyers and to carefully bound
+the regret from the sub-optimal pricing of the suffix buyers.
+
+
+
+
+
+
+
+ ☆ Model-Based Control with Sparse Neural Dynamics NeurIPS 2023
+
+
+
+
+
+
+
+
+ Ziang Liu, Genggeng Zhou, Jeff He, Tobia Marcucci, Li Fei-Fei, Jiajun Wu, Yunzhu Li
+
+
+ Learning predictive models from observations using deep neural networks
+(DNNs) is a promising new approach to many real-world planning and control
+problems. However, common DNNs are too unstructured for effective planning, and
+current control methods typically rely on extensive sampling or local gradient
+descent. In this paper, we propose a new framework for integrated model
+learning and predictive control that is amenable to efficient optimization
+algorithms. Specifically, we start with a ReLU neural model of the system
+dynamics and, with minimal losses in prediction accuracy, we gradually sparsify
+it by removing redundant neurons. This discrete sparsification process is
+approximated as a continuous problem, enabling an end-to-end optimization of
+both the model architecture and the weight parameters. The sparsified model is
+subsequently used by a mixed-integer predictive controller, which represents
+the neuron activations as binary variables and employs efficient
+branch-and-bound algorithms. Our framework is applicable to a wide variety of
+DNNs, from simple multilayer perceptrons to complex graph neural dynamics. It
+can efficiently handle tasks involving complicated contact dynamics, such as
+object pushing, compositional object sorting, and manipulation of deformable
+objects. Numerical and hardware experiments show that, despite the aggressive
+sparsification, our framework can deliver better closed-loop performance than
+existing state-of-the-art methods.
+
+
+
+ comment: Accepted at NeurIPS 2023. For tutorial code and additional
+ visualizations, see https://robopil.github.io/Sparse-Dynamics/
+
+
+
+
+
+
+ ☆ SLP-Net:An efficient lightweight network for segmentation of skin
+ lesions
+
+
+
+
+
+
+
+
+ Bo Yang, Hong Peng, Chenggang Guo, Xiaohui Luo, Jun Wang, Xianzhong Long
+
+
+ Prompt treatment for melanoma is crucial. To assist physicians in identifying
+lesion areas precisely in a quick manner, we propose a novel skin lesion
+segmentation technique namely SLP-Net, an ultra-lightweight segmentation
+network based on the spiking neural P(SNP) systems type mechanism. Most
+existing convolutional neural networks achieve high segmentation accuracy while
+neglecting the high hardware cost. SLP-Net, on the contrary, has a very small
+number of parameters and a high computation speed. We design a lightweight
+multi-scale feature extractor without the usual encoder-decoder structure.
+Rather than a decoder, a feature adaptation module is designed to replace it
+and implement multi-scale information decoding. Experiments at the ISIC2018
+challenge demonstrate that the proposed model has the highest Acc and DSC among
+the state-of-the-art methods, while experiments on the PH2 dataset also
+demonstrate a favorable generalization ability. Finally, we compare the
+computational complexity as well as the computational speed of the models in
+experiments, where SLP-Net has the highest overall superiority
+
+
+
+
+
+
+
+ ☆ Fast Cell Library Characterization for Design Technology Co-Optimization
+ Based on Graph Neural Networks
+
+
+ Design technology co-optimization (DTCO) plays a critical role in achieving
+optimal power, performance, and area (PPA) for advanced semiconductor process
+development. Cell library characterization is essential in DTCO flow, but
+traditional methods are time-consuming and costly. To overcome these
+challenges, we propose a graph neural network (GNN)-based machine learning
+model for rapid and accurate cell library characterization. Our model
+incorporates cell structures and demonstrates high prediction accuracy across
+various process-voltage-temperature (PVT) corners and technology parameters.
+Validation with 512 unseen technology corners and over one million test data
+points shows accurate predictions of delay, power, and input pin capacitance
+for 33 types of cells, with a mean absolute percentage error (MAPE) $\le$ 0.95%
+and a speed-up of 100X compared with SPICE simulations. Additionally, we
+investigate system-level metrics such as worst negative slack (WNS), leakage
+power, and dynamic power using predictions obtained from the GNN-based model on
+unseen corners. Our model achieves precise predictions, with absolute error
+$\le$3.0 ps for WNS, percentage errors $\le$0.60% for leakage power, and
+$\le$0.99% for dynamic power, when compared to golden reference. With the
+developed model, we further proposed a fine-grained drive strength
+interpolation methodology to enhance PPA for small-to-medium-scale designs,
+resulting in an approximate 1-3% improvement.
+
+
+
+
+
+
+
+ ☆ DynaLay: An Introspective Approach to Dynamic Layer Selection for Deep
+ Networks
+
+
+ Deep learning models have become increasingly computationally intensive,
+requiring extensive computational resources and time for both training and
+inference. A significant contributing factor to this challenge is the uniform
+computational effort expended on each input example, regardless of its
+complexity. We introduce \textbf{DynaLay}, an alternative architecture that
+features a decision-making agent to adaptively select the most suitable layers
+for processing each input, thereby endowing the model with a remarkable level
+of introspection. DynaLay reevaluates more complex inputs during inference,
+adjusting the computational effort to optimize both performance and efficiency.
+The core of the system is a main model equipped with Fixed-Point Iterative
+(FPI) layers, capable of accurately approximating complex functions, paired
+with an agent that chooses these layers or a direct action based on the
+introspection of the models inner state. The model invests more time in
+processing harder examples, while minimal computation is required for easier
+ones. This introspective approach is a step toward developing deep learning
+models that "think" and "ponder", rather than "ballistically'' produce answers.
+Our experiments demonstrate that DynaLay achieves accuracy comparable to
+conventional deep models while significantly reducing computational demands.
+
+
+
+
+
+
+
+ ☆ Segmenting Messy Text: Detecting Boundaries in Text Derived from
+ Historical Newspaper Images
+
+
+ Text segmentation, the task of dividing a document into sections, is often a
+prerequisite for performing additional natural language processing tasks.
+Existing text segmentation methods have typically been developed and tested
+using clean, narrative-style text with segments containing distinct topics.
+Here we consider a challenging text segmentation task: dividing newspaper
+marriage announcement lists into units of one announcement each. In many cases
+the information is not structured into sentences, and adjacent segments are not
+topically distinct from each other. In addition, the text of the announcements,
+which is derived from images of historical newspapers via optical character
+recognition, contains many typographical errors. As a result, these
+announcements are not amenable to segmentation with existing techniques. We
+present a novel deep learning-based model for segmenting such text and show
+that it significantly outperforms an existing state-of-the-art method on our
+task.
+
+
+
+ comment: 8 pages, 4 figures
+
+
+
+
+
+
+ ☆ ALMANACS: A Simulatability Benchmark for Language Model Explainability
+
+
+
+
+
+
+
+
+ Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons
+
+
+ How do we measure the efficacy of language model explainability methods?
+While many explainability methods have been developed, they are typically
+evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help
+fill this gap, we present ALMANACS, a language model explainability benchmark.
+ALMANACS scores explainability methods on simulatability, i.e., how well the
+explanations improve behavior prediction on new inputs. The ALMANACS scenarios
+span twelve safety-relevant topics such as ethical reasoning and advanced AI
+behaviors; they have idiosyncratic premises to invoke model-specific behavior;
+and they have a train-test distributional shift to encourage faithful
+explanations. By using another language model to predict behavior based on the
+explanations, ALMANACS is a fully automated benchmark. We use ALMANACS to
+evaluate counterfactuals, rationalizations, attention, and Integrated Gradients
+explanations. Our results are sobering: when averaged across all topics, no
+explanation method outperforms the explanation-free control. We conclude that
+despite modest successes in prior work, developing an explanation method that
+aids simulatability in ALMANACS remains an open challenge.
+
+
+
+ comment: Code is available at
+ https://github.com/edmundmills/ALMANACS}{https://github.com/edmundmills/ALMANACS
+
+
+
+
+
+
+ ☆ 3D-CLMI: A Motor Imagery EEG Classification Model via Fusion of 3D-CNN
+ and LSTM with Attention
+
+
+ Due to the limitations in the accuracy and robustness of current
+electroencephalogram (EEG) classification algorithms, applying motor imagery
+(MI) for practical Brain-Computer Interface (BCI) applications remains
+challenging. This paper proposed a model that combined a three-dimensional
+convolutional neural network (CNN) with a long short-term memory (LSTM) network
+with attention to classify MI-EEG signals. This model combined MI-EEG signals
+from different channels into three-dimensional features and extracted spatial
+features through convolution operations with multiple three-dimensional
+convolutional kernels of different scales. At the same time, to ensure the
+integrity of the extracted MI-EEG signal temporal features, the LSTM network
+was directly trained on the preprocessed raw signal. Finally, the features
+obtained from these two networks were combined and used for classification.
+Experimental results showed that this model achieved a classification accuracy
+of 92.7% and an F1-score of 0.91 on the public dataset BCI Competition IV
+dataset 2a, which were both higher than the state-of-the-art models in the
+field of MI tasks. Additionally, 12 participants were invited to complete a
+four-class MI task in our lab, and experiments on the collected dataset showed
+that the 3D-CLMI model also maintained the highest classification accuracy and
+F1-score. The model greatly improved the classification accuracy of users'
+motor imagery intentions, giving brain-computer interfaces better application
+prospects in emerging fields such as autonomous vehicles and medical
+rehabilitation.
+
+
+
+
+
+
+
+ ☆ Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed
+ Gaussian Bandits with Unknown Variances
+
+
+ We address the problem of best arm identification (BAI) with a fixed budget
+for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the
+best arm, an arm with the highest expected reward, through an adaptive
+experiment. Kaufmann et al. (2016) develops a lower bound for the probability
+of misidentifying the best arm. They also propose a strategy, assuming that the
+variances of rewards are known, and show that it is asymptotically optimal in
+the sense that its probability of misidentification matches the lower bound as
+the budget approaches infinity. However, an asymptotically optimal strategy is
+unknown when the variances are unknown. For this open issue, we propose a
+strategy that estimates variances during an adaptive experiment and draws arms
+with a ratio of the estimated standard deviations. We refer to this strategy as
+the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)
+strategy. We then demonstrate that this strategy is asymptotically optimal by
+showing that its probability of misidentification matches the lower bound when
+the budget approaches infinity, and the gap between the expected rewards of two
+arms approaches zero (small-gap regime). Our results suggest that under the
+worst-case scenario characterized by the small-gap regime, our strategy, which
+employs estimated variance, is asymptotically optimal even when the variances
+are unknown.
+
+
+
+
+
+
+
+ ☆ FSscore: A Machine Learning-based Synthetic Feasibility Score Leveraging
+ Human Expertise
+
+
+
+
+
+
+
+
+ Rebecca M. Neeser, Bruno Correia, Philippe Schwaller
+
+
+ Determining whether a molecule can be synthesized is crucial for many aspects
+of chemistry and drug discovery, allowing prioritization of experimental work
+and ranking molecules in de novo design tasks. Existing scoring approaches to
+assess synthetic feasibility struggle to extrapolate to out-of-distribution
+chemical spaces or fail to discriminate based on minor differences such as
+chirality that might be obvious to trained chemists. This work aims to address
+these limitations by introducing the Focused Synthesizability score (FSscore),
+which learns to rank structures based on binary preferences using a graph
+attention network. First, a baseline trained on an extensive set of
+reactant-product pairs is established that subsequently is fine-tuned with
+expert human feedback on a chemical space of interest. Fine-tuning on focused
+datasets improves performance on these chemical scopes over the pre-trained
+model exhibiting moderate performance and generalizability. This enables
+distinguishing hard- from easy-to-synthesize molecules and improving the
+synthetic accessibility of generative model outputs. On very complex scopes
+with limited labels achieving satisfactory gains remains challenging. The
+FSscore showcases how human expert feedback can be utilized to optimize the
+assessment of synthetic feasibility for a variety of applications.
+
+
+
+
+
+
+
+ ☆ Learning and Forgetting Unsafe Examples in Large Language Models
+
+
+
+
+
+
+
+
+ Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
+
+
+ As the number of large language models (LLMs) released to the public grows,
+there is a pressing need to understand the safety implications associated with
+these models learning from third-party custom finetuning data. We explore the
+behavior of LLMs finetuned on noisy custom data containing unsafe content,
+represented by datasets that contain biases, toxicity, and harmfulness, finding
+that while aligned LLMs can readily learn this unsafe content, they also tend
+to forget it more significantly than other examples when subsequently finetuned
+on safer content. Drawing inspiration from the discrepancies in forgetting, we
+introduce the "ForgetFilter" algorithm, which filters unsafe data based on how
+strong the model's forgetting signal is for that data. We demonstrate that the
+ForgetFilter algorithm ensures safety in customized finetuning without
+compromising downstream task performance, unlike sequential safety finetuning.
+ForgetFilter outperforms alternative strategies like replay and moral
+self-correction in curbing LLMs' ability to assimilate unsafe content during
+custom finetuning, e.g. 75% lower than not applying any safety measures and 62%
+lower than using self-correction in toxicity score.
+
+
+
+
+
+
+
+ ☆ Robustly Improving Bandit Algorithms with Confounded and Selection
+ Biased Offline Data: A Causal Approach
+
+
+ This paper studies bandit problems where an agent has access to offline data
+that might be utilized to potentially improve the estimation of each arm's
+reward distribution. A major obstacle in this setting is the existence of
+compound biases from the observational data. Ignoring these biases and blindly
+fitting a model with the biased data could even negatively affect the online
+learning phase. In this work, we formulate this problem from a causal
+perspective. First, we categorize the biases into confounding bias and
+selection bias based on the causal structure they imply. Next, we extract the
+causal bound for each arm that is robust towards compound biases from biased
+observational data. The derived bounds contain the ground truth mean reward and
+can effectively guide the bandit agent to learn a nearly-optimal decision
+policy. We also conduct regret analysis in both contextual and non-contextual
+bandit settings and show that prior causal bounds could help consistently
+reduce the asymptotic regret.
+
+
+
+
+
+
+
+ ☆ Lookahead: An Inference Acceleration Framework for Large Language Model
+ with Lossless Generation Accuracy
+
+
+ As Large Language Models (LLMs) have made significant advancements across
+various tasks, such as question answering, translation, text summarization, and
+dialogue systems, the need for accuracy in information becomes crucial,
+especially for serious financial products serving billions of users like
+Alipay. To address this, Alipay has developed a Retrieval-Augmented Generation
+(RAG) system that grounds LLMs on the most accurate and up-to-date information.
+However, for a real-world product serving millions of users, the inference
+speed of LLMs becomes a critical factor compared to a mere experimental model.
+ Hence, this paper presents a generic framework for accelerating the inference
+process, resulting in a substantial increase in speed and cost reduction for
+our RAG system, with lossless generation accuracy. In the traditional inference
+process, each token is generated sequentially by the LLM, leading to a time
+consumption proportional to the number of generated tokens. To enhance this
+process, our framework, named \textit{lookahead}, introduces a
+\textit{multi-branch} strategy. Instead of generating a single token at a time,
+we propose a \textit{Trie-based Retrieval} (TR) process that enables the
+generation of multiple branches simultaneously, each of which is a sequence of
+tokens. Subsequently, for each branch, a \textit{Verification and Accept} (VA)
+process is performed to identify the longest correct sub-sequence as the final
+output. Our strategy offers two distinct advantages: (1) it guarantees absolute
+correctness of the output, avoiding any approximation algorithms, and (2) the
+worst-case performance of our approach is equivalent to the conventional
+process. We conduct extensive experiments to demonstrate the significant
+improvements achieved by applying our inference acceleration framework.
+
+
+
+ comment: 10 pages, 6 figures
+
+
+
+
+
+
+ ☆ Progressive Poisoned Data Isolation for Training-time Backdoor Defense AAAI2024
+
+
+ Deep Neural Networks (DNN) are susceptible to backdoor attacks where
+malicious attackers manipulate the model's predictions via data poisoning. It
+is hence imperative to develop a strategy for training a clean model using a
+potentially poisoned dataset. Previous training-time defense mechanisms
+typically employ an one-time isolation process, often leading to suboptimal
+isolation outcomes. In this study, we present a novel and efficacious defense
+method, termed Progressive Isolation of Poisoned Data (PIPD), that
+progressively isolates poisoned data to enhance the isolation accuracy and
+mitigate the risk of benign samples being misclassified as poisoned ones. Once
+the poisoned portion of the dataset has been identified, we introduce a
+selective training process to train a clean model. Through the implementation
+of these techniques, we ensure that the trained model manifests a significantly
+diminished attack success rate against the poisoned data. Extensive experiments
+on multiple benchmark datasets and DNN models, assessed against nine
+state-of-the-art backdoor attacks, demonstrate the superior performance of our
+PIPD method for backdoor defense. For instance, our PIPD achieves an average
+True Positive Rate (TPR) of 99.95% and an average False Positive Rate (FPR) of
+0.06% for diverse attacks over CIFAR-10 dataset, markedly surpassing the
+performance of state-of-the-art methods.
+
+
+
+ comment: Accepted to AAAI2024
+
+
+
+
+
+
+ ☆ DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS
+ Channel and DNA Storage
+
+
+
+
+
+
+
+
+ Alan J. X. Guo, Sihan Sun, Xiang Wei, Mengyi Wei, Xin Chen
+
+
+ Recently, DNA storage has emerged as a promising data storage solution,
+offering significant advantages in storage density, maintenance cost
+efficiency, and parallel replication capability. Mathematically, the DNA
+storage pipeline can be viewed as an insertion, deletion, and substitution
+(IDS) channel. Because of the mathematical terra incognita of the Levenshtein
+distance, designing an IDS-correcting code is still a challenge. In this paper,
+we propose an innovative approach that utilizes deep Levenshtein distance
+embedding to bypass these mathematical challenges. By representing the
+Levenshtein distance between two sequences as a conventional distance between
+their corresponding embedding vectors, the inherent structural property of
+Levenshtein distance is revealed in the friendly embedding space. Leveraging
+this embedding space, we introduce the DoDo-Code, an IDS-correcting code that
+incorporates deep embedding of Levenshtein distance, deep embedding-based
+codeword search, and deep embedding-based segment correcting. To address the
+requirements of DNA storage, we also present a preliminary algorithm for long
+sequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting
+code designed using plausible deep learning methodologies, potentially paving
+the way for a new direction in error-correcting code research. It is also the
+first IDS code that exhibits characteristics of being `optimal' in terms of
+redundancy, significantly outperforming the mainstream IDS-correcting codes of
+the Varshamov-Tenengolts code family in code rate.
+
+
+
+
+
+
+
+
+ Yunye Gong, Robik Shrestha, Jared Claypoole, Michael Cogswell, Arijit Ray, Christopher Kanan, Ajay Divakaran
+
+
+ We propose a novel VQA dataset, based on picture stories designed for
+educating young children, that aims to facilitate comprehensive evaluation and
+characterization of vision-language models on comprehension tasks. Unlike
+current VQA datasets that often focus on fact-based memorization and simple
+reasoning tasks without principled scientific grounding, we collect data
+containing tasks reflecting different levels of comprehension and underlying
+cognitive processes, as laid out in Bloom's Taxonomy, a classic framework
+widely adopted in education research. The proposed BloomVQA dataset can be
+mapped to a hierarchical graph-based representation of visual stories, enabling
+automatic data augmentation and novel measures characterizing model consistency
+across the underlying taxonomy. We demonstrate graded evaluation and
+reliability analysis based on our proposed consistency metrics on
+state-of-the-art vision-language models. Our results suggest that, while
+current models achieve the most gain on low-level comprehension tasks, they
+generally fall short on high-level tasks requiring more advanced comprehension
+and cognitive skills, as 38.0% drop in VQA accuracy is observed comparing
+lowest and highest level tasks. Furthermore, current models show consistency
+patterns misaligned with human comprehension in various scenarios, suggesting
+emergent structures of model behaviors.
+
+
+ In this paper we propose a method for the optimal allocation of observations
+between an intrinsically explainable glass box model and a black box model. An
+optimal allocation being defined as one which, for any given explainability
+level (i.e. the proportion of observations for which the explainable model is
+the prediction function), maximizes the performance of the ensemble on the
+underlying task, and maximizes performance of the explainable model on the
+observations allocated to it, subject to the maximal ensemble performance
+condition. The proposed method is shown to produce such explainability optimal
+allocations on a benchmark suite of tabular datasets across a variety of
+explainable and black box model types. These learned allocations are found to
+consistently maintain ensemble performance at very high explainability levels
+(explaining $74\%$ of observations on average), and in some cases even
+outperforming both the component explainable and black box models while
+improving explainability.
+
+
+
+
+
+
+
+
+ Yang Lu, Lin Chen, Yonggang Zhang, Yiliang Zhang, Bo Han, Yiu-ming Cheung, Hanzi Wang
+
+
+ Federated learning (FL) has shown remarkable success in cooperatively
+training deep models, while typically struggling with noisy labels. Advanced
+works propose to tackle label noise by a re-weighting strategy with a strong
+assumption, i.e., mild label noise. However, it may be violated in many
+real-world FL scenarios because of highly contaminated clients, resulting in
+extreme noise ratios, e.g., $>$90%. To tackle extremely noisy clients, we study
+the robustness of the re-weighting strategy, showing a pessimistic conclusion:
+minimizing the weight of clients trained over noisy data outperforms
+re-weighting strategies. To leverage models trained on noisy clients, we
+propose a novel approach, called negative distillation (FedNed). FedNed first
+identifies noisy clients and employs rather than discards the noisy clients in
+a knowledge distillation manner. In particular, clients identified as noisy
+ones are required to train models using noisy labels and pseudo-labels obtained
+by global models. The model trained on noisy labels serves as a `bad teacher'
+in knowledge distillation, aiming to decrease the risk of providing incorrect
+information. Meanwhile, the model trained on pseudo-labels is involved in model
+aggregation if not identified as a noisy client. Consequently, through
+pseudo-labeling, FedNed gradually increases the trustworthiness of models
+trained on noisy clients, while leveraging all clients for model aggregation
+through negative distillation. To verify the efficacy of FedNed, we conduct
+extensive experiments under various settings, demonstrating that FedNed can
+consistently outperform baselines and achieve state-of-the-art performance. Our
+code is available at https://github.com/linChen99/FedNed.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ DGCLUSTER: A Neural Framework for Attributed Graph Clustering via
+ Modularity Maximization AAAI'24
+
+
+ Graph clustering is a fundamental and challenging task in the field of graph
+mining where the objective is to group the nodes into clusters taking into
+consideration the topology of the graph. It has several applications in diverse
+domains spanning social network analysis, recommender systems, computer vision,
+and bioinformatics. In this work, we propose a novel method, DGCluster, which
+primarily optimizes the modularity objective using graph neural networks and
+scales linearly with the graph size. Our method does not require the number of
+clusters to be specified as a part of the input and can also leverage the
+availability of auxiliary node level information. We extensively test DGCluster
+on several real-world datasets of varying sizes, across multiple popular
+cluster quality metrics. Our approach consistently outperforms the
+state-of-the-art methods, demonstrating significant performance gains in almost
+all settings.
+
+
+
+ comment: Accepted to AAAI'24
+
+
+
+
+
+
+ ☆ How Good Are Deep Generative Models for Solving Inverse Problems?
+
+
+
+
+
+
+
+
+ Shichong Peng, Alireza Moazeni, Ke Li
+
+
+ Deep generative models, such as diffusion models, GANs, and IMLE, have shown
+impressive capability in tackling inverse problems. However, the validity of
+model-generated solutions w.r.t. the forward problem and the reliability of
+associated uncertainty estimates remain understudied. This study evaluates
+recent diffusion-based, GAN-based, and IMLE-based methods on three inverse
+problems, i.e., $16\times$ super-resolution, colourization, and image
+decompression. We assess the validity of these models' outputs as solutions to
+the inverse problems and conduct a thorough analysis of the reliability of the
+models' estimates of uncertainty over the solution. Overall, we find that the
+IMLE-based CHIMLE method outperforms other methods in terms of producing valid
+solutions and reliable uncertainty estimates.
+
+
+
+
+
+
+
+ ☆ CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data
+ and Language Models of Code
+
+
+
+
+
+
+
+
+ Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, Houari Sahraoui
+
+
+ Motivated by recent work on lifelong learning applications for language
+models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused
+on code changes. Our contribution addresses a notable research gap marked by
+the absence of a long-term temporal dimension in existing code change datasets,
+limiting their suitability in lifelong learning scenarios. In contrast, our
+dataset aims to comprehensively capture code changes across the entire release
+history of open-source software repositories. In this work, we introduce an
+initial version of CodeLL, comprising 71 machine-learning-based projects mined
+from Software Heritage. This dataset enables the extraction and in-depth
+analysis of code changes spanning 2,483 releases at both the method and API
+levels. CodeLL enables researchers studying the behaviour of LMs in lifelong
+fine-tuning settings for learning code changes. Additionally, the dataset can
+help studying data distribution shifts within software repositories and the
+evolution of API usages over time.
+
+
+
+ comment: 4+1 pages
+
+
+
+
+
+
+ ☆ Towards Efficient Verification of Quantized Neural Networks AAAI2024
+
+
+
+
+
+
+
+
+ Pei Huang, Haoze Wu, Yuting Yang, Ieva Daukantas, Min Wu, Yedi Zhang, Clark Barrett
+
+
+ Quantization replaces floating point arithmetic with integer arithmetic in
+deep neural network models, providing more efficient on-device inference with
+less power and memory. In this work, we propose a framework for formally
+verifying properties of quantized neural networks. Our baseline technique is
+based on integer linear programming which guarantees both soundness and
+completeness. We then show how efficiency can be improved by utilizing
+gradient-based heuristic search methods and also bound-propagation techniques.
+We evaluate our approach on perception networks quantized with PyTorch. Our
+results show that we can verify quantized networks with better scalability and
+efficiency than the previous state of the art.
+
+
+
+ comment: This paper has accepted by AAAI2024
+
+
+
+
+
+
+ ☆ Causal Discovery for fMRI data: Challenges, Solutions, and a Case Study
+
+
+
+
+
+
+
+
+ Eric Rawls, Bryan Andrews, Kelvin Lim, Erich Kummerfeld
+
+
+ Designing studies that apply causal discovery requires navigating many
+researcher degrees of freedom. This complexity is exacerbated when the study
+involves fMRI data. In this paper we (i) describe nine challenges that occur
+when applying causal discovery to fMRI data, (ii) discuss the space of
+decisions that need to be made, (iii) review how a recent case study made those
+decisions, (iv) and identify existing gaps that could potentially be solved by
+the development of new methods. Overall, causal discovery is a promising
+approach for analyzing fMRI data, and multiple successful applications have
+indicated that it is superior to traditional fMRI functional connectivity
+methods, but current causal discovery methods for fMRI leave room for
+improvement.
+
+
+
+
+
+
+
+ ☆ Combinatorial Gaussian Process Bandits in Bayesian Settings: Theory and
+ Application for Energy-Efficient Navigation
+
+
+ We consider a combinatorial Gaussian process semi-bandit problem with
+time-varying arm availability. Each round, an agent is provided a set of
+available base arms and must select a subset of them to maximize the long-term
+cumulative reward. Assuming the expected rewards are sampled from a Gaussian
+process (GP) over the arm space, the agent can efficiently learn. We study the
+Bayesian setting and provide novel Bayesian regret bounds for three GP-based
+algorithms: GP-UCB, Bayes-GP-UCB and GP-TS. Our bounds extend previous results
+for GP-UCB and GP-TS to a combinatorial setting with varying arm availability
+and to the best of our knowledge, we provide the first Bayesian regret bound
+for Bayes-GP-UCB. Time-varying arm availability encompasses other widely
+considered bandit problems such as contextual bandits. We formulate the online
+energy-efficient navigation problem as a combinatorial and contextual bandit
+and provide a comprehensive experimental study on synthetic and real-world road
+networks with detailed simulations. The contextual GP model obtains lower
+regret and is less dependent on the informativeness of the prior compared to
+the non-contextual Bayesian inference model. In addition, Thompson sampling
+obtains lower regret than Bayes-UCB for both the contextual and non-contextual
+model.
+
+
+
+ comment: 39 pages, 10 figures
+
+
+
+
+
+
+ ☆ Meta-Learning with Versatile Loss Geometries for Fast Adaptation Using
+ Mirror Descent ICASSP-24
+
+
+ Utilizing task-invariant prior knowledge extracted from related tasks,
+meta-learning is a principled framework that empowers learning a new task
+especially when data records are limited. A fundamental challenge in
+meta-learning is how to quickly "adapt" the extracted prior in order to train a
+task-specific model within a few optimization steps. Existing approaches deal
+with this challenge using a preconditioner that enhances convergence of the
+per-task training process. Though effective in representing locally a quadratic
+training loss, these simple linear preconditioners can hardly capture complex
+loss geometries. The present contribution addresses this limitation by learning
+a nonlinear mirror map, which induces a versatile distance metric to enable
+capturing and optimizing a wide range of loss geometries, hence facilitating
+the per-task training. Numerical tests on few-shot learning datasets
+demonstrate the superior expressiveness and convergence of the advocated
+approach.
+
+
+
+ comment: Accepted by 2024 IEEE International Conference on Acoustics, Speech
+ and Signal Processing (ICASSP-24)
+
+
+
+
+
+
+ ☆ Bayesian Transfer Learning
+
+
+
+
+
+
+
+
+ Piotr M. Suder, Jason Xu, David B. Dunson
+
+
+ Transfer learning is a burgeoning concept in statistical machine learning
+that seeks to improve inference and/or predictive accuracy on a domain of
+interest by leveraging data from related domains. While the term "transfer
+learning" has garnered much recent interest, its foundational principles have
+existed for years under various guises. Prior literature reviews in computer
+science and electrical engineering have sought to bring these ideas into focus,
+primarily surveying general methodologies and works from these disciplines.
+This article highlights Bayesian approaches to transfer learning, which have
+received relatively limited attention despite their innate compatibility with
+the notion of drawing upon prior knowledge to guide new learning tasks. Our
+survey encompasses a wide range of Bayesian transfer learning frameworks
+applicable to a variety of practical settings. We discuss how these methods
+address the problem of finding the optimal information to transfer between
+domains, which is a central question in transfer learning. We illustrate the
+utility of Bayesian transfer learning methods via a simulation study where we
+compare performance against frequentist competitors.
+
+
+
+
+
+
+
+ ☆ InvertibleNetworks.jl: A Julia package for scalable normalizing flows
+
+
+
+
+
+
+
+
+ Rafael Orozco, Philipp Witte, Mathias Louboutin, Ali Siahkoohi, Gabrio Rizzuti, Bas Peters, Felix J. Herrmann
+
+
+ InvertibleNetworks.jl is a Julia package designed for the scalable
+implementation of normalizing flows, a method for density estimation and
+sampling in high-dimensional distributions. This package excels in memory
+efficiency by leveraging the inherent invertibility of normalizing flows, which
+significantly reduces memory requirements during backpropagation compared to
+existing normalizing flow packages that rely on automatic differentiation
+frameworks. InvertibleNetworks.jl has been adapted for diverse applications,
+including seismic imaging, medical imaging, and CO2 monitoring, demonstrating
+its effectiveness in learning high-dimensional distributions.
+
+
+
+ comment: Submitted to Journal of Open Source Software (JOSS)
+
+
+
+
+
+
+ ☆ Accuracy vs Memory Advantage in the Quantum Simulation of Stochastic
+ Processes
+
+
+ Many inference scenarios rely on extracting relevant information from known
+data in order to make future predictions. When the underlying stochastic
+process satisfies certain assumptions, there is a direct mapping between its
+exact classical and quantum simulators, with the latter asymptotically using
+less memory. Here we focus on studying whether such quantum advantage persists
+when those assumptions are not satisfied, and the model is doomed to have
+imperfect accuracy. By studying the trade-off between accuracy and memory
+requirements, we show that quantum models can reach the same accuracy with less
+memory, or alternatively, better accuracy with the same memory. Finally, we
+discuss the implications of this result for learning tasks.
+
+
+
+
+
+
+
+ ☆ Neural feels with neural fields: Visuo-tactile perception for in-hand
+ manipulation
+
+
+
+
+
+
+
+
+ Sudharshan Suresh, Haozhi Qi, Tingfan Wu, Taosha Fan, Luis Pineda, Mike Lambeta, Jitendra Malik, Mrinal Kalakrishnan, Roberto Calandra, Michael Kaess, Joseph Ortiz, Mustafa Mukadam
+
+
+ To achieve human-level dexterity, robots must infer spatial awareness from
+multimodal sensing to reason over contact interactions. During in-hand
+manipulation of novel objects, such spatial awareness involves estimating the
+object's pose and shape. The status quo for in-hand perception primarily
+employs vision, and restricts to tracking a priori known objects. Moreover,
+visual occlusion of objects in-hand is imminent during manipulation, preventing
+current systems to push beyond tasks without occlusion. We combine vision and
+touch sensing on a multi-fingered hand to estimate an object's pose and shape
+during in-hand manipulation. Our method, NeuralFeels, encodes object geometry
+by learning a neural field online and jointly tracks it by optimizing a pose
+graph problem. We study multimodal in-hand perception in simulation and the
+real-world, interacting with different objects via a proprioception-driven
+policy. Our experiments show final reconstruction F-scores of $81$% and average
+pose drifts of $4.7\,\text{mm}$, further reduced to $2.3\,\text{mm}$ with known
+CAD models. Additionally, we observe that under heavy visual occlusion we can
+achieve up to $94$% improvements in tracking compared to vision-only methods.
+Our results demonstrate that touch, at the very least, refines and, at the very
+best, disambiguates visual estimates during in-hand manipulation. We release
+our evaluation dataset of 70 experiments, FeelSight, as a step towards
+benchmarking in this domain. Our neural representation driven by multimodal
+sensing can serve as a perception backbone towards advancing robot dexterity.
+Videos can be found on our project website
+https://suddhu.github.io/neural-feels/
+
+
+
+
+
+
+
+
+ Paris A. Karakasis, Nicholas D. Sidiropoulos
+
+
+ Canonical correlation analysis (CCA) is a classic statistical method for
+discovering latent co-variation that underpins two or more observed random
+vectors. Several extensions and variations of CCA have been proposed that have
+strengthened our capabilities in terms of revealing common random factors from
+multiview datasets. In this work, we first revisit the most recent
+deterministic extensions of deep CCA and highlight the strengths and
+limitations of these state-of-the-art methods. Some methods allow trivial
+solutions, while others can miss weak common factors. Others overload the
+problem by also seeking to reveal what is not common among the views -- i.e.,
+the private components that are needed to fully reconstruct each view. The
+latter tends to overload the problem and its computational and sample
+complexities. Aiming to improve upon these limitations, we design a novel and
+efficient formulation that alleviates some of the current restrictions. The
+main idea is to model the private components as conditionally independent given
+the common ones, which enables the proposed compact formulation. In addition,
+we also provide a sufficient condition for identifying the common random
+factors. Judicious experiments with synthetic and real datasets showcase the
+validity of our claims and the effectiveness of the proposed approach.
+
+
+
+
+
+
+
+ ☆ MixEHR-SurG: a joint proportional hazard and guided topic model for
+ inferring mortality-associated topics from electronic health records
+
+
+
+
+
+
+
+
+ Yixuan Li, Ariane Marelli, Archer Y. Yang, Yue Li
+
+
+ Objective: To improve survival analysis using EHR data, we aim to develop a
+supervised topic model called MixEHR-SurG to simultaneously integrate
+heterogeneous EHR data and model survival hazard.
+ Materials and Methods: Our technical contributions are three-folds: (1)
+integrating EHR topic inference with Cox proportional hazards likelihood; (2)
+inferring patient-specific topic hyperparameters using the PheCode concepts
+such that each topic can be identified with exactly one PheCode-associated
+phenotype; (3) multi-modal survival topic inference. This leads to a highly
+interpretable survival and guided topic model that can infer PheCode-specific
+phenotype topics associated with patient mortality. We evaluated MixEHR-G using
+a simulated dataset and two real-world EHR datasets: the Quebec Congenital
+Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient
+claim data of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458
+subjects with multi-modal EHR records.
+ Results: Compared to the baselines, MixEHR-G achieved a superior dynamic
+AUROC for mortality prediction, with a mean AUROC score of 0.89 in the
+simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively,
+MixEHR-G associates severe cardiac conditions with high mortality risk among
+the CHD patients after the first heart failure hospitalization and critical
+brain injuries with increased mortality among the MIMIC-III patients after
+their ICU discharge.
+ Conclusion: The integration of the Cox proportional hazards model and EHR
+topic inference in MixEHR-SurG led to not only competitive mortality prediction
+but also meaningful phenotype topics for systematic survival analysis. The
+software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
+
+
+
+
+
+
+
+ ☆ Learning the Factors Controlling Mineralization for Geologic Carbon
+ Sequestration
+
+
+
+
+
+
+
+
+ Aleksandra Pachalieva, Jeffrey D. Hyman, Daniel O'Malley, Hari Viswanathan, Gowri Srinivasan
+
+
+ We perform a set of flow and reactive transport simulations within
+three-dimensional fracture networks to learn the factors controlling mineral
+reactions. CO$_2$ mineralization requires CO$_2$-laden water, dissolution of a
+mineral that then leads to precipitation of a CO$_2$-bearing mineral. Our
+discrete fracture networks (DFN) are partially filled with quartz that
+gradually dissolves until it reaches a quasi-steady state. At the end of the
+simulation, we measure the quartz remaining in each fracture within the domain.
+We observe that a small backbone of fracture exists, where the quartz is fully
+dissolved which leads to increased flow and transport. However, depending on
+the DFN topology and the rate of dissolution, we observe a large variability of
+these changes, which indicates an interplay between the fracture network
+structure and the impact of geochemical dissolution. In this work, we developed
+a machine learning framework to extract the important features that support
+mineralization in the form of dissolution. In addition, we use structural and
+topological features of the fracture network to predict the remaining quartz
+volume in quasi-steady state conditions. As a first step to characterizing
+carbon mineralization, we study dissolution with this framework. We studied a
+variety of reaction and fracture parameters and their impact on the dissolution
+of quartz in fracture networks. We found that the dissolution reaction rate
+constant of quartz and the distance to the flowing backbone in the fracture
+network are the two most important features that control the amount of quartz
+left in the system. For the first time, we use a combination of a finite-volume
+reservoir model and graph-based approach to study reactive transport in a
+complex fracture network to determine the key features that control
+dissolution.
+
+
+
+ comment: 23 pages, 5 figures, 2 tables
+
+
+
+
+
+
+ ☆ Independent Mechanism Analysis and the Manifold Hypothesis
+
+
+
+
+
+
+
+
+ Shubhangi Ghosh, Luigi Gresele, Julius von Kügelgen, Michel Besserve, Bernhard Schölkopf
+
+
+ Independent Mechanism Analysis (IMA) seeks to address non-identifiability in
+nonlinear Independent Component Analysis (ICA) by assuming that the Jacobian of
+the mixing function has orthogonal columns. As typical in ICA, previous work
+focused on the case with an equal number of latent components and observed
+mixtures. Here, we extend IMA to settings with a larger number of mixtures that
+reside on a manifold embedded in a higher-dimensional than the latent space --
+in line with the manifold hypothesis in representation learning. For this
+setting, we show that IMA still circumvents several non-identifiability issues,
+suggesting that it can also be a beneficial principle for higher-dimensional
+observations when the manifold hypothesis holds. Further, we prove that the IMA
+principle is approximately satisfied with high probability (increasing with the
+number of observed mixtures) when the directions along which the latent
+components influence the observations are chosen independently at random. This
+provides a new and rigorous statistical interpretation of IMA.
+
+
+
+
+
+
+
+ ☆ A General Model for Aggregating Annotations Across Simple, Complex, and
+ Multi-Object Annotation Tasks
+
+
+
+
+
+
+
+
+ Alexander Braylan, Madalyn Marabella, Omar Alonso, Matthew Lease
+
+
+ Human annotations are vital to supervised learning, yet annotators often
+disagree on the correct label, especially as annotation tasks increase in
+complexity. A strategy to improve label quality is to ask multiple annotators
+to label the same item and aggregate their labels. Many aggregation models have
+been proposed for categorical or numerical annotation tasks, but far less work
+has considered more complex annotation tasks involving open-ended,
+multivariate, or structured responses. While a variety of bespoke models have
+been proposed for specific tasks, our work is the first to introduce
+aggregation methods that generalize across many diverse complex tasks,
+including sequence labeling, translation, syntactic parsing, ranking, bounding
+boxes, and keypoints. This generality is achieved by devising a task-agnostic
+method to model distances between labels rather than the labels themselves.
+ This article extends our prior work with investigation of three new research
+questions. First, how do complex annotation properties impact aggregation
+accuracy? Second, how should a task owner navigate the many modeling choices to
+maximize aggregation accuracy? Finally, what diagnoses can verify that
+aggregation models are specified correctly for the given data? To understand
+how various factors impact accuracy and to inform model selection, we conduct
+simulation studies and experiments on real, complex datasets. Regarding
+testing, we introduce unit tests for aggregation models and present a suite of
+such tests to ensure that a given model is not mis-specified and exhibits
+expected behavior.
+ Beyond investigating these research questions above, we discuss the
+foundational concept of annotation complexity, present a new aggregation model
+as a bridge between traditional models and our own, and contribute a new
+semi-supervised learning method for complex label aggregation that outperforms
+prior work.
+
+
+
+
+
+
+
+ ☆ Consistent Long-Term Forecasting of Ergodic Dynamical Systems
+
+
+
+
+
+
+
+
+ Prune Inzerilli, Vladimir Kostic, Karim Lounici, Pietro Novelli, Massimiliano Pontil
+
+
+ We study the evolution of distributions under the action of an ergodic
+dynamical system, which may be stochastic in nature. By employing tools from
+Koopman and transfer operator theory one can evolve any initial distribution of
+the state forward in time, and we investigate how estimators of these operators
+perform on long-term forecasting. Motivated by the observation that standard
+estimators may fail at this task, we introduce a learning paradigm that neatly
+combines classical techniques of eigenvalue deflation from operator theory and
+feature centering from statistics. This paradigm applies to any operator
+estimator based on empirical risk minimization, making them satisfy learning
+bounds which hold uniformly on the entire trajectory of future distributions,
+and abide to the conservation of mass for each of the forecasted distributions.
+Numerical experiments illustrates the advantages of our approach in practice.
+
+
+
+
+
+
+
+ ☆ Texture Matching GAN for CT Image Enhancement
+
+
+
+
+
+
+
+
+ Madhuri Nagare, Gregery T. Buzzard, Charles A. Bouman
+
+
+ Deep neural networks (DNN) are commonly used to denoise and sharpen X-ray
+computed tomography (CT) images with the goal of reducing patient X-ray dosage
+while maintaining reconstruction quality. However, naive application of
+DNN-based methods can result in image texture that is undesirable in clinical
+applications. Alternatively, generative adversarial network (GAN) based methods
+can produce appropriate texture, but naive application of GANs can introduce
+inaccurate or even unreal image detail. In this paper, we propose a texture
+matching generative adversarial network (TMGAN) that enhances CT images while
+generating an image texture that can be matched to a target texture. We use
+parallel generators to separate anatomical features from the generated texture,
+which allows the GAN to be trained to match the desired texture without
+directly affecting the underlying CT image. We demonstrate that TMGAN generates
+enhanced image quality while also producing image texture that is desirable for
+clinical application.
+
+
+
+ comment: Submitted to IEEE Transactions on Medical Imaging
+
+
+
+
+
+
+ ♻ ☆ Hard Regularization to Prevent Deep Online Clustering Collapse without
+ Data Augmentation
+
+
+ Online deep clustering refers to the joint use of a feature extraction
+network and a clustering model to assign cluster labels to each new data point
+or batch as it is processed. While faster and more versatile than offline
+methods, online clustering can easily reach the collapsed solution where the
+encoder maps all inputs to the same point and all are put into a single
+cluster. Successful existing models have employed various techniques to avoid
+this problem, most of which require data augmentation or which aim to make the
+average soft assignment across the dataset the same for each cluster. We
+propose a method that does not require data augmentation, and that, differently
+from existing methods, regularizes the hard assignments. Using a Bayesian
+framework, we derive an intuitive optimization objective that can be
+straightforwardly included in the training of the encoder network. Tested on
+four image datasets and one human-activity recognition dataset, it consistently
+avoids collapse more robustly than other methods and leads to more accurate
+clustering. We also conduct further experiments and analyses justifying our
+choice to regularize the hard cluster assignments. Code is available at
+https://github.com/Lou1sM/online_hard_clustering.
+
+
+
+
+
+
+
+
+ Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
+
+
+ The recent popularity of text-to-image diffusion models (DM) can largely be
+attributed to the intuitive interface they provide to users. The intended
+generation can be expressed in natural language, with the model producing
+faithful interpretations of text prompts. However, expressing complex or
+nuanced ideas in text alone can be difficult. To ease image generation, we
+propose MultiFusion that allows one to express complex and nuanced concepts
+with arbitrarily interleaved inputs of multiple modalities and languages.
+MutliFusion leverages pre-trained models and aligns them for integration into a
+cohesive system, thereby avoiding the need for extensive training from scratch.
+Our experimental results demonstrate the efficient transfer of capabilities
+from individual modules to the downstream model. Specifically, the fusion of
+all independent components allows the image generation module to utilize
+multilingual, interleaved multimodal inputs despite being trained solely on
+monomodal data in a single language.
+
+
+
+ comment: Proceedings of Advances in Neural Information Processing Systems:
+ Annual Conference on Neural Information Processing Systems (NeurIPS)
+
+
+
+
+
+
+ ♻ ☆ Online RL in Linearly $q^π$-Realizable MDPs Is as Easy as in Linear
+ MDPs If You Learn What to Ignore
+
+
+
+
+
+
+
+
+ Gellért Weisz, András György, Csaba Szepesvári
+
+
+ We consider online reinforcement learning (RL) in episodic Markov decision
+processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is
+assumed that the action-values of all policies can be expressed as linear
+functions of state-action features. This class is known to be more general than
+linear MDPs, where the transition kernel and the reward function are assumed to
+be linear functions of the feature vectors. As our first contribution, we show
+that the difference between the two classes is the presence of states in
+linearly $q^\pi$-realizable MDPs where for any policy, all the actions have
+approximately equal values, and skipping over these states by following an
+arbitrarily fixed policy in those states transforms the problem to a linear
+MDP. Based on this observation, we derive a novel (computationally inefficient)
+learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously
+learns what states should be skipped over and runs another learning algorithm
+on the linear MDP hidden in the problem. The method returns an
+$\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions
+with the MDP, where $H$ is the time horizon and $d$ is the dimension of the
+feature vectors, giving the first polynomial-sample-complexity online RL
+algorithm for this setting. The results are proved for the misspecified case,
+where the sample complexity is shown to degrade gracefully with the
+misspecification error.
+
+
+
+
+
+
+
+ ♻ ☆ FedECA: A Federated External Control Arm Method for Causal Inference
+ with Time-To-Event Data in Distributed Settings
+
+
+
+
+
+
+
+
+ Jean Ogier du Terrail, Quentin Klopfenstein, Honghao Li, Imke Mayer, Nicolas Loiseau, Mohammad Hallal, Félix Balazard, Mathieu Andreux
+
+
+ External control arms (ECA) can inform the early clinical development of
+experimental drugs and provide efficacy evidence for regulatory approval in
+non-randomized settings. However, the main challenge of implementing ECA lies
+in accessing real-world data or historical clinical trials. Indeed, data
+sharing is often not feasible due to privacy considerations related to data
+leaving the original collection centers, along with pharmaceutical companies'
+competitive motives. In this paper, we leverage a privacy-enhancing technology
+called federated learning (FL) to remove some of the barriers to data sharing.
+We introduce a federated learning inverse probability of treatment weighted
+(IPTW) method for time-to-event outcomes called FedECA which eases the
+implementation of ECA by limiting patients' data exposure. We show with
+extensive experiments that FedECA outperforms its closest competitor,
+matching-adjusted indirect comparison (MAIC), in terms of statistical power and
+ability to balance the treatment and control groups. To encourage the use of
+such methods, we publicly release our code which relies on Substra, an
+open-source FL software with proven experience in privacy-sensitive contexts.
+
+
+
+ comment: code available at: https://github.com/owkin/fedeca, fixed some typos,
+ figures and acknowledgments in v2
+
+ Session-based recommendation, which aims to predict the next item of users'
+interest as per an existing sequence interaction of items, has attracted
+growing applications of Contrastive Learning (CL) with improved user and item
+representations. However, these contrastive objectives: (1) serve a similar
+role as the cross-entropy loss while ignoring the item representation space
+optimisation; and (2) commonly require complicated modelling, including complex
+positive/negative sample constructions and extra data augmentation. In this
+work, we introduce Self-Contrastive Learning (SCL), which simplifies the
+application of CL and enhances the performance of state-of-the-art CL-based
+recommendation techniques. Specifically, SCL is formulated as an objective
+function that directly promotes a uniform distribution among item
+representations and efficiently replaces all the existing contrastive objective
+components of state-of-the-art models. Unlike previous works, SCL eliminates
+the need for any positive/negative sample construction or data augmentation,
+leading to enhanced interpretability of the item representation space and
+facilitating its extensibility to existing recommender systems. Through
+experiments on three benchmark datasets, we demonstrate that SCL consistently
+improves the performance of state-of-the-art models with statistical
+significance. Notably, our experiments show that SCL improves the performance
+of two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and
+11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks.
+Additionally, our analysis elucidates the improvement in terms of alignment and
+uniformity of representations, as well as the effectiveness of SCL with a low
+computational cost.
+
+
+
+ comment: ECIR 2024 (Full Paper) Camera-ready Version. Code is available at
+ https://github.com/ZhengxiangShi/SelfContrastiveLearningRecSys
+
+
+
+
+
+
+ ♻ ☆ On the Number of Regions of Piecewise Linear Neural Networks
+
+
+
+
+
+
+
+
+ Alexis Goujon, Arian Etemadi, Michael Unser
+
+
+ Many feedforward neural networks (NNs) generate continuous and
+piecewise-linear (CPWL) mappings. Specifically, they partition the input domain
+into regions on which the mapping is affine. The number of these so-called
+linear regions offers a natural metric to characterize the expressiveness of
+CPWL NNs. The precise determination of this quantity is often out of reach in
+practice, and bounds have been proposed for specific architectures, including
+for ReLU and Maxout NNs. In this work, we generalize these bounds to NNs with
+arbitrary and possibly multivariate CPWL activation functions. We first provide
+upper and lower bounds on the maximal number of linear regions of a CPWL NN
+given its depth, width, and the number of linear regions of its activation
+functions. Our results rely on the combinatorial structure of convex partitions
+and confirm the distinctive role of depth which, on its own, is able to
+exponentially increase the number of regions. We then introduce a complementary
+stochastic framework to estimate the average number of linear regions produced
+by a CPWL NN. Under reasonable assumptions, the expected density of linear
+regions along any 1D path is bounded by the product of depth, width, and a
+measure of activation complexity (up to a scaling factor). This yields an
+identical role to the three sources of expressiveness: no exponential growth
+with depth is observed anymore.
+
+
+ This research delves into the intricate landscape of Musculoskeletal Disorder
+(MSD) risk factors, employing a novel fusion of Natural Language Processing
+(NLP) techniques and mode-based ranking methodologies. The primary objective is
+to advance the comprehension of MSD risk factors, their classification, and
+their relative severity, facilitating more targeted preventive and management
+interventions. The study utilizes eight diverse models, integrating pre-trained
+transformers, cosine similarity, and various distance metrics to classify risk
+factors into personal, biomechanical, workplace, psychological, and
+organizational classes. Key findings reveal that the BERT model with cosine
+similarity attains an overall accuracy of 28%, while the sentence transformer,
+coupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a
+flawless accuracy score of 100%. In tandem with the classification efforts, the
+research employs a mode-based ranking approach on survey data to discern the
+severity hierarchy of MSD risk factors. Intriguingly, the rankings align
+precisely with the previous literature, reaffirming the consistency and
+reliability of the approach. ``Working posture" emerges as the most severe risk
+factor, emphasizing the critical role of proper posture in preventing MSDs. The
+collective perceptions of survey participants underscore the significance of
+factors like "Job insecurity," "Effort reward imbalance," and "Poor employee
+facility" in contributing to MSD risks. The convergence of rankings provides
+actionable insights for organizations aiming to reduce the prevalence of MSDs.
+The study concludes with implications for targeted interventions,
+recommendations for improving workplace conditions, and avenues for future
+research.
+
+
+ We study the task of $(\epsilon, \delta)$-differentially private online
+convex optimization (OCO). In the online setting, the release of each distinct
+decision or iterate carries with it the potential for privacy loss. This
+problem has a long history of research starting with Jain et al. [2012] and the
+best known results for the regime of {\epsilon} not being very small are
+presented in Agarwal et al. [2023]. In this paper we improve upon the results
+of Agarwal et al. [2023] in terms of the dimension factors as well as removing
+the requirement of smoothness. Our results are now the best known rates for
+DP-OCO in this regime.
+ Our algorithms builds upon the work of [Asi et al., 2023] which introduced
+the idea of explicitly limiting the number of switches via rejection sampling.
+The main innovation in our algorithm is the use of sampling from a strongly
+log-concave density which allows us to trade-off the dimension factors better
+leading to improved results.
+
+
+
+
+
+
+
+ ♻ ☆ Automatic and effective discovery of quantum kernels
+
+
+
+
+
+
+
+
+ Massimiliano Incudini, Daniele Lizzio Bosco, Francesco Martini, Michele Grossi, Giuseppe Serra, Alessandra Di Pierro
+
+
+ Quantum computing can empower machine learning models by enabling kernel
+machines to leverage quantum kernels for representing similarity measures
+between data. Quantum kernels are able to capture relationships in the data
+that are not efficiently computable on classical devices. However, there is no
+straightforward method to engineer the optimal quantum kernel for each specific
+use case. While recent literature has focused on exploiting the potential
+offered by the presence of symmetries in the data to guide the construction of
+quantum kernels, we adopt here a different approach, which employs optimization
+techniques, similar to those used in neural architecture search and AutoML, to
+automatically find an optimal kernel in a heuristic manner. The algorithm we
+present constructs a quantum circuit implementing the similarity measure as a
+combinatorial object, which is evaluated based on a cost function and is then
+iteratively modified using a meta-heuristic optimization technique. The cost
+function can encode many criteria ensuring favorable statistical properties of
+the candidate solution, such as the rank of the Dynamical Lie Algebra.
+Importantly, our approach is independent of the optimization technique
+employed. The results obtained by testing our approach on a high-energy physics
+problem demonstrate that, in the best-case scenario, we can either match or
+improve testing accuracy with respect to the manual design approach, showing
+the potential of our technique to deliver superior results with reduced effort.
+
+
+
+
+
+
+
+ ♻ ☆ One step closer to unbiased aleatoric uncertainty estimation
+
+
+ Neural networks are powerful tools in various applications, and quantifying
+their uncertainty is crucial for reliable decision-making. In the deep learning
+field, the uncertainties are usually categorized into aleatoric (data) and
+epistemic (model) uncertainty. In this paper, we point out that the existing
+popular variance attenuation method highly overestimates aleatoric uncertainty.
+To address this issue, we propose a new estimation method by actively
+de-noising the observed data. By conducting a broad range of experiments, we
+demonstrate that our proposed approach provides a much closer approximation to
+the actual data uncertainty than the standard method.
+
+
+
+
+
+
+
+
+ Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
+
+
+ Multimedia generation approaches occupy a prominent place in artificial
+intelligence research. Text-to-image models achieved high-quality results over
+the last few years. However, video synthesis methods recently started to
+develop. This paper presents a new two-stage latent diffusion text-to-video
+generation architecture based on the text-to-image diffusion model. The first
+stage concerns keyframes synthesis to figure the storyline of a video, while
+the second one is devoted to interpolation frames generation to make movements
+of the scene and objects smooth. We compare several temporal conditioning
+approaches for keyframes generation. The results show the advantage of using
+separate temporal blocks over temporal layers in terms of metrics reflecting
+video generation quality aspects and human preference. The design of our
+interpolation model significantly reduces computational costs compared to other
+masked frame interpolation approaches. Furthermore, we evaluate different
+configurations of MoVQ-based video decoding scheme to improve consistency and
+achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our
+pipeline with existing solutions and achieve top-2 scores overall and top-1
+among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
+https://ai-forever.github.io/kandinsky-video/
+
+
+ In this paper, we revisit the bilevel optimization problem, in which the
+upper-level objective function is generally nonconvex and the lower-level
+objective function is strongly convex. Although this type of problem has been
+studied extensively, it still remains an open question how to achieve an
+${O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic
+bilevel optimization without any second-order derivative computation. To fill
+this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named
+FdeHBO, which features a simple fully single-loop structure, a projection-aided
+finite-difference Hessian/Jacobian-vector approximation, and momentum-based
+updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$
+iterations (each using ${O}(1)$ samples and only first-order gradient
+information) to find an $\epsilon$-accurate stationary point. As far as we
+know, this is the first Hessian/Jacobian-free method with an
+${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex
+stochastic bilevel optimization.
+
+
+
+
+
+
+
+ ♻ ☆ OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in
+ Noisy Environments AAAI 2024
+
+
+
+
+
+
+
+
+ Jinyi Liu, Zhi Wang, Yan Zheng, Jianye Hao, Chenjia Bai, Junjie Ye, Zhen Wang, Haiyin Piao, Yang Sun
+
+
+ In reinforcement learning, the optimism in the face of uncertainty (OFU) is a
+mainstream principle for directing exploration towards less explored areas,
+characterized by higher uncertainty. However, in the presence of environmental
+stochasticity (noise), purely optimistic exploration may lead to excessive
+probing of high-noise areas, consequently impeding exploration efficiency.
+Hence, in exploring noisy environments, while optimism-driven exploration
+serves as a foundation, prudent attention to alleviating unnecessary
+over-exploration in high-noise areas becomes beneficial. In this work, we
+propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a
+noise-aware optimistic exploration for continuous control. OVD-Explorer
+proposes a new measurement of the policy's exploration ability considering
+noise in optimistic perspectives, and leverages gradient ascent to drive
+exploration. Practically, OVD-Explorer can be easily integrated with continuous
+control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks
+demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic
+exploration.
+
+
+
+ comment: Accepted by AAAI 2024, with appendix
+
+
+
+
+
+
+ ♻ ☆ Forecasting Trends in Food Security: a Reservoir Computing Approach
+
+
+ Early warning systems are an essential tool for effective humanitarian
+action. Advance warnings on impending disasters facilitate timely and targeted
+response which help save lives, livelihoods, and scarce financial resources. In
+this work we present a new quantitative methodology to forecast levels of food
+consumption for 60 consecutive days, at the sub-national level, in four
+countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on
+publicly available data from the World Food Programme's integrated global
+hunger monitoring system which collects, processes, and displays daily updates
+on key food security metrics, conflict, weather events, and other drivers of
+food insecurity across 90 countries (https://hungermap.wfp.org/). In this
+study, we assessed the performance of various models including ARIMA, XGBoost,
+LSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared
+Error (RMSE) metrics. This comprehensive analysis spanned classical
+statistical, machine learning, and deep learning approaches. Our findings
+highlight Reservoir Computing as a particularly well-suited model in the field
+of food security given both its notable resistance to over-fitting on limited
+data samples and its efficient training capabilities. The methodology we
+introduce establishes the groundwork for a global, data-driven early warning
+system designed to anticipate and detect food insecurity.
+
+
+
+
+
+
+
+ ♻ ☆ Covariance Adaptive Best Arm Identification
+
+
+
+
+
+
+
+
+ El Mehdi Saad, Gilles Blanchard, Nicolas Verzelen
+
+
+ We consider the problem of best arm identification in the multi-armed bandit
+model, under fixed confidence. Given a confidence input $\delta$, the goal is
+to identify the arm with the highest mean reward with a probability of at least
+1 -- $\delta$, while minimizing the number of arm pulls. While the literature
+provides solutions to this problem under the assumption of independent arms
+distributions, we propose a more flexible scenario where arms can be dependent
+and rewards can be sampled simultaneously. This framework allows the learner to
+estimate the covariance among the arms distributions, enabling a more efficient
+identification of the best arm. The relaxed setting we propose is relevant in
+various applications, such as clinical trials, where similarities between
+patients or drugs suggest underlying correlations in the outcomes. We introduce
+new algorithms that adapt to the unknown covariance of the arms and demonstrate
+through theoretical guarantees that substantial improvement can be achieved
+over the standard setting. Additionally, we provide new lower bounds for the
+relaxed setting and present numerical simulations that support their
+theoretical findings.
+
+
+
+ comment: New version with some minor corrections
+
+
+
+
+
+
+
+ Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu
+
+
+ Error correction in automatic speech recognition (ASR) aims to correct those
+incorrect words in sentences generated by ASR models. Since recent ASR models
+usually have low word error rate (WER), to avoid affecting originally correct
+tokens, error correction models should only modify incorrect words, and
+therefore detecting incorrect words is important for error correction. Previous
+works on error correction either implicitly detect error words through
+target-source attention or CTC (connectionist temporal classification) loss, or
+explicitly locate specific deletion/substitution/insertion errors. However,
+implicit error detection does not provide clear signal about which tokens are
+incorrect and explicit error detection suffers from low detection accuracy. In
+this paper, we propose SoftCorrect with a soft error detection mechanism to
+avoid the limitations of both explicit and implicit error detection.
+Specifically, we first detect whether a token is correct or not through a
+probability produced by a dedicatedly designed language model, and then design
+a constrained CTC loss that only duplicates the detected incorrect tokens to
+let the decoder focus on the correction of error tokens. Compared with implicit
+error detection with CTC loss, SoftCorrect provides explicit signal about which
+words are incorrect and thus does not need to duplicate every token but only
+incorrect tokens; compared with explicit error detection, SoftCorrect does not
+detect specific deletion/substitution/insertion errors but just leaves it to
+CTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that
+SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming
+previous works by a large margin, while still enjoying fast speed of parallel
+generation.
+
+
+
+
+
+
+
+
+ Faïcel Chamroukhi, Nhat Thien Pham, Van Hà Hoang, Geoffrey J. McLachlan
+
+
+ We consider the statistical analysis of heterogeneous data for prediction in
+situations where the observations include functions, typically time series. We
+extend the modeling with Mixtures-of-Experts (ME), as a framework of choice in
+modeling heterogeneity in data for prediction with vectorial observations, to
+this functional data analysis context. We first present a new family of ME
+models, named functional ME (FME) in which the predictors are potentially noisy
+observations, from entire functions. Furthermore, the data generating process
+of the predictor and the real response, is governed by a hidden discrete
+variable representing an unknown partition. Second, by imposing sparsity on
+derivatives of the underlying functional parameters via Lasso-like
+regularizations, we provide sparse and interpretable functional representations
+of the FME models called iFME. We develop dedicated expectation--maximization
+algorithms for Lasso-like (EM-Lasso) regularized maximum-likelihood parameter
+estimation strategies to fit the models. The proposed models and algorithms are
+studied in simulated scenarios and in applications to two real data sets, and
+the obtained results demonstrate their performance in accurately capturing
+complex nonlinear relationships and in clustering the heterogeneous regression
+data.
+
+
+ Diffusion model (DM), as a powerful generative model, recently achieved huge
+success in various scenarios including offline reinforcement learning, where
+the policy learns to conduct planning by generating trajectory in the online
+evaluation. However, despite the effectiveness shown for single-agent learning,
+it remains unclear how DMs can operate in multi-agent problems, where agents
+can hardly complete teamwork without good coordination by independently
+modeling each agent's trajectories. In this paper, we propose MADiff, a novel
+generative multi-agent learning framework to tackle this problem. MADiff is
+realized with an attention-based diffusion model to model the complex
+coordination among behaviors of multiple diffusion agents. To the best of our
+knowledge, MADiff is the first diffusion-based multi-agent offline RL
+framework, which behaves as both a decentralized policy and a centralized
+controller. During decentralized executions, MADiff simultaneously performs
+teammate modeling, and the centralized controller can also be applied in
+multi-agent trajectory predictions. Our experiments show the superior
+performance of MADiff compared to baseline algorithms in a wide range of
+multi-agent learning tasks, which emphasizes the effectiveness of MADiff in
+modeling complex multi-agent interactions. Our code is available at
+https://github.com/zbzhu99/madiff.
+
+
+
+ comment: 20 pages, 10 figures, 6 tables. The first two authors contributed
+ equally to the work
+
+ Accurate uncertainty quantification is necessary to enhance the reliability
+of deep learning models in real-world applications. In the case of regression
+tasks, prediction intervals (PIs) should be provided along with the
+deterministic predictions of deep learning models. Such PIs are useful or
+"high-quality" as long as they are sufficiently narrow and capture most of the
+probability density. In this paper, we present a method to learn prediction
+intervals for regression-based neural networks automatically in addition to the
+conventional target predictions. In particular, we train two companion neural
+networks: one that uses one output, the target estimate, and another that uses
+two outputs, the upper and lower bounds of the corresponding PI. Our main
+contribution is the design of a novel loss function for the PI-generation
+network that takes into account the output of the target-estimation network and
+has two optimization objectives: minimizing the mean prediction interval width
+and ensuring the PI integrity using constraints that maximize the prediction
+interval probability coverage implicitly. Furthermore, we introduce a
+self-adaptive coefficient that balances both objectives within the loss
+function, which alleviates the task of fine-tuning. Experiments using a
+synthetic dataset, eight benchmark datasets, and a real-world crop yield
+prediction dataset showed that our method was able to maintain a nominal
+probability coverage and produce significantly narrower PIs without detriment
+to its target estimation accuracy when compared to those PIs generated by three
+state-of-the-art neural-network-based methods. In other words, our method was
+shown to produce higher-quality PIs.
+
+
+
+ comment: Accepted at the IEEE Transactions on Neural Networks and Learning
+ Systems
+
+ Graph neural networks (GNN) are increasingly used to classify EEG for tasks
+such as emotion recognition, motor imagery and neurological diseases and
+disorders. A wide range of methods have been proposed to design GNN-based
+classifiers. Therefore, there is a need for a systematic review and
+categorisation of these approaches. We exhaustively search the published
+literature on this topic and derive several categories for comparison. These
+categories highlight the similarities and differences among the methods. The
+results suggest a prevalence of spectral graph convolutional layers over
+spatial. Additionally, we identify standard forms of node features, with the
+most popular being the raw EEG signal and differential entropy. Our results
+summarise the emerging trends in GNN-based approaches for EEG classification.
+Finally, we discuss several promising research directions, such as exploring
+the potential of transfer learning methods and appropriate modelling of
+cross-frequency interactions.
+
+
+
+
+
+
+
+
+ Md Zobaer Islam, Brenden Martin, Carly Gotcher, Tyler Martinez, John F. O'Hara, Sabit Ekin
+
+
+ Human respiratory rate and its pattern convey essential information about the
+physical and psychological states of the subject. Abnormal breathing can
+indicate fatal health issues leading to further diagnosis and treatment.
+Wireless light-wave sensing (LWS) using incoherent infrared light shows promise
+in safe, discreet, efficient, and non-invasive human breathing monitoring
+without raising privacy concerns. The respiration monitoring system needs to be
+trained on different types of breathing patterns to identify breathing
+anomalies.The system must also validate the collected data as a breathing
+waveform, discarding any faulty data caused by external interruption, user
+movement, or system malfunction. To address these needs, this study simulated
+normal and different types of abnormal respiration using a robot that mimics
+human breathing patterns. Then, time-series respiration data were collected
+using infrared light-wave sensing technology. Three machine learning
+algorithms, decision tree, random forest and XGBoost, were applied to detect
+breathing anomalies and faulty data. Model performances were evaluated through
+cross-validation, assessing classification accuracy, precision and recall
+scores. The random forest model achieved the highest classification accuracy of
+96.75% with data collected at a 0.5m distance. In general, ensemble models like
+random forest and XGBoost performed better than a single model in classifying
+the data collected at multiple distances from the light-wave sensing setup.
+
+
+
+ comment: 12 pages, 15 figures excluding photos of authors, submitted to IEEE
+ Transactions on Human-machine Systems
+
+
+
+
+
+
+ ♻ ☆ A Framework for Interpretability in Machine Learning for Medical Imaging
+
+
+
+
+
+
+
+
+ Alan Q. Wang, Batuhan K. Karaman, Heejong Kim, Jacob Rosenthal, Rachit Saluja, Sean I. Young, Mert R. Sabuncu
+
+
+ Interpretability for machine learning models in medical imaging (MLMI) is an
+important direction of research. However, there is a general sense of murkiness
+in what interpretability means. Why does the need for interpretability in MLMI
+arise? What goals does one actually seek to address when interpretability is
+needed? To answer these questions, we identify a need to formalize the goals
+and elements of interpretability in MLMI. By reasoning about real-world tasks
+and goals common in both medical image analysis and its intersection with
+machine learning, we identify five core elements of interpretability:
+localization, visual recognizability, physical attribution, model transparency,
+and actionability. From this, we arrive at a framework for interpretability in
+MLMI, which serves as a step-by-step guide to approaching interpretability in
+this context. Overall, this paper formalizes interpretability needs in the
+context of medical imaging, and our applied perspective clarifies concrete
+MLMI-specific goals and considerations in order to guide method design and
+improve real-world usage. Our goal is to provide practical and didactic
+information for model designers and practitioners, inspire developers of models
+in the medical imaging field to reason more deeply about what interpretability
+is achieving, and suggest future directions of interpretability research.
+
+
+
+
+
+
+
+ ♻ ☆ Learning Lattice Quantum Field Theories with Equivariant Continuous
+ Flows
+
+
+
+
+
+
+
+
+ Mathis Gerdes, Pim de Haan, Corrado Rainone, Roberto Bondesan, Miranda C. N. Cheng
+
+
+ We propose a novel machine learning method for sampling from the
+high-dimensional probability distributions of Lattice Field Theories, which is
+based on a single neural ODE layer and incorporates the full symmetries of the
+problem. We test our model on the $\phi^4$ theory, showing that it
+systematically outperforms previously proposed flow-based methods in sampling
+efficiency, and the improvement is especially pronounced for larger lattices.
+Furthermore, we demonstrate that our model can learn a continuous family of
+theories at once, and the results of learning can be transferred to larger
+lattices. Such generalizations further accentuate the advantages of machine
+learning methods.
+
+
+
+ comment: 17 pages, 9 figures, 1 table; slightly expanded published version,
+ added 2 figures and 2 sections to appendix
+
+
+
+
+
+
+ ♻ ☆ From system models to class models: An in-context learning paradigm
+
+
+ Is it possible to understand the intricacies of a dynamical system not solely
+from its input/output pattern, but also by observing the behavior of other
+systems within the same class? This central question drives the study presented
+in this paper.
+ In response to this query, we introduce a novel paradigm for system
+identification, addressing two primary tasks: one-step-ahead prediction and
+multi-step simulation. Unlike conventional methods, we do not directly estimate
+a model for the specific system. Instead, we learn a meta model that represents
+a class of dynamical systems. This meta model is trained on a potentially
+infinite stream of synthetic data, generated by simulators whose settings are
+randomly extracted from a probability distribution. When provided with a
+context from a new system-specifically, an input/output sequence-the meta model
+implicitly discerns its dynamics, enabling predictions of its behavior.
+ The proposed approach harnesses the power of Transformers, renowned for their
+\emph{in-context learning} capabilities. For one-step prediction, a GPT-like
+decoder-only architecture is utilized, whereas the simulation problem employs
+an encoder-decoder structure. Initial experimental results affirmatively answer
+our foundational question, opening doors to fresh research avenues in system
+identification.
+
+
+
+
+
+
+
+ ♻ ☆ Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with
+ Multi-Step On-Policy Optimization
+
+
+
+
+
+
+
+
+ Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, Huazhe Xu
+
+
+ Combining offline and online reinforcement learning (RL) is crucial for
+efficient and safe learning. However, previous approaches treat offline and
+online learning as separate procedures, resulting in redundant designs and
+limited performance. We ask: Can we achieve straightforward yet effective
+offline and online learning without introducing extra conservatism or
+regularization? In this study, we propose Uni-o4, which utilizes an on-policy
+objective for both offline and online learning. Owning to the alignment of
+objectives in two phases, the RL agent can transfer between offline and online
+learning seamlessly. This property enhances the flexibility of the learning
+paradigm, allowing for arbitrary combinations of pretraining, fine-tuning,
+offline, and online learning. In the offline phase, specifically, Uni-o4
+leverages diverse ensemble policies to address the mismatch issues between the
+estimated behavior policy and the offline dataset. Through a simple offline
+policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy
+improvement safely. We demonstrate that by employing the method above, the
+fusion of these two paradigms can yield superior offline initialization as well
+as stable and rapid online fine-tuning capabilities. Through real-world robot
+tasks, we highlight the benefits of this paradigm for rapid deployment in
+challenging, previously unseen real-world environments. Additionally, through
+comprehensive evaluations using numerous simulated benchmarks, we substantiate
+that our method achieves state-of-the-art performance in both offline and
+offline-to-online fine-tuning learning. Our website:
+https://lei-kun.github.io/uni-o4/ .
+
+
+ Hierarchy is an important and commonly observed topological property in
+real-world graphs that indicate the relationships between supervisors and
+subordinates or the organizational behavior of human groups. As hierarchy is
+introduced as a new inductive bias into the Graph Neural Networks (GNNs) in
+various tasks, it implies latent topological relations for attackers to improve
+their inference attack performance, leading to serious privacy leakage issues.
+In addition, existing privacy-preserving frameworks suffer from reduced
+protection ability in hierarchical propagation due to the deficiency of
+adaptive upper-bound estimation of the hierarchical perturbation boundary. It
+is of great urgency to effectively leverage the hierarchical property of data
+while satisfying privacy guarantees. To solve the problem, we propose the
+Poincar\'e Differential Privacy framework, named PoinDP, to protect the
+hierarchy-aware graph embedding based on hyperbolic geometry. Specifically,
+PoinDP first learns the hierarchy weights for each entity based on the
+Poincar\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware
+Sensitivity is designed to measure the sensitivity of the hierarchical
+structure and adaptively allocate the privacy protection strength. Besides, the
+Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian
+mechanism in Euclidean space to hyperbolic space to realize random
+perturbations that satisfy differential privacy under the hyperbolic space
+metric. Extensive experiment results on five real-world datasets demonstrate
+the proposed PoinDP's advantages of effective privacy protection while
+maintaining good performance on the node classification task.
+
+
+
+
+
+
+
+
+ Johannes Aspman, Georgios Korpas, Jakub Marecek
+
+
+ There has been a great deal of recent interest in binarized neural networks,
+especially because of their explainability. At the same time, automatic
+differentiation algorithms such as backpropagation fail for binarized neural
+networks, which limits their applicability. By reformulating the problem of
+training binarized neural networks as a subadditive dual of a mixed-integer
+program, we show that binarized neural networks admit a tame representation.
+This, in turn, makes it possible to use the framework of Bolte et al. for
+implicit differentiation, which offers the possibility for practical
+implementation of backpropagation in the context of binarized neural networks.
+ This approach could also be used for a broader class of mixed-integer
+programs, beyond the training of binarized neural networks, as encountered in
+symbolic approaches to AI and beyond.
+
+
+ Ensuring fairness in Recommendation Systems (RSs) across demographic groups
+is critical due to the increased integration of RSs in applications such as
+personalized healthcare, finance, and e-commerce. Graph-based RSs play a
+crucial role in capturing intricate higher-order interactions among entities.
+However, integrating these graph models into the Federated Learning (FL)
+paradigm with fairness constraints poses formidable challenges as this requires
+access to the entire interaction graph and sensitive user information (such as
+gender, age, etc.) at the central server. This paper addresses the pervasive
+issue of inherent bias within RSs for different demographic groups without
+compromising the privacy of sensitive user attributes in FL environment with
+the graph-based model. To address the group bias, we propose F2PGNN (Fair
+Federated Personalized Graph Neural Network), a novel framework that leverages
+the power of Personalized Graph Neural Network (GNN) coupled with fairness
+considerations. Additionally, we use differential privacy techniques to fortify
+privacy protection. Experimental evaluation on three publicly available
+datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47%
+- 99% compared to the state-of-the-art while preserving privacy and maintaining
+the utility. The results validate the significance of our framework in
+achieving equitable and personalized recommendations using GNN within the FL
+landscape.
+
+
+
+ comment: To appear as a full paper in AAAI 2024
+
+
+
+
+
+
+
+ Jacopo Bonato, Francesco Pelosin, Luigi Sabetta, Alessandro Nicolosi
+
+
+ The recent surge of pervasive devices that generate dynamic data streams has
+underscored the necessity for learning systems to adapt continually to data
+distributional shifts. To tackle this challenge, the research community has put
+forth a spectrum of methodologies, including the demanding pursuit of
+class-incremental learning without replay data. In this study, we present MIND,
+a parameter isolation method that aims to significantly enhance the performance
+of replay-free solutions and achieve state-of-the-art results on several widely
+studied datasets. Our approach introduces two main contributions: two
+alternative distillation procedures that significantly improve the efficiency
+of MIND increasing the accumulated knowledge of each sub-network, and the
+optimization of the BachNorm layers across tasks inside the sub-networks.
+Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free
+Class-Incremental learning (with an increment in classification accuracy of
+approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx.
++40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each
+contribution to demonstrate its impact on performance improvement. Our results
+showcase the superior performance of MIND indicating its potential for
+addressing the challenges posed by Class-incremental and Domain-Incremental
+learning in resource-constrained environments.
+
+
+
+ comment: Accepted at the 38th AAAI Conference on Artificial Intelligence
+
+ Protein-ligand binding affinity (PLBA) prediction is the fundamental task in
+drug discovery. Recently, various deep learning-based models predict binding
+affinity by incorporating the three-dimensional structure of protein-ligand
+complexes as input and achieving astounding progress. However, due to the
+scarcity of high-quality training data, the generalization ability of current
+models is still limited. In addition, different bioassays use varying affinity
+measurement labels (i.e., IC50, Ki, Kd), and different experimental conditions
+inevitably introduce systematic noise, which poses a significant challenge to
+constructing high-precision affinity prediction models. To address these
+issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training
+framework for structure-based PLBA prediction; (2) construct a pre-training
+dataset called ChEMBL-Dock with more than 300k experimentally measured affinity
+labels and about 2.8M docked three-dimensional structures. By introducing
+multi-task pre-training to treat the prediction of different affinity labels as
+different tasks and classifying relative rankings between samples from the same
+bioassay, MBP learns robust and transferrable structural knowledge from our new
+ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the
+capability of MBP as a general framework that can improve and be tailored to
+mainstream structure-based PLBA prediction tasks. To the best of our knowledge,
+MBP is the first affinity pre-training model and shows great potential for
+future development.
+
+
+
+ comment: 21 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy
+ Learning
+
+
+ We propose a simple and general framework for nonparametric estimation of
+heterogeneous treatment effects under fairness constraints. Under standard
+regularity conditions, we show that the resulting estimators possess the double
+robustness property. We use this framework to characterize the trade-off
+between fairness and the maximum welfare achievable by the optimal policy. We
+evaluate the methods in a simulation study and illustrate them in a real-world
+case study.
+
+
+
+
+
+
+
+
+ Alexis Goujon, Sebastian Neumayer, Michael Unser
+
+
+ We propose to learn non-convex regularizers with a prescribed upper bound on
+their weak-convexity modulus. Such regularizers give rise to variational
+denoisers that minimize a convex energy. They rely on few parameters (less than
+15,000) and offer a signal-processing interpretation as they mimic handcrafted
+sparsity-promoting regularizers. Through numerical experiments, we show that
+such denoisers outperform convex-regularization methods as well as the popular
+BM3D denoiser. Additionally, the learned regularizer can be deployed to solve
+inverse problems with iterative schemes that provably converge. For both CT and
+MRI reconstruction, the regularizer generalizes well and offers an excellent
+tradeoff between performance, number of parameters, guarantees, and
+interpretability when compared to other data-driven approaches.
+
+
+
+
+
+
+
+ ♻ ☆ Contextual Pre-Planning on Reward Machine Abstractions for Enhanced
+ Transfer in Deep Reinforcement Learning AAAI
+
+
+
+
+
+
+
+
+ Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
+
+
+ Recent studies show that deep reinforcement learning (DRL) agents tend to
+overfit to the task on which they were trained and fail to adapt to minor
+environment changes. To expedite learning when transferring to unseen tasks, we
+propose a novel approach to representing the current task using reward machines
+(RMs), state machine abstractions that induce subtasks based on the current
+task's rewards and dynamics. Our method provides agents with symbolic
+representations of optimal transitions from their current abstract state and
+rewards them for achieving these transitions. These representations are shared
+across tasks, allowing agents to exploit knowledge of previously encountered
+symbols and transitions, thus enhancing transfer. Empirical results show that
+our representations improve sample efficiency and few-shot transfer in a
+variety of domains.
+
+
+
+ comment: Proceedings of the 38th AAAI Conference on Artificial Intelligence
+ (AAAI), 2024
+
+ We present a novel approach to non-convex optimization with certificates,
+which handles smooth functions on the hypercube or on the torus. Unlike
+traditional methods that rely on algebraic properties, our algorithm exploits
+the regularity of the target function intrinsic in the decay of its Fourier
+spectrum. By defining a tractable family of models, we allow at the same time
+to obtain precise certificates and to leverage the advanced and powerful
+computational techniques developed to optimize neural networks. In this way the
+scalability of our approach is naturally enhanced by parallel computing with
+GPUs. Our approach, when applied to the case of polynomials of moderate
+dimensions but with thousands of coefficients, outperforms the state-of-the-art
+optimization methods with certificates, as the ones based on Lasserre's
+hierarchy, addressing problems intractable for the competitors.
+
+
+
+ comment: Edit affiliations and acknowledgments
+
+
+
+
+
+
+ ♻ ☆ Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data
+ Setting WACV 2024
+
+
+ Deep learning models are known to suffer from the problem of bias, and
+researchers have been exploring methods to address this issue. However, most of
+these methods require prior knowledge of the bias and are not always practical.
+In this paper, we focus on a more practical setting with no prior information
+about the bias. Generally, in this setting, there are a large number of
+bias-aligned samples that cause the model to produce biased predictions and a
+few bias-conflicting samples that do not conform to the bias. If the training
+data is limited, the influence of the bias-aligned samples may become even
+stronger on the model predictions, and we experimentally demonstrate that
+existing debiasing techniques suffer severely in such cases. In this paper, we
+examine the effects of unknown bias in small dataset regimes and present a
+novel approach to mitigate this issue. The proposed approach directly addresses
+the issue of the extremely low occurrence of bias-conflicting samples in
+limited data settings through the synthesis of hybrid samples that can be used
+to reduce the effect of bias. We perform extensive experiments on several
+benchmark datasets and experimentally demonstrate the effectiveness of our
+proposed approach in addressing any unknown bias in the presence of limited
+data. Specifically, our approach outperforms the vanilla, LfF, LDD, and DebiAN
+debiasing methods by absolute margins of 10.39%, 9.08%, 8.07%, and 9.67% when
+only 10% of the Corrupted CIFAR-10 Type 1 dataset is available with a
+bias-conflicting sample ratio of 0.05.
+
+
+
+ comment: Accepted in WACV 2024
+
+
+
+
+
+
+ ♻ ☆ Attribution-based Explanations that Provide Recourse Cannot be Robust
+
+
+
+
+
+
+
+
+ Hidde Fokkema, Rianne de Heide, Tim van Erven
+
+
+ Different users of machine learning methods require different explanations,
+depending on their goals. To make machine learning accountable to society, one
+important goal is to get actionable options for recourse, which allow an
+affected user to change the decision $f(x)$ of a machine learning system by
+making limited changes to its input $x$. We formalize this by providing a
+general definition of recourse sensitivity, which needs to be instantiated with
+a utility function that describes which changes to the decisions are relevant
+to the user. This definition applies to local attribution methods, which
+attribute an importance weight to each input feature. It is often argued that
+such local attributions should be robust, in the sense that a small change in
+the input $x$ that is being explained, should not cause a large change in the
+feature weights. However, we prove formally that it is in general impossible
+for any single attribution method to be both recourse sensitive and robust at
+the same time. It follows that there must always exist counterexamples to at
+least one of these properties. We provide such counterexamples for several
+popular attribution methods, including LIME, SHAP, Integrated Gradients and
+SmoothGrad. Our results also cover counterfactual explanations, which may be
+viewed as attributions that describe a perturbation of $x$. We further discuss
+possible ways to work around our impossibility result, for instance by allowing
+the output to consist of sets with multiple attributions, and we provide
+sufficient conditions for specific classes of continuous functions to be
+recourse sensitive. Finally, we strengthen our impossibility result for the
+restricted case where users are only able to change a single attribute of $x$,
+by providing an exact characterization of the functions $f$ to which
+impossibility applies.
+
+
+
+ comment: 32 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Instance-Conditional Timescales of Decay for Non-Stationary Learning AAAI 2024
+
+
+ Slow concept drift is a ubiquitous, yet under-studied problem in practical
+machine learning systems. In such settings, although recent data is more
+indicative of future data, naively prioritizing recent instances runs the risk
+of losing valuable information from the past. We propose an optimization-driven
+approach towards balancing instance importance over large training windows.
+First, we model instance relevance using a mixture of multiple timescales of
+decay, allowing us to capture rich temporal trends. Second, we learn an
+auxiliary scorer model that recovers the appropriate mixture of timescales as a
+function of the instance itself. Finally, we propose a nested optimization
+objective for learning the scorer, by which it maximizes forward transfer for
+the learned model. Experiments on a large real-world dataset of 39M photos over
+a 9 year period show upto 15% relative gains in accuracy compared to other
+robust learning baselines. We replicate our gains on two collections of
+real-world datasets for non-stationary learning, and extend our work to
+continual learning settings where, too, we beat SOTA methods by large margins.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Physics-informed Neural Network Estimation of Material Properties in
+ Soft Tissue Nonlinear Biomechanical Models
+
+
+
+
+
+
+
+
+ Federica Caforio, Francesco Regazzoni, Stefano Pagani, Elias Karabelas, Christoph Augustin, Gundolf Haase, Gernot Plank, Alfio Quarteroni
+
+
+ The development of biophysical models for clinical applications is rapidly
+advancing in the research community, thanks to their predictive nature and
+their ability to assist the interpretation of clinical data. However,
+high-resolution and accurate multi-physics computational models are
+computationally expensive and their personalisation involves fine calibration
+of a large number of parameters, which may be space-dependent, challenging
+their clinical translation. In this work, we propose a new approach which
+relies on the combination of physics-informed neural networks (PINNs) with
+three-dimensional soft tissue nonlinear biomechanical models, capable of
+reconstructing displacement fields and estimating heterogeneous
+patient-specific biophysical properties. The proposed learning algorithm
+encodes information from a limited amount of displacement and, in some cases,
+strain data, that can be routinely acquired in the clinical setting, and
+combines it with the physics of the problem, represented by a mathematical
+model based on partial differential equations, to regularise the problem and
+improve its convergence properties. Several benchmarks are presented to show
+the accuracy and robustness of the proposed method and its great potential to
+enable the robust and effective identification of patient-specific,
+heterogeneous physical properties, s.a. tissue stiffness properties. In
+particular, we demonstrate the capability of the PINN to detect the presence,
+location and severity of scar tissue, which is beneficial to develop
+personalised simulation models for disease diagnosis, especially for cardiac
+applications.
+
+
+
+
+
+
+
+
+ Jann Spiess, Vasilis Syrgkanis, Victor Yaneng Wang
+
+
+ Researchers often run resource-intensive randomized controlled trials (RCTs)
+to estimate the causal effects of interventions on outcomes of interest. Yet
+these outcomes are often noisy, and estimated overall effects can be small or
+imprecise. Nevertheless, we may still be able to produce reliable evidence of
+the efficacy of an intervention by finding subgroups with significant effects.
+In this paper, we propose a machine-learning method that is specifically
+optimized for finding such subgroups in noisy data. Unlike available methods
+for personalized treatment assignment, our tool is fundamentally designed to
+take significance testing into account: it produces a subgroup that is chosen
+to maximize the probability of obtaining a statistically significant positive
+treatment effect. We provide a computationally efficient implementation using
+decision trees and demonstrate its gain over selecting subgroups based on
+positive (estimated) treatment effects. Compared to standard tree-based
+regression and classification tools, this approach tends to yield higher power
+in detecting subgroups affected by the treatment.
+
+
+
+
+
+
+
+ ♻ ☆ Transformed Low-Rank Parameterization Can Help Robust Generalization for
+ Tensor Neural Networks NeurIPS 2023
+
+
+ Achieving efficient and robust multi-channel data learning is a challenging
+task in data science. By exploiting low-rankness in the transformed domain,
+i.e., transformed low-rankness, tensor Singular Value Decomposition (t-SVD) has
+achieved extensive success in multi-channel data representation and has
+recently been extended to function representation such as Neural Networks with
+t-product layers (t-NNs). However, it still remains unclear how t-SVD
+theoretically affects the learning behavior of t-NNs. This paper is the first
+to answer this question by deriving the upper bounds of the generalization
+error of both standard and adversarially trained t-NNs. It reveals that the
+t-NNs compressed by exact transformed low-rank parameterization can achieve a
+sharper adversarial generalization bound. In practice, although t-NNs rarely
+have exactly transformed low-rank weights, our analysis further shows that by
+adversarial training with gradient flow (GF), the over-parameterized t-NNs with
+ReLU activations are trained with implicit regularization towards transformed
+low-rank parameterization under certain conditions. We also establish
+adversarial generalization bounds for t-NNs with approximately transformed
+low-rank weights. Our analysis indicates that the transformed low-rank
+parameterization can promisingly enhance robust generalization for t-NNs.
+
+
+ Formal abductive explanations offer crucial guarantees of rigor and so are of
+interest in high-stakes uses of machine learning (ML). One drawback of
+abductive explanations is explanation size, justified by the cognitive limits
+of human decision-makers. Probabilistic abductive explanations (PAXps) address
+this limitation, but their theoretical and practical complexity makes their
+exact computation most often unrealistic. This paper proposes novel efficient
+algorithms for the computation of locally-minimal PXAps, which offer
+high-quality approximations of PXAps in practice. The experimental results
+demonstrate the practical efficiency of the proposed algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ Data-Juicer: A One-Stop Data Processing System for Large Language Models
+
+
+ The immense evolution in Large Language Models (LLMs) has underscored the
+importance of massive, heterogeneous, and high-quality data. A data recipe is a
+mixture of data from different sources for training LLMs, which plays a vital
+role in LLMs' performance. Existing open-source tools for LLM data processing
+are mostly tailored for specific data recipes. To continuously uncover the
+potential of LLMs, incorporate data from new sources, and improve LLMs'
+performance, we build a new system named Data-Juicer, with which we can
+efficiently generate diverse data recipes, explore different possibilities in
+forming data mixtures, and evaluate their effects on model performance.
+Different from traditional data-analytics pipelines, Data-Juicer faces some
+unique challenges. Firstly, the possible data sources for forming data recipes
+are truly heterogeneous and massive with various qualities. Secondly, it is
+extremely expensive to precisely evaluate data recipes' impact on LLMs'
+performance. Thirdly, the end users of Data-Juicer, model developers, need
+sufficient flexibility to configure and evaluate different data recipes.
+ Data-Juicer features a fine-grained abstraction of pipelines for constructing
+data recipes, with over 50 built-in operators for easy composition and
+extension. By incorporating visualization and auto-evaluation capabilities,
+Data-Juicer enables a timely feedback loop for both LLM pre-training and
+fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems
+for LLM training, evaluation, and distributed computing. The data recipes
+derived with Data-Juicer gain notable improvements on state-of-the-art LLMs, by
+up to 7.45% increase in averaged score across 16 LLM benchmarks and 17.5%
+higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and
+tutorials are released, calling for broader data-centric research on training
+and understanding LLMs.
+
+
+
+ comment: 20 Pages, 10 figures, 9 tables. The system, data recipes, and demos
+ are continuously maintained at https://github.com/alibaba/data-juicer
+
+
+
+
+
+
+ ♻ ☆ Fake detection in imbalance dataset by Semi-supervised learning with GAN
+
+
+ As social media continues to grow rapidly, the prevalence of harassment on
+these platforms has also increased. This has piqued the interest of researchers
+in the field of fake detection. Social media data, often forms complex graphs
+with numerous nodes, posing several challenges. These challenges and
+limitations include dealing with a significant amount of irrelevant features in
+matrices and addressing issues such as high data dispersion and an imbalanced
+class distribution within the dataset. To overcome these challenges and
+limitations, researchers have employed auto-encoders and a combination of
+semi-supervised learning with a GAN algorithm, referred to as SGAN. Our
+proposed method utilizes auto-encoders for feature extraction and incorporates
+SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN
+compensates for the limited availability of labeled data, making efficient use
+of the limited number of labeled instances. Multiple evaluation metrics were
+employed, including the Confusion Matrix and the ROC curve. The dataset was
+divided into training and testing sets, with 100 labeled samples for training
+and 1,000 samples for testing. The novelty of our research lies in applying
+SGAN to address the issue of imbalanced datasets in fake account detection. By
+optimizing the use of a smaller number of labeled instances and reducing the
+need for extensive computational power, our method offers a more efficient
+solution. Additionally, our study contributes to the field by achieving an 81%
+accuracy in detecting fake accounts using only 100 labeled samples. This
+demonstrates the potential of SGAN as a powerful tool for handling minority
+classes and addressing big data challenges in fake account detection.
+
+
+
+ comment: needed more investigation o final results
+
+
+
+
+
+
+ ♻ ☆ RED-PSM: Regularization by Denoising of Partially Separable Models for
+ Dynamic Imaging
+
+
+
+
+
+
+
+
+ Berk Iskender, Marc L. Klasky, Yoram Bresler
+
+
+ Dynamic imaging addresses the recovery of a time-varying 2D or 3D object at
+each time instant using its undersampled measurements. In particular, in the
+case of dynamic tomography, only a single projection at a single view angle may
+be available at a time, making the problem severely ill-posed. In this work, we
+propose an approach, RED-PSM, which combines for the first time two powerful
+techniques to address this challenging imaging problem. The first, are
+partially separable models, which have been used to efficiently introduce a
+low-rank prior for the spatio-temporal object. The second is the recent
+\textit{Regularization by Denoising (RED)}, which provides a flexible framework
+to exploit the impressive performance of state-of-the-art image denoising
+algorithms, for various inverse problems. We propose a partially separable
+objective with RED and a computationally efficient and scalable optimization
+scheme with variable splitting and ADMM. Theoretical analysis proves the
+convergence of our objective to a value corresponding to a stationary point
+satisfying the first-order optimality conditions. Convergence is accelerated by
+a particular projection-domain-based initialization. We demonstrate the
+performance and computational improvements of our proposed RED-PSM with a
+learned image denoiser by comparing it to a recent deep-prior-based method
+known as TD-DIP. Although the main focus is on dynamic tomography, we also show
+performance advantages of RED-PSM in a cardiac dynamic MRI setting.
+
+
+
+
+
+
+
+ ♻ ☆ Detecting fake accounts through Generative Adversarial Network in online
+ social media
+
+
+
+
+
+
+
+
+ Jinus Bordbar, Mohammadreza Mohammadrezaie, Saman Ardalan, Mohammad Ebrahim Shiri
+
+
+ Online social media is integral to human life, facilitating messaging,
+information sharing, and confidential communication while preserving privacy.
+Platforms like Twitter, Instagram, and Facebook exemplify this phenomenon.
+However, users face challenges due to network anomalies, often stemming from
+malicious activities such as identity theft for financial gain or harm. This
+paper proposes a novel method using user similarity measures and the Generative
+Adversarial Network (GAN) algorithm to identify fake user accounts in the
+Twitter dataset. Despite the problem's complexity, the method achieves an AUC
+rate of 80\% in classifying and detecting fake accounts. Notably, the study
+builds on previous research, highlighting advancements and insights into the
+evolving landscape of anomaly detection in online social networks.
+
+
+
+ comment: needed more investigation on final results
+
+
+
+
+
+
+ ♻ ☆ Exponentially Improved Efficient and Accurate Machine Learning for
+ Quantum Many-body States with Provable Guarantees
+
+
+
+
+
+
+
+
+ Yanming Che, Clemens Gneiting, Franco Nori
+
+
+ Solving the ground state and the ground-state properties of quantum many-body
+systems is generically a hard task for classical algorithms. For a family of
+Hamiltonians defined on an $m$-dimensional space of physical parameters, the
+ground state and its properties at an arbitrary parameter configuration can be
+predicted via a machine learning protocol up to a prescribed prediction error
+$\varepsilon$, provided that a sample set (of size $N$) of the states can be
+efficiently prepared and measured. In a recent work [Huang et al., Science 377,
+eabk3333 (2022)], a rigorous guarantee for such a generalization was proved.
+Unfortunately, an exponential scaling for the provable sample complexity,
+$N=m^{{\cal{O}}\left(\frac{1}{\varepsilon}\right)}$, was found to be universal
+for generic gapped Hamiltonians. This result applies to the situation where the
+dimension of the parameter space is large while the scaling with the accuracy
+is not an urgent factor. In this work, we consider an alternative scenario
+where $m$ is a finite, not necessarily large constant while the scaling with
+the prediction error becomes the central concern. By jointly preserving the
+fundamental properties of density matrices in the learning protocol and
+utilizing the continuity of quantum states in the parameter range of interest,
+we rigorously obtain a polynomial sample complexity for predicting quantum
+many-body states and their properties, with respect to the uniform prediction
+error $\varepsilon$ and the number of qubits $n$. Moreover, if restricted to
+learning local quantum-state properties, the number of samples with respect to
+$n$ can be further reduced exponentially. Our results provide theoretical
+guarantees for efficient and accurate learning of quantum many-body states and
+their properties, with model-independent applications not restricted to ground
+states of gapped Hamiltonians.
+
+
+
+ comment: 8 + 13 pages, 2 + 1 figures; With supplemental material (SM).
+ Improved presentation to highlight our new findings; Added numerical
+ demonstration with a quantum XY model; Added Sec. II in the SM
+
+
+
+
+
+
+ ♻ ☆ Universal Approximation Property of Random Neural Networks
+
+
+ In this paper, we study random neural networks which are single-hidden-layer
+feedforward neural networks whose weights and biases are randomly initialized.
+After this random initialization, only the linear readout needs to be trained,
+which can be performed efficiently, e.g., by the least squares method. By
+viewing random neural networks as Banach space-valued random variables, we
+prove a universal approximation theorem within a large class of Bochner spaces.
+Hereby, the corresponding Banach space can be significantly more general than
+the space of continuous functions over a compact subset of a Euclidean space,
+namely, e.g., an $L^p$-space or a Sobolev space, where the latter includes the
+approximation of the derivatives. Moreover, we derive approximation rates and
+an explicit algorithm to learn a deterministic function by a random neural
+network. In addition, we provide a full error analysis and study when random
+neural networks overcome the curse of dimensionality in the sense that the
+training costs scale at most polynomially in the input and output dimension.
+Furthermore, we show in two numerical examples the empirical advantages of
+random neural networks compared to fully trained deterministic neural networks.
+
+
+ Graph neural networks (GNNs) have shown remarkable success in learning
+representations for graph-structured data. However, GNNs still face challenges
+in modeling complex phenomena that involve feature transportation. In this
+paper, we propose a novel GNN architecture inspired by
+Advection-Diffusion-Reaction systems, called ADR-GNN. Advection models feature
+transportation, while diffusion captures the local smoothing of features, and
+reaction represents the non-linear transformation between feature channels. We
+provide an analysis of the qualitative behavior of ADR-GNN, that shows the
+benefit of combining advection, diffusion, and reaction. To demonstrate its
+efficacy, we evaluate ADR-GNN on real-world node classification and
+spatio-temporal datasets, and show that it improves or offers competitive
+performance compared to state-of-the-art networks.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ A Survey of Reasoning with Foundation Models: Concepts, Methodologies,
+ and Outlook
+
+
+ Reasoning, a crucial ability for complex problem-solving, plays a pivotal
+role in various real-world settings such as negotiation, medical diagnosis, and
+criminal investigation. It serves as a fundamental methodology in the field of
+Artificial General Intelligence (AGI). With the ongoing development of
+foundation models, there is a growing interest in exploring their abilities in
+reasoning tasks. In this paper, we introduce seminal foundation models proposed
+or adaptable for reasoning, highlighting the latest advancements in various
+reasoning tasks, methods, and benchmarks. We then delve into the potential
+future directions behind the emergence of reasoning abilities within foundation
+models. We also discuss the relevance of multimodal learning, autonomous
+agents, and super alignment in the context of reasoning. By discussing these
+future research directions, we hope to inspire researchers in their exploration
+of this field, stimulate further advancements in reasoning with foundation
+models, and contribute to the development of AGI.
+
+
+ Relational inference aims to identify interactions between parts of a
+dynamical system from the observed dynamics. Current state-of-the-art methods
+fit the dynamics with a graph neural network (GNN) on a learnable graph. They
+use one-step message-passing GNNs -- intuitively the right choice since
+non-locality of multi-step or spectral GNNs may confuse direct and indirect
+interactions. But the \textit{effective} interaction graph depends on the
+sampling rate and it is rarely localized to direct neighbors, leading to poor
+local optima for the one-step model. In this work, we propose a \textit{graph
+dynamics prior} (GDP) for relational inference. GDP constructively uses error
+amplification in non-local polynomial filters to steer the solution to the
+ground-truth graph. To deal with non-uniqueness, GDP simultaneously fits a
+``shallow'' one-step model and a polynomial multi-step model with shared graph
+topology. Experiments show that GDP reconstructs graphs far more accurately
+than earlier methods, with remarkable robustness to under-sampling. Since
+appropriate sampling rates for unknown dynamical systems are not known a
+priori, this robustness makes GDP suitable for real applications in scientific
+machine learning. Reproducible code is available at
+https://github.com/DaDaCheng/GDP.
+
+
+
+
+
+
+
+ ♻ ☆ Data-driven Piecewise Affine Decision Rules for Stochastic Programming
+ with Covariate Information
+
+
+ Focusing on stochastic programming (SP) with covariate information, this
+paper proposes an empirical risk minimization (ERM) method embedded within a
+nonconvex piecewise affine decision rule (PADR), which aims to learn the direct
+mapping from features to optimal decisions. We establish the nonasymptotic
+consistency result of our PADR-based ERM model for unconstrained problems and
+asymptotic consistency result for constrained ones. To solve the nonconvex and
+nondifferentiable ERM problem, we develop an enhanced stochastic
+majorization-minimization algorithm and establish the asymptotic convergence to
+(composite strong) directional stationarity along with complexity analysis. We
+show that the proposed PADR-based ERM method applies to a broad class of
+nonconvex SP problems with theoretical consistency guarantees and computational
+tractability. Our numerical study demonstrates the superior performance of
+PADR-based ERM methods compared to state-of-the-art approaches under various
+settings, with significantly lower costs, less computation time, and robustness
+to feature dimensions and nonlinearity of the underlying dependency.
+
+
+ We propose a differentiable imaging framework to address uncertainty in
+measurement coordinates such as sensor locations and projection angles. We
+formulate the problem as measurement interpolation at unknown nodes supervised
+through the forward operator. To solve it we apply implicit neural networks,
+also known as neural fields, which are naturally differentiable with respect to
+the input coordinates. We also develop differentiable spline interpolators
+which perform as well as neural networks, require less time to optimize and
+have well-understood properties. Differentiability is key as it allows us to
+jointly fit a measurement representation, optimize over the uncertain
+measurement coordinates, and perform image reconstruction which in turn ensures
+consistent calibration. We apply our approach to 2D and 3D computed tomography,
+and show that it produces improved reconstructions compared to baselines that
+do not account for the lack of calibration. The flexibility of the proposed
+framework makes it easy to extend to almost arbitrary imaging problems.
+
+
+
+
+
+
+
+ ♻ ☆ On the Tradeoff between Privacy Preservation and Byzantine-Robustness in
+ Decentralized Learning
+
+
+ This paper jointly considers privacy preservation and Byzantine-robustness in
+decentralized learning. In a decentralized network, honest-but-curious agents
+faithfully follow the prescribed algorithm, but expect to infer their
+neighbors' private data from messages received during the learning process,
+while dishonest-and-Byzantine agents disobey the prescribed algorithm, and
+deliberately disseminate wrong messages to their neighbors so as to bias the
+learning process. For this novel setting, we investigate a generic
+privacy-preserving and Byzantine-robust decentralized stochastic gradient
+descent (SGD) framework, in which Gaussian noise is injected to preserve
+privacy and robust aggregation rules are adopted to counteract Byzantine
+attacks. We analyze its learning error and privacy guarantee, discovering an
+essential tradeoff between privacy preservation and Byzantine-robustness in
+decentralized learning -- the learning error caused by defending against
+Byzantine attacks is exacerbated by the Gaussian noise added to preserve
+privacy. For a class of state-of-the-art robust aggregation rules, we give
+unified analysis of the "mixing abilities". Building upon this analysis, we
+reveal how the "mixing abilities" affect the tradeoff between privacy
+preservation and Byzantine-robustness. The theoretical results provide
+guidelines for achieving a favorable tradeoff with proper design of robust
+aggregation rules. Numerical experiments are conducted and corroborate our
+theoretical findings.
+
+
+
+
+
+
+
+ ♻ ☆ Invariant Random Forest: Tree-Based Model Solution for OOD
+ Generalization AAAI
+
+
+ Out-Of-Distribution (OOD) generalization is an essential topic in machine
+learning. However, recent research is only focusing on the corresponding
+methods for neural networks. This paper introduces a novel and effective
+solution for OOD generalization of decision tree models, named Invariant
+Decision Tree (IDT). IDT enforces a penalty term with regard to the
+unstable/varying behavior of a split across different environments during the
+growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is
+constructed. Our proposed method is motivated by a theoretical result under
+mild conditions, and validated by numerical tests with both synthetic and real
+datasets. The superior performance compared to non-OOD tree models implies that
+considering OOD generalization for tree models is absolutely necessary and
+should be given more attention.
+
+
+
+ comment: AAAI Conference on Artificial Intelligence, 2024
+
+
+
+
+
+
+ ♻ ☆ Transformer as Linear Expansion of Learngene
+
+
+ We propose expanding the shared Transformer module to produce and initialize
+Transformers of varying depths, enabling adaptation to diverse resource
+constraints. Drawing an analogy to genetic expansibility, we term such module
+as learngene. To identify the expansion mechanism, we delve into the
+relationship between the layer's position and its corresponding weight value,
+and find that linear function appropriately approximates this relationship.
+Building on this insight, we present Transformer as Linear Expansion of
+learnGene (TLEG), a novel approach for flexibly producing and initializing
+Transformers of diverse depths. Specifically, to learn learngene, we firstly
+construct an auxiliary Transformer linearly expanded from learngene, after
+which we train it through employing soft distillation. Subsequently, we can
+produce and initialize Transformers of varying depths via linearly expanding
+the well-trained learngene, thereby supporting diverse downstream scenarios.
+Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable
+or better performance in contrast to many individual models trained from
+scratch, while reducing around 2x training cost. When transferring to several
+downstream classification datasets, TLEG surpasses existing initialization
+methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100).
+Under the situation where we need to produce models of varying depths adapting
+for different resource constraints, TLEG achieves comparable results while
+reducing around 19x parameters stored to initialize these models and around 5x
+pre-training costs, in contrast to the pre-training and fine-tuning approach.
+When transferring a fixed set of parameters to initialize different models,
+TLEG presents better flexibility and competitive performance while reducing
+around 2.9x parameters stored to initialize, compared to the pre-training
+approach.
+
+
+
+
+
+
+
+ ♻ ☆ MAPTree: Beating "Optimal" Decision Trees with Bayesian Decision Trees
+
+
+
+
+
+
+
+
+ Colin Sullivan, Mo Tiwari, Sebastian Thrun
+
+
+ Decision trees remain one of the most popular machine learning models today,
+largely due to their out-of-the-box performance and interpretability. In this
+work, we present a Bayesian approach to decision tree induction via maximum a
+posteriori inference of a posterior distribution over trees. We first
+demonstrate a connection between maximum a posteriori inference of decision
+trees and AND/OR search. Using this connection, we propose an AND/OR search
+algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori
+tree. Lastly, we demonstrate the empirical performance of the maximum a
+posteriori tree both on synthetic data and in real world settings. On 16 real
+world datasets, MAPTree either outperforms baselines or demonstrates comparable
+performance but with much smaller trees. On a synthetic dataset, MAPTree also
+demonstrates greater robustness to noise and better generalization than
+existing approaches. Finally, MAPTree recovers the maxiumum a posteriori tree
+faster than existing sampling approaches and, in contrast with those
+algorithms, is able to provide a certificate of optimality. The code for our
+experiments is available at https://github.com/ThrunGroup/maptree.
+
+
+
+ comment: 19 pages
+
+
+
+
+
+
+ ♻ ☆ Temporal Conditioning Spiking Latent Variable Models of the Neural
+ Response to Natural Visual Scenes NeurIPS 2023
+
+
+ Developing computational models of neural response is crucial for
+understanding sensory processing and neural computations. Current
+state-of-the-art neural network methods use temporal filters to handle temporal
+dependencies, resulting in an unrealistic and inflexible processing paradigm.
+Meanwhile, these methods target trial-averaged firing rates and fail to capture
+important features in spike trains. This work presents the temporal
+conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural
+response to natural visual stimuli. We use spiking neurons to produce spike
+outputs that directly match the recorded trains. This approach helps to avoid
+losing information embedded in the original spike trains. We exclude the
+temporal dimension from the model parameter space and introduce a temporal
+conditioning operation to allow the model to adaptively explore and exploit
+temporal dependencies in stimuli sequences in a {\it natural paradigm}. We show
+that TeCoS-LVM models can produce more realistic spike activities and
+accurately fit spike statistics than powerful alternatives. Additionally,
+learned TeCoS-LVM models can generalize well to longer time scales. Overall,
+while remaining computationally tractable, our model effectively captures key
+features of neural coding systems. It thus provides a useful tool for building
+accurate predictive computational accounts for various sensory perception
+circuits.
+
+
+
+
+
+
+
+ ♻ ☆ The Power of Contrast for Feature Learning: A Theoretical Analysis
+
+
+
+
+
+
+
+
+ Wenlong Ji, Zhun Deng, Ryumei Nakada, James Zou, Linjun Zhang
+
+
+ Contrastive learning has achieved state-of-the-art performance in various
+self-supervised learning tasks and even outperforms its supervised counterpart.
+Despite its empirical success, theoretical understanding of the superiority of
+contrastive learning is still limited. In this paper, under linear
+representation settings, (i) we provably show that contrastive learning
+outperforms the standard autoencoders and generative adversarial networks, two
+classical generative unsupervised learning methods, for both feature recovery
+and in-domain downstream tasks; (ii) we also illustrate the impact of labeled
+data in supervised contrastive learning. This provides theoretical support for
+recent findings that contrastive learning with labels improves the performance
+of learned representations in the in-domain downstream task, but it can harm
+the performance in transfer learning. We verify our theory with numerical
+experiments.
+
+
+
+ comment: 78 pages, accepted by JMLR
+
+
+
+
+
+
+ ♻ ☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+
+
+
+
+
+ ♻ ☆ PMET: Precise Model Editing in a TransformerAAAI24
+
+
+
+
+
+
+
+
+ Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu
+
+
+ Model editing techniques modify a minor proportion of knowledge in Large
+Language Models (LLMs) at a relatively low cost, which have demonstrated
+notable success. Existing methods assume Transformer Layer (TL) hidden states
+are values of key-value memories of the Feed-Forward Network (FFN). They
+usually optimize the TL hidden states to memorize target knowledge and use it
+to update the weights of the FFN in LLMs. However, the information flow of TL
+hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN,
+and residual connections. Existing methods neglect the fact that the TL hidden
+states contains information not specifically required for FFN. Consequently,
+the performance of model editing decreases. To achieve more precise model
+editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes
+certain general knowledge extraction patterns. This implies that MHSA weights
+do not require updating when new knowledge is introduced. Based on above
+findings, we introduce PMET, which simultaneously optimizes Transformer
+Component (TC, namely MHSA and FFN) hidden states, while only using the
+optimized TC hidden states of FFN to precisely update FFN weights. Our
+experiments demonstrate that PMET exhibits state-of-the-art performance on both
+the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the
+effectiveness of our enhancements, further reinforcing the finding that the
+MHSA encodes certain general knowledge extraction patterns and indicating its
+storage of a small amount of factual knowledge. Our code is available at
+https://github.com/xpq-tech/PMET.
+
+
+
+ comment: Accepted in AAAI24
+
+
+
+
+
+
+ ♻ ☆ Differentially Private Over-the-Air Federated Learning Over MIMO Fading
+ Channels
+
+
+ Federated learning (FL) enables edge devices to collaboratively train machine
+learning models, with model communication replacing direct data uploading.
+While over-the-air model aggregation improves communication efficiency,
+uploading models to an edge server over wireless networks can pose privacy
+risks. Differential privacy (DP) is a widely used quantitative technique to
+measure statistical data privacy in FL. Previous research has focused on
+over-the-air FL with a single-antenna server, leveraging communication noise to
+enhance user-level DP. This approach achieves the so-called "free DP" by
+controlling transmit power rather than introducing additional DP-preserving
+mechanisms at devices, such as adding artificial noise. In this paper, we study
+differentially private over-the-air FL over a multiple-input multiple-output
+(MIMO) fading channel. We show that FL model communication with a
+multiple-antenna server amplifies privacy leakage as the multiple-antenna
+server employs separate receive combining for model aggregation and information
+inference. Consequently, relying solely on communication noise, as done in the
+multiple-input single-output system, cannot meet high privacy requirements, and
+a device-side privacy-preserving mechanism is necessary for optimal DP design.
+We analyze the learning convergence and privacy loss of the studied FL system
+and propose a transceiver design algorithm based on alternating optimization.
+Numerical results demonstrate that the proposed method achieves a better
+privacy-learning trade-off compared to prior work.
+
+
+
+ comment: This work has been accepted by the IEEE for possible publication.
+ Copyright may be transferred without notice, after which this version may no
+ longer be accessible
+
+
+
+
+
+
+ ♻ ☆ Use of Deep Neural Networks for Uncertain Stress Functions with
+ Extensions to Impact Mechanics
+
+
+
+
+
+
+
+
+ Garrett Blum, Ryan Doris, Diego Klabjan, Horacio Espinosa, Ron Szalkowski
+
+
+ Stress-strain curves, or more generally, stress functions, are an extremely
+important characterization of a material's mechanical properties. However,
+stress functions are often difficult to derive and are narrowly tailored to a
+specific material. Further, large deformations, high strain-rates, temperature
+sensitivity, and effect of material parameters compound modeling challenges. We
+propose a generalized deep neural network approach to model stress as a state
+function with quantile regression to capture uncertainty. We extend these
+models to uniaxial impact mechanics using stochastic differential equations to
+demonstrate a use case and provide a framework for implementing this
+uncertainty-aware stress function. We provide experiments benchmarking our
+approach against leading constitutive, machine learning, and transfer learning
+approaches to stress and impact mechanics modeling on publicly available and
+newly presented data sets. We also provide a framework to optimize material
+parameters given multiple competing impact scenarios.
+
+
+
+ comment: Index Terms: Stress, Uncertainty, Impact Mechanics, Deep Learning,
+ Neural Network. 10 pages, 9 figures, 6 tables
+
+
+
+
+
+
+ ♻ ☆ DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting SC
+
+
+ We introduce DeSCo, a scalable neural deep subgraph counting pipeline,
+designed to accurately predict both the count and occurrence position of
+queries on target graphs post single training. Firstly, DeSCo uses a novel
+canonical partition and divides the large target graph into small neighborhood
+graphs, greatly reducing the count variation while guaranteeing no missing or
+double-counting. Secondly, neighborhood counting uses an expressive
+subgraph-based heterogeneous graph neural network to accurately count in each
+neighborhood. Finally, gossip propagation propagates neighborhood counts with
+learnable gates to harness the inductive biases of motif counts. DeSCo is
+evaluated on eight real-world datasets from various domains. It outperforms
+state-of-the-art neural methods with 137x improvement in the mean squared error
+of count prediction, while maintaining the polynomial runtime complexity. Our
+open source project is at https://github.com/fuvty/DeSCo.
+
+
+
+ comment: 8 pages main text, 2 pages references, 11 pages appendix; open source
+ at https://github.com/fuvty/DeSCo
+
+
+
+
+
+
+ ♻ ☆ Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse
+ Problems
+
+
+ Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial
+technologies in the field of medical imaging. Score-based models have proven to
+be effective in addressing different inverse problems encountered in CT and
+MRI, such as sparse-view CT and fast MRI reconstruction. However, these models
+face challenges in achieving accurate three dimensional (3D) volumetric
+reconstruction. The existing score-based models primarily focus on
+reconstructing two dimensional (2D) data distribution, leading to
+inconsistencies between adjacent slices in the reconstructed 3D volumetric
+images. To overcome this limitation, we propose a novel two-and-a-half order
+score-based model (TOSM). During the training phase, our TOSM learns data
+distributions in 2D space, which reduces the complexity of training compared to
+directly working on 3D volumes. However, in the reconstruction phase, the TOSM
+updates the data distribution in 3D space, utilizing complementary scores along
+three directions (sagittal, coronal, and transaxial) to achieve a more precise
+reconstruction. The development of TOSM is built on robust theoretical
+principles, ensuring its reliability and efficacy. Through extensive
+experimentation on large-scale sparse-view CT and fast MRI datasets, our method
+demonstrates remarkable advancements and attains state-of-the-art results in
+solving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively
+addresses the inter-slice inconsistency issue, resulting in high-quality 3D
+volumetric reconstruction.
+
+
+
+ comment: 10 pages, 13 figures
+
+
+
+
+
+
+ ♻ ☆ Agglomerative Federated Learning: Empowering Larger Model Training via
+ End-Edge-Cloud Collaboration
+
+
+ Federated Learning (FL) enables training Artificial Intelligence (AI) models
+over end devices without compromising their privacy. As computing tasks are
+increasingly performed by a combination of cloud, edge, and end devices, FL can
+benefit from this End-Edge-Cloud Collaboration (EECC) paradigm to achieve
+collaborative device-scale expansion with real-time access. Although
+Hierarchical Federated Learning (HFL) supports multi-tier model aggregation
+suitable for EECC, prior works assume the same model structure on all computing
+nodes, constraining the model scale by the weakest end devices. To address this
+issue, we propose Agglomerative Federated Learning (FedAgg), which is a novel
+EECC-empowered FL framework that allows the trained models from end, edge, to
+cloud to grow larger in size and stronger in generalization ability. FedAgg
+recursively organizes computing nodes among all tiers based on Bridge Sample
+Based Online Distillation Protocol (BSBODP), which enables every pair of
+parent-child computing nodes to mutually transfer and distill knowledge
+extracted from generated bridge samples. This design enhances the performance
+by exploiting the potential of larger models, with privacy constraints of FL
+and flexibility requirements of EECC both satisfied. Experiments under various
+settings demonstrate that FedAgg outperforms state-of-the-art methods by an
+average of 4.53\% accuracy gains and remarkable improvements in convergence
+rate.
+
+
+
+ comment: Accepted by IEEE International Conference on Computer Communications
+ (INFOCOM), 2024
+
+
+
+
+
+
+ ♻ ☆ Learning to Simulate Tree-Branch Dynamics for Manipulation
+
+
+
+
+
+
+
+
+ Jayadeep Jacob, Tirthankar Bandyopadhyay, Jason Williams, Paulo Borges, Fabio Ramos
+
+
+ We propose to use a simulation driven inverse inference approach to model the
+dynamics of tree branches under manipulation. Learning branch dynamics and
+gaining the ability to manipulate deformable vegetation can help with
+occlusion-prone tasks, such as fruit picking in dense foliage, as well as
+moving overhanging vines and branches for navigation in dense vegetation. The
+underlying deformable tree geometry is encapsulated as coarse spring
+abstractions executed on parallel, non-differentiable simulators. The implicit
+statistical model defined by the simulator, reference trajectories obtained by
+actively probing the ground truth, and the Bayesian formalism, together guide
+the spring parameter posterior density estimation. Our non-parametric inference
+algorithm, based on Stein Variational Gradient Descent, incorporates
+biologically motivated assumptions into the inference process as neural network
+driven learnt joint priors; moreover, it leverages the finite difference scheme
+for gradient approximations. Real and simulated experiments confirm that our
+model can predict deformation trajectories, quantify the estimation
+uncertainty, and it can perform better when base-lined against other inference
+algorithms, particularly from the Monte Carlo family. The model displays strong
+robustness properties in the presence of heteroscedastic sensor noise;
+furthermore, it can generalise to unseen grasp locations.
+
+
+ We present latent combinational game design -- an approach for generating
+playable games that blend a given set of games in a desired combination using
+deep generative latent variable models. We use Gaussian Mixture Variational
+Autoencoders (GMVAEs) which model the VAE latent space via a mixture of
+Gaussian components. Through supervised training, each component encodes levels
+from one game and lets us define blended games as linear combinations of these
+components. This enables generating new games that blend the input games as
+well as controlling the relative proportions of each game in the blend. We also
+extend prior blending work using conditional VAEs and compare against the GMVAE
+and additionally introduce a hybrid conditional GMVAE (CGMVAE) architecture
+which lets us generate whole blended levels and layouts. Results show that
+these approaches can generate playable games that blend the input games in
+specified combinations. We use both platformers and dungeon-based games to
+demonstrate our results.
+
+
+
+ comment: 10 pages, 9 figures, IEEE Transactions on Games
+
+
+
+
+
+
+
+ Antonin Schrab, Benjamin Guedj, Arthur Gretton
+
+
+ We investigate properties of goodness-of-fit tests based on the Kernel Stein
+Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg,
+which aggregates multiple tests with different kernels. KSDAgg avoids splitting
+the data to perform kernel selection (which leads to a loss in test power), and
+rather maximises the test power over a collection of kernels. We provide
+non-asymptotic guarantees on the power of KSDAgg: we show it achieves the
+smallest uniform separation rate of the collection, up to a logarithmic term.
+For compactly supported densities with bounded model score function, we derive
+the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the
+minimax optimal rate over unrestricted Sobolev balls, up to an iterated
+logarithmic term. KSDAgg can be computed exactly in practice as it relies
+either on a parametric bootstrap or on a wild bootstrap to estimate the
+quantiles and the level corrections. In particular, for the crucial choice of
+bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such
+as median or standard deviation) or to data splitting. We find on both
+synthetic and real-world data that KSDAgg outperforms other state-of-the-art
+quadratic-time adaptive KSD-based goodness-of-fit testing procedures.
+
+
+
+
+
+
+
+ ♻ ☆ Consensus, dissensus and synergy between clinicians and specialist
+ foundation models in radiology report generation
+
+
+
+
+
+
+
+
+ Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Yossi Matias, Joelle Barral, Ali Eslami, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
+
+
+ Radiology reports are an instrumental part of modern medicine, informing key
+clinical decisions such as diagnosis and treatment. The worldwide shortage of
+radiologists, however, restricts access to expert care and imposes heavy
+workloads, contributing to avoidable errors and delays in report delivery.
+While recent progress in automated report generation with vision-language
+models offer clear potential in ameliorating the situation, the path to
+real-world adoption has been stymied by the challenge of evaluating the
+clinical quality of AI-generated reports. In this study, we build a
+state-of-the-art report generation system for chest radiographs,
+$\textit{Flamingo-CXR}$, by fine-tuning a well-known vision-language foundation
+model on radiology data. To evaluate the quality of the AI-generated reports, a
+group of 16 certified radiologists provide detailed evaluations of AI-generated
+and human written reports for chest X-rays from an intensive care setting in
+the United States and an inpatient setting in India. At least one radiologist
+(out of two per case) preferred the AI report to the ground truth report in
+over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated
+reports that contain errors, the most frequently cited reasons were related to
+the location and finding, whereas for human written reports, most mistakes were
+related to severity and finding. This disparity suggested potential
+complementarity between our AI system and human experts, prompting us to
+develop an assistive scenario in which Flamingo-CXR generates a first-draft
+report, which is subsequently revised by a clinician. This is the first
+demonstration of clinician-AI collaboration for report writing, and the
+resultant reports are assessed to be equivalent or preferred by at least one
+radiologist to reports written by experts alone in 80$\%$ of in-patient cases
+and 60$\%$ of intensive care cases.
+
+
+
+
+
+
+
+
+ Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
+
+
+ In recent years, the deployment of large-scale pre-trained models in
+audio-visual downstream tasks has yielded remarkable outcomes. However, these
+models, primarily trained on single-modality unconstrained datasets, still
+encounter challenges in feature extraction for multi-modal tasks, leading to
+suboptimal performance. This limitation arises due to the introduction of
+irrelevant modality-specific information during encoding, which adversely
+affects the performance of downstream tasks. To address this challenge, this
+paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention
+mechanism. This mechanism leverages audio and visual modalities as soft prompts
+to dynamically adjust the parameters of pre-trained models based on the current
+multi-modal input features. Specifically, the DG-SCT module incorporates
+trainable cross-modal interaction layers into pre-trained audio-visual
+encoders, allowing adaptive extraction of crucial information from the current
+modality across spatial, channel, and temporal dimensions, while preserving the
+frozen parameters of large-scale pre-trained models. Experimental evaluations
+demonstrate that our proposed model achieves state-of-the-art results across
+multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our
+model exhibits promising performance in challenging few-shot and zero-shot
+scenarios. The source code and pre-trained models are available at
+https://github.com/haoyi-duan/DG-SCT.
+
+
+
+ comment: Accepted to NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling
+
+
+
+
+
+
+
+
+ Wei-Yin Ko, Daniel D'souza, Karina Nguyen, Randall Balestriero, Sara Hooker
+
+
+ Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way
+to improve top-line metrics and to outperform a larger single model. In this
+work, we go beyond top-line metrics and instead explore the impact of
+ensembling on subgroup performances. Surprisingly, we observe that even with a
+simple homogeneous ensemble -- all the individual DNNs share the same training
+set, architecture, and design choices -- the minority group performance
+disproportionately improves with the number of models compared to the majority
+group, i.e. fairness naturally emerges from ensembling. Even more surprising,
+we find that this gain keeps occurring even when a large number of models is
+considered, e.g. $20$, despite the fact that the average performance of the
+ensemble plateaus with fewer models. Our work establishes that simple DNN
+ensembles can be a powerful tool for alleviating disparate impact from DNN
+classifiers, thus curbing algorithmic harm. We also explore why this is the
+case. We find that even in homogeneous ensembles, varying the sources of
+stochasticity through parameter initialization, mini-batch sampling, and
+data-augmentation realizations, results in different fairness outcomes.
+
+
+ As larger deep learning models are hard to interpret, there has been a recent
+focus on generating explanations of these black-box models. In contrast, we may
+have apriori explanations of how models should behave. In this paper, we
+formalize this notion as learning from explanation constraints and provide a
+learning theoretic framework to analyze how such explanations can improve the
+learning of our models. One may naturally ask, "When would these explanations
+be helpful?" Our first key contribution addresses this question via a class of
+models that satisfies these explanation constraints in expectation over new
+data. We provide a characterization of the benefits of these models (in terms
+of the reduction of their Rademacher complexities) for a canonical class of
+explanations given by gradient information in the settings of both linear
+models and two layer neural networks. In addition, we provide an algorithmic
+solution for our framework, via a variational approximation that achieves
+better performance and satisfies these constraints more frequently, when
+compared to simpler augmented Lagrangian methods to incorporate these
+explanations. We demonstrate the benefits of our approach over a large array of
+synthetic and real-world experiments.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ TacoGFN: Target Conditioned GFlowNet for Structure-Based Drug Design NeurIPS 2023
+
+
+
+
+
+
+
+
+ Tony Shen, Mohit Pandey, Jason Smith, Artem Cherkasov, Martin Ester
+
+
+ We seek to automate the generation of drug-like compounds conditioned to
+specific protein pocket targets. Most current methods approximate the
+protein-molecule distribution of a finite dataset and, therefore struggle to
+generate molecules with significant binding improvement over the training
+dataset. We instead frame the pocket-conditioned molecular generation task as
+an RL problem and develop TacoGFN, a target conditional Generative Flow Network
+model. Our method is explicitly encouraged to generate molecules with desired
+properties as opposed to fitting on a pre-existing data distribution. To this
+end, we develop transformer-based docking score prediction to speed up docking
+score computation and propose TacoGFN to explore molecule space efficiently.
+Furthermore, we incorporate several rounds of active learning where generated
+samples are queried using a docking oracle to improve the docking score
+prediction. This approach allows us to accurately explore as much of the
+molecule landscape as we can afford computationally. Empirically, molecules
+generated using TacoGFN and its variants significantly outperform all baseline
+methods across every property (Docking score, QED, SA, Lipinski), while being
+orders of magnitude faster.
+
+
+
+ comment: Accepted at NeurIPS 2023 AID3 and at NeurIPS 2023 GenBio as Spotlight
+
+
+
+
+
+
+ ♻ ☆ ConSequence: Synthesizing Logically Constrained Sequences for Electronic
+ Health Record Generation
+
+
+
+
+
+
+
+
+ Brandon Theodorou, Shrusti Jain, Cao Xiao, Jimeng Sun
+
+
+ Generative models can produce synthetic patient records for analytical tasks
+when real data is unavailable or limited. However, current methods struggle
+with adhering to domain-specific knowledge and removing invalid data. We
+present ConSequence, an effective approach to integrating domain knowledge into
+sequential generative neural network outputs. Our rule-based formulation
+includes temporal aggregation and antecedent evaluation modules, ensured by an
+efficient matrix multiplication formulation, to satisfy hard and soft logical
+constraints across time steps. Existing constraint methods often fail to
+guarantee constraint satisfaction, lack the ability to handle temporal
+constraints, and hinder the learning and computational efficiency of the model.
+In contrast, our approach efficiently handles all types of constraints with
+guaranteed logical coherence. We demonstrate ConSequence's effectiveness in
+generating electronic health records, outperforming competitors in achieving
+complete temporal and spatial constraint satisfaction without compromising
+runtime performance or generative quality. Specifically, ConSequence
+successfully prevents all rule violations while improving the model quality in
+reducing its test perplexity by 5% and incurring less than a 13% slowdown in
+generation speed compared to an unconstrained model.
+
+
+ We tackle the problem of sampling from intractable high-dimensional density
+functions, a fundamental task that often appears in machine learning and
+statistics. We extend recent sampling-based approaches that leverage controlled
+stochastic processes to model approximate samples from these target densities.
+The main drawback of these approaches is that the training objective requires
+full trajectories to compute, resulting in sluggish credit assignment issues
+due to use of entire trajectories and a learning signal present only at the
+terminal time. In this work, we present Diffusion Generative Flow Samplers
+(DGFS), a sampling-based framework where the learning process can be tractably
+broken down into short partial trajectory segments, via parameterizing an
+additional "flow function". Our method takes inspiration from the theory
+developed for generative flow networks (GFlowNets), allowing us to make use of
+intermediate learning signals. Through various challenging experiments, we
+demonstrate that DGFS achieves more accurate estimates of the normalization
+constant than closely-related prior methods.
+
+
+
+
+
+
+
+ ♻ ☆ Adversarial Purification with the Manifold Hypothesis AAAI 2024
+
+
+
+
+
+
+
+
+ Zhaoyuan Yang, Zhiwei Xu, Jing Zhang, Richard Hartley, Peter Tu
+
+
+ In this work, we formulate a novel framework for adversarial robustness using
+the manifold hypothesis. This framework provides sufficient conditions for
+defending against adversarial examples. We develop an adversarial purification
+method with this framework. Our method combines manifold learning with
+variational inference to provide adversarial robustness without the need for
+expensive adversarial training. Experimentally, our approach can provide
+adversarial robustness even if attackers are aware of the existence of the
+defense. In addition, our method can also serve as a test-time defense
+mechanism for variational autoencoders.
+
+
+
+ comment: Extended version of paper accepted at AAAI 2024 with supplementary
+ materials
+
+ Fine-tuning large pre-trained language models on downstream tasks has become
+an important paradigm in NLP. However, common practice fine-tunes all of the
+parameters in a pre-trained model, which becomes prohibitive when a large
+number of downstream tasks are present. Therefore, many fine-tuning methods are
+proposed to learn incremental updates of pre-trained weights in a parameter
+efficient way, e.g., low-rank increments. These methods often evenly distribute
+the budget of incremental updates across all pre-trained weight matrices, and
+overlook the varying importance of different weight parameters. As a
+consequence, the fine-tuning performance is suboptimal. To bridge this gap, we
+propose AdaLoRA, which adaptively allocates the parameter budget among weight
+matrices according to their importance score. In particular, AdaLoRA
+parameterizes the incremental updates in the form of singular value
+decomposition. Such a novel approach allows us to effectively prune the
+singular values of unimportant updates, which is essentially to reduce their
+parameter budget but circumvent intensive exact SVD computations. We conduct
+extensive experiments with several pre-trained models on natural language
+processing, question answering, and natural language generation to validate the
+effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable
+improvement over baselines, especially in the low budget settings. Our code is
+publicly available at https://github.com/QingruZhang/AdaLoRA .
+
+
+
+ comment: The 11th International Conference on Learning Representations (ICLR
+ 2023)
+
+
+
+
+
+
+ ♻ ☆ Universal and Transferable Adversarial Attacks on Aligned Language
+ Models
+
+
+
+
+
+
+
+
+ Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
+
+
+ Because "out-of-the-box" large language models are capable of generating a
+great deal of objectionable content, recent work has focused on aligning these
+models in an attempt to prevent undesirable generation. While there has been
+some success at circumventing these measures -- so-called "jailbreaks" against
+LLMs -- these attacks have required significant human ingenuity and are brittle
+in practice. In this paper, we propose a simple and effective attack method
+that causes aligned language models to generate objectionable behaviors.
+Specifically, our approach finds a suffix that, when attached to a wide range
+of queries for an LLM to produce objectionable content, aims to maximize the
+probability that the model produces an affirmative response (rather than
+refusing to answer). However, instead of relying on manual engineering, our
+approach automatically produces these adversarial suffixes by a combination of
+greedy and gradient-based search techniques, and also improves over past
+automatic prompt generation methods.
+ Surprisingly, we find that the adversarial prompts generated by our approach
+are quite transferable, including to black-box, publicly released LLMs.
+Specifically, we train an adversarial attack suffix on multiple prompts (i.e.,
+queries asking for many different types of objectionable content), as well as
+multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting
+attack suffix is able to induce objectionable content in the public interfaces
+to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,
+Pythia, Falcon, and others. In total, this work significantly advances the
+state-of-the-art in adversarial attacks against aligned language models,
+raising important questions about how such systems can be prevented from
+producing objectionable information. Code is available at
+github.com/llm-attacks/llm-attacks.
+
+
+
+ comment: Website: http://llm-attacks.org/
+
+
+
+
+
+
+ ♻ ☆ An Introduction to Bi-level Optimization: Foundations and Applications
+ in Signal Processing and Machine Learning
+
+
+
+
+
+
+
+
+ Yihua Zhang, Prashant Khanduri, Ioannis Tsaknakis, Yuguang Yao, Mingyi Hong, Sijia Liu
+
+
+ Recently, bi-level optimization (BLO) has taken center stage in some very
+exciting developments in the area of signal processing (SP) and machine
+learning (ML). Roughly speaking, BLO is a classical optimization problem that
+involves two levels of hierarchy (i.e., upper and lower levels), wherein
+obtaining the solution to the upper-level problem requires solving the
+lower-level one. BLO has become popular largely because it is powerful in
+modeling problems in SP and ML, among others, that involve optimizing nested
+objective functions. Prominent applications of BLO range from resource
+allocation for wireless systems to adversarial machine learning. In this work,
+we focus on a class of tractable BLO problems that often appear in SP and ML
+applications. We provide an overview of some basic concepts of this class of
+BLO problems, such as their optimality conditions, standard algorithms
+(including their optimization principles and practical implementations), as
+well as how they can be leveraged to obtain state-of-the-art results for a
+number of key SP and ML applications. Further, we discuss some recent advances
+in BLO theory, its implications for applications, and point out some
+limitations of the state-of-the-art that require significant future research
+efforts. Overall, we hope that this article can serve to accelerate the
+adoption of BLO as a generic tool to model, analyze, and innovate on a wide
+array of emerging SP and ML applications.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 7
+
+
+
+
+
+ ☆ Trajectory Approximation of Video Based on Phase Correlation for Forward
+ Facing Camera
+
+
+ In this paper, we introduce an innovative approach for extracting
+trajectories from a camera sensor in GPS-denied environments, leveraging visual
+odometry. The system takes video footage captured by a forward-facing camera
+mounted on a vehicle as input, with the output being a chain code representing
+the camera's trajectory. The proposed methodology involves several key steps.
+Firstly, we employ phase correlation between consecutive frames of the video to
+extract essential information. Subsequently, we introduce a novel chain code
+method termed "dynamic chain code," which is based on the x-shift values
+derived from the phase correlation. The third step involves determining
+directional changes (forward, left, right) by establishing thresholds and
+extracting the corresponding chain code. This extracted code is then stored in
+a buffer for further processing. Notably, our system outperforms traditional
+methods reliant on spatial features, exhibiting greater speed and robustness in
+noisy environments. Importantly, our approach operates without external camera
+calibration information. Moreover, by incorporating visual odometry, our system
+enhances its accuracy in estimating camera motion, providing a more
+comprehensive understanding of trajectory dynamics. Finally, the system
+culminates in the visualization of the normalized camera motion trajectory.
+
+
+
+
+
+
+
+ ☆ Coffee: Cost-Effective Edge Caching for 360 Degree Live Video Streaming
+
+
+
+
+
+
+
+
+ Chen Li, Tingwei Ye, Tongyu Zong, Liyang Sun, Houwei Cao, Yong Liu
+
+
+ While live 360 degree video streaming delivers immersive viewing experience,
+it poses significant bandwidth and latency challenges for content delivery
+networks. Edge servers are expected to play an important role in facilitating
+live streaming of 360 degree videos. In this paper, we propose a novel
+predictive edge caching algorithm (Coffee) for live 360 degree video that
+employ collaborative FoV prediction and predictive tile prefetching to reduce
+bandwidth consumption, streaming cost and improve the streaming quality and
+robustness. Our light-weight caching algorithms exploit the unique tile
+consumption patterns of live 360 degree video streaming to achieve high tile
+caching gains. Through extensive experiments driven by real 360 degree video
+streaming traces, we demonstrate that edge caching algorithms specifically
+designed for live 360 degree video streaming can achieve high streaming cost
+savings with small edge cache space consumption. Coffee, guided by viewer FoV
+predictions, significantly reduces back-haul traffic up to 76% compared to
+state-of-the-art edge caching algorithms. Furthermore, we develop a
+transcoding-aware variant (TransCoffee) and evaluate it using comprehensive
+experiments, which demonstrate that TransCoffee can achieve 63\% lower cost
+compared to state-of-the-art transcoding-aware approaches.
+
+
+
+
+
+
+
+
+ Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov
+
+
+ Multimedia generation approaches occupy a prominent place in artificial
+intelligence research. Text-to-image models achieved high-quality results over
+the last few years. However, video synthesis methods recently started to
+develop. This paper presents a new two-stage latent diffusion text-to-video
+generation architecture based on the text-to-image diffusion model. The first
+stage concerns keyframes synthesis to figure the storyline of a video, while
+the second one is devoted to interpolation frames generation to make movements
+of the scene and objects smooth. We compare several temporal conditioning
+approaches for keyframes generation. The results show the advantage of using
+separate temporal blocks over temporal layers in terms of metrics reflecting
+video generation quality aspects and human preference. The design of our
+interpolation model significantly reduces computational costs compared to other
+masked frame interpolation approaches. Furthermore, we evaluate different
+configurations of MoVQ-based video decoding scheme to improve consistency and
+achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our
+pipeline with existing solutions and achieve top-2 scores overall and top-1
+among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
+https://ai-forever.github.io/kandinsky-video/
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+
+ Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
+
+
+ In recent years, the deployment of large-scale pre-trained models in
+audio-visual downstream tasks has yielded remarkable outcomes. However, these
+models, primarily trained on single-modality unconstrained datasets, still
+encounter challenges in feature extraction for multi-modal tasks, leading to
+suboptimal performance. This limitation arises due to the introduction of
+irrelevant modality-specific information during encoding, which adversely
+affects the performance of downstream tasks. To address this challenge, this
+paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention
+mechanism. This mechanism leverages audio and visual modalities as soft prompts
+to dynamically adjust the parameters of pre-trained models based on the current
+multi-modal input features. Specifically, the DG-SCT module incorporates
+trainable cross-modal interaction layers into pre-trained audio-visual
+encoders, allowing adaptive extraction of crucial information from the current
+modality across spatial, channel, and temporal dimensions, while preserving the
+frozen parameters of large-scale pre-trained models. Experimental evaluations
+demonstrate that our proposed model achieves state-of-the-art results across
+multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our
+model exhibits promising performance in challenging few-shot and zero-shot
+scenarios. The source code and pre-trained models are available at
+https://github.com/haoyi-duan/DG-SCT.
+
+
+
+ comment: Accepted to NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual
+ Masked Autoencoder ICTAI
+
+
+ Learning high-quality video representation has shown significant applications
+in computer vision and remains challenging. Previous work based on mask
+autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of
+learning representations in images and videos through reconstruction strategy
+in the visual modality. However, these models exhibit inherent limitations,
+particularly in scenarios where extracting features solely from the visual
+modality proves challenging, such as when dealing with low-resolution and
+blurry original videos. Based on this, we propose AV-MaskEnhancer for learning
+high-quality video representation by combining visual and audio information.
+Our approach addresses the challenge by demonstrating the complementary nature
+of audio and video features in cross-modality content. Moreover, our result of
+the video classification task on the UCF101 dataset outperforms the existing
+work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a
+top-5 accuracy of 99.9%.
+
+
+
+ comment: 2023 IEEE 35th International Conference on Tools with Artificial
+ Intelligence (ICTAI)
+
+
+
+
+
+
+ ♻ ☆ HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment WACV 2024
+
+
+
+
+
+
+
+
+ Shreshth Saini, Avinab Saha, Alan C. Bovik
+
+
+ We introduce HIDRO-VQA, a no-reference (NR) video quality assessment model
+designed to provide precise quality evaluations of High Dynamic Range (HDR)
+videos. HDR videos exhibit a broader spectrum of luminance, detail, and color
+than Standard Dynamic Range (SDR) videos. As HDR content becomes increasingly
+popular, there is a growing demand for video quality assessment (VQA)
+algorithms that effectively address distortions unique to HDR content. To
+address this challenge, we propose a self-supervised contrastive fine-tuning
+approach to transfer quality-aware features from the SDR to the HDR domain,
+utilizing unlabeled HDR videos. Our findings demonstrate that self-supervised
+pre-trained neural networks on SDR content can be further fine-tuned in a
+self-supervised setting using limited unlabeled HDR videos to achieve
+state-of-the-art performance on the only publicly available VQA database for
+HDR content, the LIVE-HDR VQA database. Moreover, our algorithm can be extended
+to the Full Reference VQA setting, also achieving state-of-the-art performance.
+Our code is available publicly at https://github.com/avinabsaha/HIDRO-VQA.
+
+
+
+ comment: WACV 2024 Workshop Paper. Shreshth Saini, Avinab Saha contributed
+ equally to this work
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 85
+
+
+
+
+
+ ☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+
+
+
+
+
+ ☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+
+
+
+
+
+ ☆ SpokesBiz -- an Open Corpus of Conversational Polish
+
+
+
+
+
+
+
+
+ Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski
+
+
+ This paper announces the early release of SpokesBiz, a freely available
+corpus of conversational Polish developed within the CLARIN-BIZ project and
+comprising over 650 hours of recordings. The transcribed recordings have been
+diarized and manually annotated for punctuation and casing. We outline the
+general structure and content of the corpus, showcasing selected applications
+in linguistic research, evaluation and improvement of automatic speech
+recognition (ASR) systems
+
+
+
+
+
+
+
+ ☆ Avoiding Data Contamination in Language Model Evaluation: Dynamic Test
+ Construction with Latest Materials AAAI 2024
+
+
+
+
+
+
+
+
+ Yucheng Li, Frank Geurin, Chenghua Lin
+
+
+ Data contamination in evaluation is getting increasingly prevalent with the
+emerge of language models pre-trained on super large, automatically-crawled
+corpora. This problem leads to significant challenges in accurate assessment of
+model capabilities and generalisations. In this paper, we propose LatestEval,
+an automatic method leverages the most recent texts to create uncontaminated
+reading comprehension evaluations. LatestEval avoids data contamination by only
+using texts published within a recent time window, ensuring no overlap with the
+training corpora of pre-trained language models. We develop LatestEval
+automated pipeline to 1) gather latest texts; 2) identify key information, and
+3) construct questions targeting the information while removing the existing
+answers from the context. This encourages models to infer the answers
+themselves based on the remaining context, rather than just copy-paste. Our
+experiments demonstrate that language models exhibit negligible memorisation
+behaviours on LatestEval as opposed to previous benchmarks, suggesting a
+significantly reduced risk of data contamination and leading to a more robust
+evaluation. Data and code are publicly available at:
+https://github.com/liyucheng09/LatestEval.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
+
+
+ Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to
+interpret the complex nature of human sentiments. Despite significant progress
+in multimodal architecture design, the field lacks comprehensive regularization
+methods. This paper introduces PowMix, a versatile embedding space regularizer
+that builds upon the strengths of unimodal mixing-based regularization
+approaches and introduces novel algorithmic components that are specifically
+tailored to multimodal tasks. PowMix is integrated before the fusion stage of
+multimodal architectures and facilitates intra-modal mixing, such as mixing
+text with text, to act as a regularizer. PowMix consists of five components: 1)
+a varying number of generated mixed examples, 2) mixing factor reweighting, 3)
+anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing.
+Extensive experimentation across benchmark MSA datasets and a broad spectrum of
+diverse architectural designs demonstrate the efficacy of PowMix, as evidenced
+by consistent performance improvements over baselines and existing mixing
+methods. An in-depth ablation study highlights the critical contribution of
+each PowMix component and how they synergistically enhance performance.
+Furthermore, algorithmic analysis demonstrates how PowMix behaves in different
+scenarios, particularly comparing early versus late fusion architectures.
+Notably, PowMix enhances overall performance without sacrificing model
+robustness or magnifying text dominance. It also retains its strong performance
+in situations of limited data. Our findings position PowMix as a promising
+versatile regularization strategy for MSA. Code will be made available.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ☆ Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
+
+
+ With the recent surge in popularity of LLMs has come an ever-increasing need
+for LLM safety training. In this paper, we show that SOTA open-source LLMs are
+vulnerable to simple, optimization-free attacks we refer to as $\textit{priming
+attacks}$, which are easy to execute and effectively bypass alignment from
+safety training. Our proposed attack improves the Attack Success Rate on
+Harmful Behaviors, as measured by Llama Guard, by up to $3.3\times$ compared to
+baselines. Source code and data are available at
+https://github.com/uiuc-focal-lab/llm-priming-attacks .
+
+
+
+
+
+
+
+ ☆ Instruct-SCTG: Guiding Sequential Controlled Text Generation through
+ Instructions
+
+
+ Instruction-tuned large language models have shown remarkable performance in
+aligning generated text with user intentions across various tasks. However,
+maintaining human-like discourse structure in the generated text remains a
+challenging research question. In this paper, we propose Instruct-SCTG, a
+flexible and effective sequential framework that harnesses instruction-tuned
+language models to generate structurally coherent text in both fine-tuned and
+zero-shot setups. Our framework generates articles in a section-by-section
+manner, aligned with the desired human structure using natural language
+instructions. Furthermore, we introduce a new automatic metric that measures
+discourse divergence in a fuzzy manner. Extensive experiments on three datasets
+from representative domains of news and recipes demonstrate the
+state-of-the-art performance of our framework in imposing discourse structure
+during text generation, as verified by both automatic and human evaluation. Our
+code will be available on Github.
+
+
+
+
+
+
+
+ ☆ Automated speech audiometry: Can it work using open-source pre-trained
+ Kaldi-NL automatic speech recognition?
+
+
+
+
+
+
+
+
+ Gloria Araiza-Illan, Luke Meyer, Khiet P. Truong, Deniz Baskent
+
+
+ A practical speech audiometry tool is the digits-in-noise (DIN) test for
+hearing screening of populations of varying ages and hearing status. The test
+is usually conducted by a human supervisor (e.g., clinician), who scores the
+responses spoken by the listener, or online, where a software scores the
+responses entered by the listener. The test has 24 digit-triplets presented in
+an adaptive staircase procedure, resulting in a speech reception threshold
+(SRT). We propose an alternative automated DIN test setup that can evaluate
+spoken responses whilst conducted without a human supervisor, using the
+open-source automatic speech recognition toolkit, Kaldi-NL. Thirty
+self-reported normal-hearing Dutch adults (19-64 years) completed one
+DIN+Kaldi-NL test. Their spoken responses were recorded, and used for
+evaluating the transcript of decoded responses by Kaldi-NL. Study 1 evaluated
+the Kaldi-NL performance through its word error rate (WER), percentage of
+summed decoding errors regarding only digits found in the transcript compared
+to the total number of digits present in the spoken responses. Average WER
+across participants was 5.0% (range 0 - 48%, SD = 8.8%), with average decoding
+errors in three triplets per participant. Study 2 analysed the effect that
+triplets with decoding errors from Kaldi-NL had on the DIN test output (SRT),
+using bootstrapping simulations. Previous research indicated 0.70 dB as the
+typical within-subject SRT variability for normal-hearing adults. Study 2
+showed that up to four triplets with decoding errors produce SRT variations
+within this range, suggesting that our proposed setup could be feasible for
+clinical applications.
+
+
+ Sentiment analysis methods are rapidly being adopted by the field of Urban
+Design and Planning, for the crowdsourced evaluation of urban environments.
+However, most models used within this domain are able to identify positive or
+negative sentiment associated with a textual appraisal as a whole, without
+inferring information about specific urban aspects contained within it, or the
+sentiment associated with them. While Aspect Based Sentiment Analysis (ABSA) is
+becoming increasingly popular, most existing ABSA models are trained on
+non-urban themes such as restaurants, electronics, consumer goods and the like.
+This body of research develops an ABSA model capable of extracting urban
+aspects contained within geo-located textual urban appraisals, along with
+corresponding aspect sentiment classification. We annotate a dataset of 2500
+crowdsourced reviews of public parks, and train a Bidirectional Encoder
+Representations from Transformers (BERT) model with Local Context Focus (LCF)
+on this data. Our model achieves significant improvement in prediction accuracy
+on urban reviews, for both Aspect Term Extraction (ATE) and Aspect Sentiment
+Classification (ASC) tasks. For demonstrative analysis, positive and negative
+urban aspects across Boston are spatially visualized. We hope that this model
+is useful for designers and planners for fine-grained urban sentiment
+evaluation.
+
+
+
+ comment: Created for 6.8610, Quantitative Methods for Natural Language
+ Processing at MIT Fall 2022. 5 pages, 4 figures
+
+
+
+
+
+
+ ☆ GeomVerse: A Systematic Evaluation of Large Models for Geometric
+ Reasoning
+
+
+
+
+
+
+
+
+ Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, Radu Soricut
+
+
+ Large language models have shown impressive results for multi-hop
+mathematical reasoning when the input question is only textual. Many
+mathematical reasoning problems, however, contain both text and image. With the
+ever-increasing adoption of vision language models (VLMs), understanding their
+reasoning abilities for such problems is crucial. In this paper, we evaluate
+the reasoning capabilities of VLMs along various axes through the lens of
+geometry problems. We procedurally create a synthetic dataset of geometry
+questions with controllable difficulty levels along multiple axes, thus
+enabling a systematic evaluation. The empirical results obtained using our
+benchmark for state-of-the-art VLMs indicate that these models are not as
+capable in subjects like geometry (and, by generalization, other topics
+requiring similar reasoning) as suggested by previous benchmarks. This is made
+especially clear by the construction of our benchmark at various depth levels,
+since solving higher-depth problems requires long chains of reasoning rather
+than additional memorized knowledge. We release the dataset for further
+research in this area.
+
+
+
+
+
+
+
+ ☆ Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models:
+ A Critical Review and Assessment
+
+
+
+
+
+
+
+
+ Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang
+
+
+ With the continuous growth in the number of parameters of transformer-based
+pretrained language models (PLMs), particularly the emergence of large language
+models (LLMs) with billions of parameters, many natural language processing
+(NLP) tasks have demonstrated remarkable success. However, the enormous size
+and computational demands of these models pose significant challenges for
+adapting them to specific downstream tasks, especially in environments with
+limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers
+an effective solution by reducing the number of fine-tuning parameters and
+memory usage while achieving comparable performance to full fine-tuning. The
+demands for fine-tuning PLMs, especially LLMs, have led to a surge in the
+development of PEFT methods, as depicted in Fig. 1. In this paper, we present a
+comprehensive and systematic review of PEFT methods for PLMs. We summarize
+these PEFT methods, discuss their applications, and outline future directions.
+Furthermore, we conduct experiments using several representative PEFT methods
+to better understand their effectiveness in parameter efficiency and memory
+efficiency. By offering insights into the latest advancements and practical
+applications, this survey serves as an invaluable resource for researchers and
+practitioners seeking to navigate the challenges and opportunities presented by
+PEFT in the context of PLMs.
+
+
+
+ comment: 20 pages, 4 figures
+
+
+
+
+
+
+ ☆ Exploring the Residual Stream of Transformers
+
+
+ Transformer-based models have achieved great breakthroughs in recent years.
+However, there are many significant questions that have not been answered in
+the field of explaining the reason why the models have powerful outputs. We do
+not know how to locate the models' important parameters storing the knowledge
+for predicting the next word, and whether these parameters are stored on the
+same layer/module or different ones. Moreover, we do not understand the
+mechanism to merge the knowledge into the final embedding for next word
+prediction. In this paper, we explore the residual stream of transformers to
+increase the interpretability. We find the mechanism behind residual connection
+is a direct addition function on before-softmax values, so the probabilities of
+tokens with larger before-softmax values will increase. Moreover, we prove that
+using log probability increase as contribution scores is reasonable, and based
+on this we can locate important parameters. Besides, we propose a method to
+analyze how previous layers affect upper layers by comparing the inner
+products. The experimental results and case study show that our research can
+increase the interpretability of transformer-based models. We will release our
+code on https://github.com/zepingyu0512/residualstream.
+
+
+ Knowledge graphs (KGs) often contain various errors. Previous works on
+detecting errors in KGs mainly rely on triplet embedding from graph structure.
+We conduct an empirical study and find that these works struggle to
+discriminate noise from semantically-similar correct triplets. In this paper,
+we propose a KG error detection model CCA to integrate both textual and graph
+structural information from triplet reconstruction for better distinguishing
+semantics. We design interactive contrastive learning to capture the
+differences between textual and structural patterns. Furthermore, we construct
+realistic datasets with semantically-similar noise and adversarial noise.
+Experimental results demonstrate that CCA outperforms state-of-the-art
+baselines, especially in detecting semantically-similar noise and adversarial
+noise.
+
+
+
+ comment: Accepted in the 38th AAAI Conference on Artificial Intelligence (AAAI
+ 2024)
+
+
+
+
+
+
+ ☆ Founder-GPT: Self-play to evaluate the Founder-Idea fit
+
+
+ This research introduces an innovative evaluation method for the
+"founder-idea" fit in early-stage startups, utilizing advanced large language
+model techniques to assess founders' profiles against their startup ideas to
+enhance decision-making. Embeddings, self-play, tree-of-thought, and
+critique-based refinement techniques show early promising results that each
+idea's success patterns are unique and they should be evaluated based on the
+context of the founder's background.
+
+
+ Few-shot Relation Extraction (FSRE) aims to extract relational facts from a
+sparse set of labeled corpora. Recent studies have shown promising results in
+FSRE by employing Pre-trained Language Models (PLMs) within the framework of
+supervised contrastive learning, which considers both instances and label
+facts. However, how to effectively harness massive instance-label pairs to
+encompass the learned representation with semantic richness in this learning
+paradigm is not fully explored. To address this gap, we introduce a novel
+synergistic anchored contrastive pre-training framework. This framework is
+motivated by the insight that the diverse viewpoints conveyed through
+instance-label pairs capture incomplete yet complementary intrinsic textual
+semantics. Specifically, our framework involves a symmetrical contrastive
+objective that encompasses both sentence-anchored and label-anchored
+contrastive losses. By combining these two losses, the model establishes a
+robust and uniform representation space. This space effectively captures the
+reciprocal alignment of feature distributions among instances and relational
+facts, simultaneously enhancing the maximization of mutual information across
+diverse perspectives within the same relation. Experimental results demonstrate
+that our framework achieves significant performance enhancements compared to
+baseline models in downstream FSRE tasks. Furthermore, our approach exhibits
+superior adaptability to handle the challenges of domain shift and zero-shot
+relation extraction. Our code is available online at
+https://github.com/AONE-NLP/FSRE-SaCon.
+
+
+
+
+
+
+
+ ☆ Active Preference Inference using Language Models and Probabilistic
+ Reasoning
+
+
+
+
+
+
+
+
+ Top Piriyakulkij, Volodymyr Kuleshov, Kevin Ellis
+
+
+ Actively inferring user preferences, for example by asking good questions, is
+important for any human-facing decision-making system. Active inference allows
+such systems to adapt and personalize themselves to nuanced individual
+preferences. To enable this ability for instruction-tuned large language models
+(LLMs), one may prompt them to ask users questions to infer their preferences,
+transforming the language models into more robust, interactive systems.
+However, out of the box, these models are not efficient at extracting
+preferences: the questions they generate are not informative, requiring a high
+number of user interactions and impeding the usability of the downstream
+system. In this work, we introduce an inference-time algorithm that helps LLMs
+quickly infer preferences by using more informative questions. Our algorithm
+uses a probabilistic model whose conditional distributions are defined by
+prompting an LLM, and returns questions that optimize expected entropy and
+expected model change. Results in a simplified interactive web shopping setting
+with real product items show that an LLM equipped with our entropy reduction
+algorithm outperforms baselines with the same underlying LLM on task
+performance while using fewer user interactions.
+
+
+
+
+
+
+
+ ☆ Can ChatGPT be Your Personal Medical Assistant?
+
+
+ The advanced large language model (LLM) ChatGPT has shown its potential in
+different domains and remains unbeaten due to its characteristics compared to
+other LLMs. This study aims to evaluate the potential of using a fine-tuned
+ChatGPT model as a personal medical assistant in the Arabic language. To do so,
+this study uses publicly available online questions and answering datasets in
+Arabic language. There are almost 430K questions and answers for 20
+disease-specific categories. GPT-3.5-turbo model was fine-tuned with a portion
+of this dataset. The performance of this fine-tuned model was evaluated through
+automated and human evaluation. The automated evaluations include perplexity,
+coherence, similarity, and token count. Native Arabic speakers with medical
+knowledge evaluated the generated text by calculating relevance, accuracy,
+precision, logic, and originality. The overall result shows that ChatGPT has a
+bright future in medical assistance.
+
+
+
+ comment: 5 pages, 7 figures, two tables, Accepted on The International
+ Symposium on Foundation and Large Language Models (FLLM2023)
+
+ Mind-map generation aims to process a document into a hierarchical structure
+to show its central idea and branches. Such a manner is more conducive to
+understanding the logic and semantics of the document than plain text.
+Recently, a state-of-the-art method encodes the sentences of a document
+sequentially and converts them to a relation graph via sequence-to-graph.
+Though this method is efficient to generate mind-maps in parallel, its
+mechanism focuses more on sequential features while hardly capturing structural
+information. Moreover, it's difficult to model long-range semantic relations.
+In this work, we propose a coreference-guided mind-map generation network
+(CMGN) to incorporate external structure knowledge. Specifically, we construct
+a coreference graph based on the coreference semantic relationship to introduce
+the graph structure information. Then we employ a coreference graph encoder to
+mine the potential governing relations between sentences. In order to exclude
+noise and better utilize the information of the coreference graph, we adopt a
+graph enhancement module in a contrastive learning manner. Experimental results
+demonstrate that our model outperforms all the existing methods. The case study
+further proves that our model can more accurately and concisely reveal the
+structure and semantics of a document. Code and data are available at
+https://github.com/Cyno2232/CMGN.
+
+
+ Climate change presents significant challenges to the global community, and
+it is imperative to raise widespread awareness of the climate crisis and
+educate users about low-carbon living. Artificial intelligence, particularly
+large language models (LLMs), have emerged as powerful tools in mitigating the
+climate crisis, leveraging their extensive knowledge, broad user base, and
+natural language interaction capabilities. However, despite the growing body of
+research on climate change, there is a lack of comprehensive assessments of
+climate crisis knowledge within LLMs. This paper aims to resolve this gap by
+proposing an automatic evaluation framework. We employ a hybrid approach to
+data acquisition that combines data synthesis and manual collection to compile
+a diverse set of questions related to the climate crisis. These questions cover
+various aspects of climate change, including its causes, impacts, mitigation
+strategies, and adaptation measures. We then evaluate the model knowledge
+through prompt engineering based on the collected questions and generated
+answers. We propose a set of comprehensive metrics to evaluate the climate
+crisis knowledge, incorporating indicators from 10 different perspectives.
+Experimental results show that our method is effective in evaluating the
+knowledge of LLMs regarding the climate crisis. We evaluate several
+state-of-the-art LLMs and find that their knowledge falls short in terms of
+timeliness.
+
+
+
+
+
+
+
+ ☆ Fluctuation-based Adaptive Structured Pruning for Large Language Models AAAI 2024
+
+
+
+
+
+
+
+
+ Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
+
+
+ Network Pruning is a promising way to address the huge computing resource
+demands of the deployment and inference of Large Language Models (LLMs).
+Retraining-free is important for LLMs' pruning methods. However, almost all of
+the existing retraining-free pruning approaches for LLMs focus on unstructured
+pruning, which requires specific hardware support for acceleration. In this
+paper, we propose a novel retraining-free structured pruning framework for
+LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is
+hardware-friendly by effectively reducing storage and enhancing inference
+speed. For effective structured pruning of LLMs, we highlight three critical
+elements that demand the utmost attention: formulating structured importance
+metrics, adaptively searching the global compressed model, and implementing
+compensation mechanisms to mitigate performance loss. First, FLAP determines
+whether the output feature map is easily recoverable when a column of weight is
+removed, based on the fluctuation pruning metric. Then it standardizes the
+importance scores to adaptively determine the global compressed model
+structure. At last, FLAP adds additional bias terms to recover the output
+feature maps using the baseline values. We thoroughly evaluate our approach on
+a variety of language benchmarks. Without any retraining, our method
+significantly outperforms the state-of-the-art methods, including LLM-Pruner
+and the extension of Wanda in structured pruning. The code is released at
+https://github.com/CASIA-IVA-Lab/FLAP.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ Large Language Models Empowered Agent-based Modeling and Simulation: A
+ Survey and Perspectives
+
+
+ Agent-based modeling and simulation has evolved as a powerful tool for
+modeling complex systems, offering insights into emergent behaviors and
+interactions among diverse agents. Integrating large language models into
+agent-based modeling and simulation presents a promising avenue for enhancing
+simulation capabilities. This paper surveys the landscape of utilizing large
+language models in agent-based modeling and simulation, examining their
+challenges and promising future directions. In this survey, since this is an
+interdisciplinary field, we first introduce the background of agent-based
+modeling and simulation and large language model-empowered agents. We then
+discuss the motivation for applying large language models to agent-based
+simulation and systematically analyze the challenges in environment perception,
+human alignment, action generation, and evaluation. Most importantly, we
+provide a comprehensive overview of the recent works of large language
+model-empowered agent-based modeling and simulation in multiple scenarios,
+which can be divided into four domains: cyber, physical, social, and hybrid,
+covering simulation of both real-world and virtual environments. Finally, since
+this area is new and quickly evolving, we discuss the open problems and
+promising future directions.
+
+
+
+
+
+
+
+
+ Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
+
+
+ Conversational Speech Synthesis (CSS) aims to accurately express an utterance
+with the appropriate prosody and emotional inflection within a conversational
+setting. While recognising the significance of CSS task, the prior studies have
+not thoroughly investigated the emotional expressiveness problems due to the
+scarcity of emotional conversational datasets and the difficulty of stateful
+emotion modeling. In this paper, we propose a novel emotional CSS model, termed
+ECSS, that includes two main components: 1) to enhance emotion understanding,
+we introduce a heterogeneous graph-based emotional context modeling mechanism,
+which takes the multi-source dialogue history as input to model the dialogue
+context and learn the emotion cues from the context; 2) to achieve emotion
+rendering, we employ a contrastive learning-based emotion renderer module to
+infer the accurate emotion style for the target utterance. To address the issue
+of data scarcity, we meticulously create emotional labels in terms of category
+and intensity, and annotate additional emotional information on the existing
+conversational dataset (DailyTalk). Both objective and subjective evaluations
+suggest that our model outperforms the baseline models in understanding and
+rendering emotions. These evaluations also underscore the importance of
+comprehensive emotional annotations. Code and audio samples can be found at:
+https://github.com/walker-hyf/ECSS.
+
+
+
+ comment: 9 pages, 4 figures, Accepted by AAAI'2024, Code and audio samples:
+ https://github.com/walker-hyf/ECSS
+
+
+
+
+
+
+ ☆ Multi-Granularity Information Interaction Framework for Incomplete
+ Utterance Rewriting EMNLP2023
+
+
+
+
+
+
+
+
+ Haowei Du, Dingyu Zhang, Chen Li, Yang Li, Dongyan Zhao
+
+
+ Recent approaches in Incomplete Utterance Rewriting (IUR) fail to capture the
+source of important words, which is crucial to edit the incomplete utterance,
+and introduce words from irrelevant utterances. We propose a novel and
+effective multi-task information interaction framework including context
+selection, edit matrix construction, and relevance merging to capture the
+multi-granularity of semantic information. Benefiting from fetching the
+relevant utterance and figuring out the important words, our approach
+outperforms existing state-of-the-art models on two benchmark datasets
+Restoration-200K and CANAND in this field. Code will be provided on
+\url{https://github.com/yanmenxue/QR}.
+
+
+
+
+
+
+
+
+ Haowei Du, Quzhe Huang, Chen Li, Chen Zhang, Yang Li, Dongyan Zhao
+
+
+ Multi-hop Knowledge Base Question Answering(KBQA) aims to find the answer
+entity in a knowledge graph (KG), which requires multiple steps of reasoning.
+Existing retrieval-based approaches solve this task by concentrating on the
+specific relation at different hops and predicting the intermediate entity
+within the reasoning path. During the reasoning process of these methods, the
+representation of relations are fixed but the initial relation representation
+may not be optimal. We claim they fail to utilize information from head-tail
+entities and the semantic connection between relations to enhance the current
+relation representation, which undermines the ability to capture information of
+relations in KGs. To address this issue, we construct a \textbf{dual relation
+graph} where each node denotes a relation in the original KG (\textbf{primal
+entity graph}) and edges are constructed between relations sharing same head or
+tail entities. Then we iteratively do primal entity graph reasoning, dual
+relation graph information propagation, and interaction between these two
+graphs. In this way, the interaction between entity and relation is enhanced,
+and we derive better entity and relation representations. Experiments on two
+public datasets, WebQSP and CWQ, show that our approach achieves a significant
+performance gain over the prior state-of-the-art. Our code is available on
+\url{https://github.com/yanmenxue/RAH-KBQA}.
+
+
+
+ comment: Findings of EMNLP2023 (Long)
+
+
+
+
+
+
+ ☆ External Knowledge Augmented Polyphone Disambiguation Using Large
+ Language Model
+
+
+ One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is
+polyphone disambiguation when doing grapheme-to-phoneme (G2P) conversion. In
+this paper, we introduce a novel method to solve the problem as a generation
+task. Following the trending research of large language models (LLM) and prompt
+learning, the proposed method consists of three modules. Retrieval module
+incorporates external knowledge which is a multi-level semantic dictionary of
+Chinese polyphonic characters to format the sentence into a prompt. Generation
+module adopts the decoder-only Transformer architecture to induce the target
+text. Postprocess module corrects the generated text into a valid result if
+needed. Experimental results show that our method outperforms the existing
+methods on a public dataset called CPP. We also empirically study the impacts
+of different templates of the prompt, different sizes of training data, and
+whether to incorporate external knowledge.
+
+
+
+
+
+
+
+ ☆ Analyzing Public Reactions, Perceptions, and Attitudes during the MPox
+ Outbreak: Findings from Topic Modeling of Tweets
+
+
+ The recent outbreak of the MPox virus has resulted in a tremendous increase
+in the usage of Twitter. Prior works in this area of research have primarily
+focused on the sentiment analysis and content analysis of these Tweets, and the
+few works that have focused on topic modeling have multiple limitations. This
+paper aims to address this research gap and makes two scientific contributions
+to this field. First, it presents the results of performing Topic Modeling on
+601,432 Tweets about the 2022 Mpox outbreak that were posted on Twitter between
+7 May 2022 and 3 March 2023. The results indicate that the conversations on
+Twitter related to Mpox during this time range may be broadly categorized into
+four distinct themes - Views and Perspectives about Mpox, Updates on Cases and
+Investigations about Mpox, Mpox and the LGBTQIA+ Community, and Mpox and
+COVID-19. Second, the paper presents the findings from the analysis of these
+Tweets. The results show that the theme that was most popular on Twitter (in
+terms of the number of Tweets posted) during this time range was Views and
+Perspectives about Mpox. This was followed by the theme of Mpox and the
+LGBTQIA+ Community, which was followed by the themes of Mpox and COVID-19 and
+Updates on Cases and Investigations about Mpox, respectively. Finally, a
+comparison with related studies in this area of research is also presented to
+highlight the novelty and significance of this research work.
+
+
+
+
+
+
+
+ ☆ Difficulty-Focused Contrastive Learning for Knowledge Tracing with a
+ Large Language Model-Based Difficulty Prediction
+
+
+
+
+
+
+
+
+ Unggi Lee, Sungjun Yoon, Joon Seo Yun, Kyoungsoo Park, YoungHoon Jung, Damji Stratton, Hyeoncheol Kim
+
+
+ This paper presents novel techniques for enhancing the performance of
+knowledge tracing (KT) models by focusing on the crucial factor of question and
+concept difficulty level. Despite the acknowledged significance of difficulty,
+previous KT research has yet to exploit its potential for model optimization
+and has struggled to predict difficulty from unseen data. To address these
+problems, we propose a difficulty-centered contrastive learning method for KT
+models and a Large Language Model (LLM)-based framework for difficulty
+prediction. These innovative methods seek to improve the performance of KT
+models and provide accurate difficulty estimates for unseen data. Our ablation
+study demonstrates the efficacy of these techniques by demonstrating enhanced
+KT model performance. Nonetheless, the complex relationship between language
+and difficulty merits further investigation.
+
+
+
+ comment: 10 pages, 4 figures, 2 tables
+
+
+
+
+
+
+ ☆ ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for
+ Accelerating Language Models Inference AAAI24
+
+
+ Early Exiting is one of the most popular methods to achieve efficient
+inference. Current early exiting methods adopt the (weighted) sum of the cross
+entropy loss of all internal classifiers during training, imposing all these
+classifiers to predict all instances correctly. However, during inference, as
+long as one internal classifier predicts an instance correctly, it can
+accelerate without losing accuracy. Thus, there is a notable gap between
+training and inference. We propose ConsistentEE, an early exiting method that
+is consistent in training and inference. ConsistentEE formulates the early
+exiting process as a reinforcement learning problem. A policy network is added
+to decide whether an instance should exit or continue. The training objective
+of ConsistentEE only require each instance to be predicted correctly by one
+internal classifier. Additionally, we introduce the concept Memorize Layer to
+measure the hardness of an instance. We incorporate memorized layer into reward
+function design, which allows ``easy'' instances to focus more on acceleration
+while ``hard'' instances to focus more on accuracy. Experimental results show
+that our method outperforms other baselines on various natural language
+understanding and generation tasks.
+
+
+
+ comment: Accepted in AAAI24
+
+
+
+
+
+
+ ☆ Punctuation restoration Model and Spacing Model for Korean Ancient
+ Document
+
+
+ In Korean ancient documents, there is no spacing or punctuation, and they are
+written in classical Chinese characters. This makes it challenging for modern
+individuals and translation models to accurately interpret and translate them.
+While China has models predicting punctuation and spacing, applying them
+directly to Korean texts is problematic due to data differences. Therefore, we
+developed the first models which predict punctuation and spacing for Korean
+historical texts and evaluated their performance. Our punctuation restoration
+model achieved an F1 score of 0.84, and Spacing model achieved a score of 0.96.
+It has the advantage of enabling inference on low-performance GPUs with less
+VRAM while maintaining quite high accuracy.
+
+
+
+ comment: 5 Pages, 2 Figures
+
+
+
+
+
+
+ ☆ Sparse is Enough in Fine-tuning Pre-trained Large Language Model
+
+
+
+
+
+
+
+
+ Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
+
+
+ With the prevalence of pre-training-fine-tuning paradigm, how to efficiently
+adapt the pre-trained model to the downstream tasks has been an intriguing
+issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for
+low-cost adaptation, including Adapters, Bia-only, and the recently widely used
+Low-Rank Adaptation. Although these methods have demonstrated their
+effectiveness to some extent and have been widely applied, the underlying
+principles are still unclear. In this paper, we reveal the transition of loss
+landscape in the downstream domain from random initialization to pre-trained
+initialization, that is, from low-amplitude oscillation to high-amplitude
+oscillation. The parameter gradients exhibit a property akin to sparsity, where
+a small fraction of components dominate the total gradient norm, for instance,
+1% of the components account for 99% of the gradient. This property ensures
+that the pre-trained model can easily find a flat minimizer which guarantees
+the model's ability to generalize even with a low number of trainable
+parameters. Based on this, we propose a gradient-based sparse fine-tuning
+algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its
+effectiveness on a range of tasks including the GLUE Benchmark and
+Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
+
+
+
+
+
+
+
+ ☆ A Revisit of Fake News Dataset with Augmented Fact-checking by ChatGPT
+
+
+ The proliferation of fake news has emerged as a critical issue in recent
+years, requiring significant efforts to detect it. However, the existing fake
+news detection datasets are sourced from human journalists, which are likely to
+have inherent bias limitations due to the highly subjective nature of this
+task. In this paper, we revisit the existing fake news dataset verified by
+human journalists with augmented fact-checking by large language models
+(ChatGPT), and we name the augmented fake news dataset ChatGPT-FC. We
+quantitatively analyze the distinctions and resemblances between human
+journalists and LLM in assessing news subject credibility, news creator
+credibility, time-sensitive, and political framing. Our findings highlight
+LLM's potential to serve as a preliminary screening method, offering a
+promising avenue to mitigate the inherent biases of human journalists and
+enhance fake news detection.
+
+
+
+
+
+
+
+ ☆ Predicting Human Translation Difficulty with Neural Machine Translation
+
+
+ Human translators linger on some words and phrases more than others, and
+predicting this variation is a step towards explaining the underlying cognitive
+processes. Using data from the CRITT Translation Process Research Database, we
+evaluate the extent to which surprisal and attentional features derived from a
+Neural Machine Translation (NMT) model account for reading and production times
+of human translators. We find that surprisal and attention are complementary
+predictors of translation difficulty, and that surprisal derived from a NMT
+model is the single most successful predictor of production duration. Our
+analyses draw on data from hundreds of translators operating across 13 language
+pairs, and represent the most comprehensive investigation of human translation
+difficulty to date.
+
+
+
+
+
+
+
+ ☆ TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with
+ Decentralized Natural Language Understanding Models
+
+
+ Chatbots have become one of the main pathways for the delivery of business
+automation tools. Multi-agent systems offer a framework for designing chatbots
+at scale, making it easier to support complex conversations that span across
+multiple domains as well as enabling developers to maintain and expand their
+capabilities incrementally over time. However, multi-agent systems complicate
+the natural language understanding (NLU) of user intents, especially when they
+rely on decentralized NLU models: some utterances (termed single intent) may
+invoke a single agent while others (termed multi-intent) may explicitly invoke
+multiple agents. Without correctly parsing multi-intent inputs, decentralized
+NLU approaches will not achieve high prediction accuracy. In this paper, we
+propose an efficient parsing and orchestration pipeline algorithm to service
+multi-intent utterances from the user in the context of a multi-agent system.
+Our proposed approach achieved comparable performance to competitive deep
+learning models on three different datasets while being up to 48 times faster.
+
+
+
+ comment: 16 pages
+
+
+
+
+
+
+ ☆ An Adaptive Placement and Parallelism Framework for Accelerating RLHF
+ Training
+
+
+
+
+
+
+
+
+ Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
+
+
+ Recently, ChatGPT or InstructGPT like large language models (LLM) has made a
+significant impact in the AI world. These models are incredibly versatile,
+capable of performing language tasks on par or even exceeding the capabilities
+of human experts. Many works have attempted to reproduce the complex
+InstructGPT's RLHF (Reinforcement Learning with Human Feedback) training
+pipeline. However, the mainstream distributed RLHF training methods typically
+adopt a fixed model placement strategy, referred to as the Flattening strategy.
+This strategy treats all four models involved in RLHF as a single entity and
+places them on all devices, regardless of their differences. Unfortunately,
+this strategy exacerbates the generation bottlenecks in the RLHF training and
+degrades the overall training efficiency. To address these issues, we propose
+an adaptive model placement framework that offers two flexible model placement
+strategies. These strategies allow for the agile allocation of models across
+devices in a fine-grained manner. The Interleaving strategy helps reduce memory
+redundancy and communication costs during RLHF training. On the other hand, the
+Separation strategy improves the throughput of model training by separating the
+training and generation stages of the RLHF pipeline. Notably, this framework
+seamlessly integrates with other mainstream techniques for acceleration and
+enables automatic hyperparameter search. Extensive experiments have
+demonstrated that our Interleaving and Separation strategies can achieve
+notable improvements up to 11x, compared to the current state-of-the-art (SOTA)
+approaches. These experiments encompassed a wide range of training scenarios,
+involving models of varying sizes and devices of different scales. The results
+highlight the effectiveness and superiority of our approaches in accelerating
+the training of distributed RLHF.
+
+
+
+
+
+
+
+ ☆ Gemini: A Family of Highly Capable Multimodal Models
+
+
+
+
+
+
+
+
+ Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter, Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yujing Zhang, Ravi Addanki, Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown, Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaly Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, Hanzhao Lin, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo-yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yong Cheng, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, Thais Kagohara, Jay Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, Jinwei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek Jindal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park, Elnaz Davoodi, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Manaal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor, Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov, Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C. Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar, Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polozov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey, Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Severyn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Sabaer Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li, Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan, Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Summer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi Lahoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev, Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Komal Jalan, Dinghua Li, Ginger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan, Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals
+
+
+ This report introduces a new family of multimodal models, Gemini, that
+exhibit remarkable capabilities across image, audio, video, and text
+understanding. The Gemini family consists of Ultra, Pro, and Nano sizes,
+suitable for applications ranging from complex reasoning tasks to on-device
+memory-constrained use-cases. Evaluation on a broad range of benchmarks shows
+that our most-capable Gemini Ultra model advances the state of the art in 30 of
+32 of these benchmarks - notably being the first model to achieve human-expert
+performance on the well-studied exam benchmark MMLU, and improving the state of
+the art in every one of the 20 multimodal benchmarks we examined. We believe
+that the new capabilities of Gemini models in cross-modal reasoning and
+language understanding will enable a wide variety of use cases and we discuss
+our approach toward deploying them responsibly to users.
+
+
+
+
+
+
+
+ ☆ Designing Guiding Principles for NLP for Healthcare: A Case Study of
+ Maternal Health
+
+
+
+
+
+
+
+
+ Maria Antoniak, Aakanksha Naik, Carla S. Alvarado, Lucy Lu Wang, Irene Y. Chen
+
+
+ Objective: An ethical framework for the use of large language models (LLMs)
+is urgently needed to shape how natural language processing (NLP) tools are
+used for healthcare applications. Drawing directly from the voices of those
+most affected, we propose a set of guiding principles for the use of NLP in
+healthcare, with examples based on applications in maternal health.
+ Materials and Methods: We led an interactive session centered on an LLM-based
+chatbot demonstration during a full-day workshop with 39 participants, and
+additionally surveyed 30 healthcare workers and 30 birthing people about their
+values, needs, and perceptions of AI and LLMs. We conducted quantitative and
+qualitative analyses of the interactive discussions to consolidate our findings
+into a set of guiding principles.
+ Results: Using the case study of maternal health, we propose nine principles
+for ethical use of LLMs, grouped into three categories: (i) contextual
+significance, (ii) measurements, and (iii) who/what is valued. We describe
+rationales underlying these principles and provide practical advice.
+ Discussion: Healthcare faces existing challenges including the balance of
+power in clinician-patient relationships, systemic health disparities,
+historical injustices, and economic constraints. Our principles serve as a
+framework for surfacing key considerations when deploying LLMs in medicine, as
+well as providing a methodological pattern for other researchers to follow.
+ Conclusion: This set of principles can serve as a resource to practitioners
+working on maternal health and other healthcare fields to emphasize the
+importance of technical nuance, historical context, and inclusive design when
+developing LLMs for use in clinical settings.
+
+
+
+
+
+
+
+ ☆ MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA AAAI
+
+
+
+
+
+
+
+
+ Lang Yu, Qin Chen, Jie Zhou, Liang He
+
+
+ Large language models (LLMs) have shown great success in various Natural
+Language Processing (NLP) tasks, whist they still need updates after deployment
+to fix errors or keep pace with the changing knowledge in the world.
+Researchers formulate such problem as Model Editing and have developed various
+editors focusing on different axes of editing properties. However, current
+editors can hardly support all properties and rely on heavy computational
+resources. In this paper, we propose a plug-in Model Editing method based on
+neuron-indexed dynamic LoRA (MELO), which alters the behavior of language
+models by dynamically activating certain LoRA blocks according to the index
+built in an inner vector database. Our method satisfies various editing
+properties with high efficiency and can be easily integrated into multiple LLM
+backbones. Experimental results show that our proposed MELO achieves
+state-of-the-art editing performance on three sequential editing tasks
+(document classification, question answering and hallucination correction),
+while requires the least trainable parameters and computational cost.
+
+
+
+ comment: In Proceedings of The 38th Annual AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+ ☆ COOPER: Coordinating Specialized Agents towards a Complex Dialogue Goal AAAI 2024
+
+
+
+
+
+
+
+
+ Yi Cheng, Wenge Liu, Jian Wang, Chak Tou Leong, Yi Ouyang, Wenjie Li, Xian Wu, Yefeng Zheng
+
+
+ In recent years, there has been a growing interest in exploring dialogues
+with more complex goals, such as negotiation, persuasion, and emotional
+support, which go beyond traditional service-focused dialogue systems. Apart
+from the requirement for much more sophisticated strategic reasoning and
+communication skills, a significant challenge of these tasks lies in the
+difficulty of objectively measuring the achievement of their goals in a
+quantifiable way, making it difficult for existing research to directly
+optimize the dialogue procedure towards them. In our work, we emphasize the
+multifaceted nature of complex dialogue goals and argue that it is more
+feasible to accomplish them by comprehensively considering and jointly
+promoting their different aspects. To this end, we propose a novel dialogue
+framework, Cooper, which coordinates multiple specialized agents, each
+dedicated to a specific dialogue goal aspect separately, to approach the
+complex objective. Through this divide-and-conquer manner, we make complex
+dialogue goals more approachable and elicit greater intelligence via the
+collaboration of individual agents. Experiments on persuasion and emotional
+support dialogues demonstrate the superiority of our method over a set of
+competitive baselines.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Zero-Shot Fact-Checking with Semantic Triples and Knowledge Graphs
+
+
+ Despite progress in automated fact-checking, most systems require a
+significant amount of labeled training data, which is expensive. In this paper,
+we propose a novel zero-shot method, which instead of operating directly on the
+claim and evidence sentences, decomposes them into semantic triples augmented
+using external knowledge graphs, and uses large language models trained for
+natural language inference. This allows it to generalize to adversarial
+datasets and domains that supervised models require specific training data for.
+Our empirical results show that our approach outperforms previous zero-shot
+approaches on FEVER, FEVER-Symmetric, FEVER 2.0, and Climate-FEVER, while being
+comparable or better than supervised models on the adversarial and the
+out-of-domain datasets.
+
+
+
+
+
+
+
+ ☆ Are you talking to ['xem'] or ['x', 'em']? On Tokenization and
+ Addressing Misgendering in LLMs with Pronoun Tokenization Parity
+
+
+ A large body of NLP research has documented the ways gender biases manifest
+and amplify within large language models (LLMs), though this research has
+predominantly operated within a gender binary-centric context. A growing body
+of work has identified the harmful limitations of this gender-exclusive
+framing; many LLMs cannot correctly and consistently refer to persons outside
+the gender binary, especially if they use neopronouns. While data scarcity has
+been identified as a possible culprit, the precise mechanisms through which it
+influences LLM misgendering remain underexplored. Our work addresses this gap
+by studying data scarcity's role in subword tokenization and, consequently, the
+formation of LLM word representations. We uncover how the Byte-Pair Encoding
+(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun
+misgendering through out-of-vocabulary behavior. We introduce pronoun
+tokenization parity (PTP), a novel approach to reduce LLM neopronoun
+misgendering by preserving a token's functional structure. We evaluate PTP's
+efficacy using pronoun consistency-based metrics and a novel syntax-based
+metric. Through several controlled experiments, finetuning LLMs with PTP
+improves neopronoun consistency from 14.5% to 58.4%, highlighting the
+significant role tokenization plays in LLM pronoun consistency.
+
+
+
+ comment: Accepted to 2023 Neurips Queer in AI workshop
+
+
+
+
+
+
+ ☆ Is post-editing really faster than human translation?
+
+
+ Time efficiency is paramount for the localisation industry, which demands
+ever-faster turnaround times. However, translation speed is largely
+underresearched, and there is a lack of clarity about how language service
+providers (LSPs) can evaluate the performance of their post-editing (PE) and
+human translation (HT) services. This study constitutes the first large-scale
+investigation of translation and revision speed in HT and in the PE of neural
+machine translation, based on real-world data from an LSP. It uses an
+exploratory data analysis approach to investigate data for 90 million words
+translated by 879 linguists across 11 language pairs, over 2.5 years. The
+results of this research indicate that (a) PE is usually but not always faster
+than HT; (b) average speed values may be misleading; (c) translation speed is
+highly variable; and (d) edit distance cannot be used as a proxy for
+post-editing productivity, because it does not correlate strongly with speed.
+
+
+
+ comment: 30 pages, 11 tables, 7 figures. This article has been published in
+ Translation Spaces. This is the author accepted manuscript. Please find the
+ published version at: https://doi.org/10.1075/ts.22044.ter
+
+
+
+
+
+
+ ☆ Can Transformers Learn Sequential Function Classes In Context?
+
+
+
+
+
+
+
+
+ Ryan Campbell, Emma Guo, Evan Hu, Reya Vir, Ethan Hsiao
+
+
+ In-context learning (ICL) has revolutionized the capabilities of transformer
+models in NLP. In our project, we extend the understanding of the mechanisms
+underpinning ICL by exploring whether transformers can learn from sequential,
+non-textual function class data distributions. We introduce a novel sliding
+window sequential function class and employ toy-sized transformers with a GPT-2
+architecture to conduct our experiments. Our analysis indicates that these
+models can indeed leverage ICL when trained on non-textual sequential function
+classes. Additionally, our experiments with randomized y-label sequences
+highlights that transformers retain some ICL capabilities even when the label
+associations are obfuscated. We provide evidence that transformers can reason
+with and understand sequentiality encoded within function classes, as reflected
+by the effective learning of our proposed tasks. Our results also show that the
+performance deteriorated with increasing randomness in the labels, though not
+to the extent one might expect, implying a potential robustness of learned
+sequentiality against label noise. Future research may want to look into how
+previous explanations of transformers, such as induction heads and task
+vectors, relate to sequentiality in ICL in these toy examples. Our
+investigation lays the groundwork for further research into how transformers
+process and perceive sequential data.
+
+
+
+ comment: 8 pages, 8 figures
+
+
+
+
+
+
+ ☆ MotionScript: Natural Language Descriptions for Expressive 3D Human
+ Motions
+
+
+
+
+
+
+
+
+ Payam Jome Yazdian, Eric Liu, Li Cheng, Angelica Lim
+
+
+ This paper proposes MotionScript, a motion-to-text conversion algorithm and
+natural language representation for human body motions. MotionScript aims to
+describe movements in greater detail and with more accuracy than previous
+natural language approaches. Many motion datasets describe relatively objective
+and simple actions with little variation on the way they are expressed (e.g.
+sitting, walking, dribbling a ball). But for expressive actions that contain a
+diversity of movements in the class (e.g. being sad, dancing), or for actions
+outside the domain of standard motion capture datasets (e.g. stylistic walking,
+sign-language), more specific and granular natural language descriptions are
+needed. Our proposed MotionScript descriptions differ from existing natural
+language representations in that it provides direct descriptions in natural
+language instead of simple action labels or high-level human captions. To the
+best of our knowledge, this is the first attempt at translating 3D motions to
+natural language descriptions without requiring training data. Our experiments
+show that when MotionScript representations are used in a text-to-motion neural
+task, body movements are more accurately reconstructed, and large language
+models can be used to generate unseen complex motions.
+
+
+
+
+
+
+
+ ☆ Building a Llama2-finetuned LLM for Odia Language Utilizing Domain
+ Knowledge Instruction Set
+
+
+ Building LLMs for languages other than English is in great demand due to the
+unavailability and performance of multilingual LLMs, such as understanding the
+local context. The problem is critical for low-resource languages due to the
+need for instruction sets. In a multilingual country like India, there is a
+need for LLMs supporting Indic languages to provide generative AI and LLM-based
+technologies and services to its citizens.
+ This paper presents our approach of i) generating a large Odia instruction
+set, including domain knowledge data suitable for LLM fine-tuning, and ii)
+building a Llama2-finetuned model tailored for enhanced performance in the Odia
+domain. The proposed work will help researchers build an instruction set and
+LLM, particularly for Indic languages. We will release the model and
+instruction set for the public for research and noncommercial purposes.
+
+
+
+
+
+
+
+ ☆ An Empirical study of Unsupervised Neural Machine Translation: analyzing
+ NMT output, model's behavior and sentences' contribution
+
+
+ Unsupervised Neural Machine Translation (UNMT) focuses on improving NMT
+results under the assumption there is no human translated parallel data, yet
+little work has been done so far in highlighting its advantages compared to
+supervised methods and analyzing its output in aspects other than translation
+accuracy. We focus on three very diverse languages, French, Gujarati, and
+Kazakh, and train bilingual NMT models, to and from English, with various
+levels of supervision, in high- and low- resource setups, measure quality of
+the NMT output and compare the generated sequences' word order and semantic
+similarity to source and reference sentences. We also use Layer-wise Relevance
+Propagation to evaluate the source and target sentences' contribution to the
+result, expanding the findings of previous works to the UNMT paradigm.
+
+
+
+
+
+
+
+ ☆ Users Approach on Providing Feedback for Smart Home Devices
+
+
+ Smart Home technology has accomplished extraordinary interest in making
+individuals' lives more straightforward and more relaxing as of late.
+Technology as of late brought about delivering numerous savvy and refined
+frameworks which advanced clever living innovation. In this paper, we will be
+investigating the behavioural intention of user's approach on providing
+feedback for smart home devices. We will be conducting an online survey for
+sample of three to five students selected by simple random sampling to study
+the user's motto for giving feedback on smart home devices and their
+expectations. We have observed that most users are ready to share their
+feedback on smart home devices actively to improvise the service and quality of
+the product to fulfill the user needs and make their lives easier.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2312.11817
+
+
+
+
+
+
+ ♻ ☆ PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in
+ Poetry Generation AAAI2024
+
+
+ Controllable text generation is a challenging and meaningful field in natural
+language generation (NLG). Especially, poetry generation is a typical one with
+well-defined and strict conditions for text generation which is an ideal
+playground for the assessment of current methodologies. While prior works
+succeeded in controlling either semantic or metrical aspects of poetry
+generation, simultaneously addressing both remains a challenge. In this paper,
+we pioneer the use of the Diffusion model for generating sonnets and Chinese
+SongCi poetry to tackle such challenges. In terms of semantics, our
+PoetryDiffusion model, built upon the Diffusion model, generates entire
+sentences or poetry by comprehensively considering the entirety of sentence
+information. This approach enhances semantic expression, distinguishing it from
+autoregressive and large language models (LLMs). For metrical control, the
+separation feature of diffusion generation and its constraint control module
+enable us to flexibly incorporate a novel metrical controller to manipulate and
+evaluate metrics (format and rhythm). The denoising process in PoetryDiffusion
+allows for gradual enhancement of semantics and flexible integration of the
+metrical controller which can calculate and impose penalties on states that
+stray significantly from the target control distribution. Experimental results
+on two datasets demonstrate that our model outperforms existing models in
+automatic evaluation of semantic, metrical, and overall performance as well as
+human evaluation.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ A Baseline Analysis of Reward Models' Ability To Accurately Analyze
+ Foundation Models Under Distribution Shift
+
+
+
+
+
+
+
+
+ Will LeVine, Ben Pikus, Tony Chen, Sean Hendryx
+
+
+ Foundation models, specifically Large Language Models (LLM's), have lately
+gained wide-spread attention and adoption. Reinforcement Learning with Human
+Feedback (RLHF) involves training a reward model to capture desired behaviors,
+which is then used to align LLM's. These reward models are additionally used at
+inference-time to estimate LLM responses' adherence to those desired behaviors.
+However, there is little work measuring how robust these reward models are to
+distribution shifts. In this work, we evaluate how reward model performance -
+measured via accuracy and calibration (i.e. alignment between accuracy and
+confidence) - is affected by distribution shift. We show novel calibration
+patterns and accuracy drops due to OOD prompts and responses, and that the
+reward model is more sensitive to shifts in responses than prompts.
+Additionally, we adapt an OOD detection technique commonly used in
+classification to the reward model setting to detect these distribution shifts
+in prompts and responses.
+
+
+ Despite commendable achievements made by existing work, prevailing multimodal
+sarcasm detection studies rely more on textual content over visual information.
+It unavoidably induces spurious correlations between textual words and labels,
+thereby significantly hindering the models' generalization capability. To
+address this problem, we define the task of out-of-distribution (OOD)
+multimodal sarcasm detection, which aims to evaluate models' generalizability
+when the word distribution is different in training and testing settings.
+Moreover, we propose a novel debiasing multimodal sarcasm detection framework
+with contrastive learning, which aims to mitigate the harmful effect of biased
+textual factors for robust OOD generalization. In particular, we first design
+counterfactual data augmentation to construct the positive samples with
+dissimilar word biases and negative samples with similar word biases.
+Subsequently, we devise an adapted debiasing contrastive learning mechanism to
+empower the model to learn robust task-relevant features and alleviate the
+adverse effect of biased words. Extensive experiments show the superiority of
+the proposed framework.
+
+
+
+
+
+
+
+ ♻ ☆ Label Words are Anchors: An Information Flow Perspective for
+ Understanding In-Context Learning EMNLP 2023
+
+
+
+
+
+
+
+
+ Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
+
+
+ In-context learning (ICL) emerges as a promising capability of large language
+models (LLMs) by providing them with demonstration examples to perform diverse
+tasks. However, the underlying mechanism of how LLMs learn from the provided
+context remains under-explored. In this paper, we investigate the working
+mechanism of ICL through an information flow lens. Our findings reveal that
+label words in the demonstration examples function as anchors: (1) semantic
+information aggregates into label word representations during the shallow
+computation layers' processing; (2) the consolidated information in label words
+serves as a reference for LLMs' final predictions. Based on these insights, we
+introduce an anchor re-weighting method to improve ICL performance, a
+demonstration compression technique to expedite inference, and an analysis
+framework for diagnosing ICL errors in GPT2-XL. The promising applications of
+our findings again validate the uncovered ICL working mechanism and pave the
+way for future studies.
+
+
+
+ comment: Accepted by EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Chain-of-Questions Training with Latent Answers for Robust Multistep
+ Question Answering EMNLP 2023
+
+
+
+
+
+
+
+
+ Wang Zhu, Jesse Thomason, Robin Jia
+
+
+ We train a language model (LM) to robustly answer multistep questions by
+generating and answering sub-questions. We propose Chain-of-Questions, a
+framework that trains a model to generate sub-questions and sub-answers one at
+a time by leveraging human annotated question decomposition meaning
+representation (QDMR). The key technical challenge is that QDMR only contains
+sub-questions but not answers to those sub-questions, so we treat sub-answers
+as latent variables and optimize them using a novel dynamic mixture of Hard-EM
+and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods
+by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA
+adversarial set, thus demonstrating the effectiveness and robustness of our
+framework.
+
+
+
+ comment: Accepted by the EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions? CVPR 2023
+
+
+
+
+
+
+
+
+ Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason
+
+
+ Data augmentation via back-translation is common when pretraining
+Vision-and-Language Navigation (VLN) models, even though the generated
+instructions are noisy. But: does that noise matter? We find that nonsensical
+or irrelevant language instructions during pretraining can have little effect
+on downstream performance for both HAMT and VLN-BERT on R2R, and is still
+better than only using clean, human data. To underscore these results, we
+concoct an efficient augmentation method, Unigram + Object, which generates
+nonsensical instructions that nonetheless improve downstream performance. Our
+findings suggest that what matters for VLN R2R pretraining is the quantity of
+visual trajectories, not the quality of instructions.
+
+
+
+ comment: Accepted by O-DRUM @ CVPR 2023
+
+
+
+
+
+
+ ♻ ☆ "Paraphrasing The Original Text" Makes High Accuracy Long-Context QA
+
+
+ Although LLMs continue to iterate and improve, most open-source models still
+have a context window of no more than 4k, limiting their ability to handle
+long-context problems. Most existing open-source models for long-context chat
+still lack satisfactory accuracy. To address this issue, I approach it from the
+perspective of training data and theoretically prove that training the
+capability to handle long contexts requires "effective" rather than "long"
+data. Based on this, I propose using the "original text paraphrase" task, and
+successfully extend the context window of the existing model to 32k by a
+low-cost and effective method, achieving extremely high accuracy in
+multi-document-QA and surpassing all existing open-source models of the same
+scale. The model and training data have been open-sourced on
+HuggingFace(https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k) and
+WiseModel(https://wisemodel.cn/models/yuyijiong/Qwen-14b-chat-yarn-32k).
+
+
+
+ comment: Chinese version of this paper can be downloaded from
+ (https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)
+
+
+
+
+
+
+ ♻ ☆ GraphGPT: Graph Instruction Tuning for Large Language Models
+
+
+ Graph Neural Networks (GNNs) have advanced graph structure understanding via
+recursive information exchange and aggregation among graph nodes. To improve
+model robustness, self-supervised learning (SSL) has emerged as a promising
+approach for data augmentation. However, existing methods for generating
+pre-trained graph embeddings often rely on fine-tuning with specific downstream
+task labels, which limits their usability in scenarios where labeled data is
+scarce or unavailable. To address this, our research focuses on advancing the
+generalization capabilities of graph models in challenging zero-shot learning
+scenarios. Inspired by the success of large language models (LLMs), we aim to
+develop a graph-oriented LLM that can achieve high generalization across
+diverse downstream datasets and tasks, even without any information available
+from the downstream graph data. In this work, we present the GraphGPT framework
+that aligns LLMs with graph structural knowledge with a graph instruction
+tuning paradigm. Our framework incorporates a text-graph grounding component to
+establish a connection between textual information and graph structures.
+Additionally, we propose a dual-stage instruction tuning paradigm, accompanied
+by a lightweight graph-text alignment projector. This paradigm explores
+self-supervised graph structural signals and task-specific graph instructions,
+to guide LLMs in understanding complex graph structures and improving their
+adaptability across different downstream tasks. Our framework is evaluated on
+supervised and zero-shot graph learning tasks, demonstrating superior
+generalization and outperforming state-of-the-art baselines.
+
+
+
+
+
+
+
+ ♻ ☆ Inducing Character-level Structure in Subword-based Language Models with
+ Type-level Interchange Intervention Training ACL 2023
+
+
+ Language tasks involving character-level manipulations (e.g., spelling
+corrections, arithmetic operations, word games) are challenging for models
+operating on subword units. To address this, we develop a causal intervention
+framework to learn robust and interpretable character representations inside
+subword-based language models. Our method treats each character as a typed
+variable in a causal model and learns such causal structures by adapting the
+interchange intervention training method of Geiger et al. (2021). We
+additionally introduce a suite of character-level tasks that systematically
+vary in their dependence on meaning and sequence-level context. While
+character-level models still perform best on purely form-based tasks like
+string reversal, our method outperforms character-level models on more complex
+tasks that blend form, meaning, and context, such as spelling correction in
+context and word search games. Compared with standard subword-based models, our
+approach also significantly improves robustness on unseen token sequences and
+leads to human-interpretable internal representations of characters.
+
+
+
+ comment: Findings of the Association for Computational Linguistics: ACL 2023
+
+
+
+
+
+
+ ♻ ☆ VLIS: Unimodal Language Models Guide Multimodal Language Generation EMNLP 2023
+
+
+ Multimodal language generation, which leverages the synergy of language and
+vision, is a rapidly expanding field. However, existing vision-language models
+face challenges in tasks that require complex linguistic understanding. To
+address this issue, we introduce Visual-Language models as Importance Sampling
+weights (VLIS), a novel framework that combines the visual conditioning
+capability of vision-language models with the language understanding of
+unimodal text-only language models without further training. It extracts
+pointwise mutual information of each image and text from a visual-language
+model and uses the value as an importance sampling weight to adjust the token
+likelihood from a text-only model. VLIS improves vision-language models on
+diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and
+ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning,
+and ROCStories). Our results suggest that VLIS represents a promising new
+direction for multimodal language generation.
+
+
+ In text generation, a large language model (LM) makes a choice of each new
+word based only on the former selection of its context using the softmax
+function. Nevertheless, the link statistics information of concurrent words
+based on a scene-specific corpus is valuable in choosing the next word, which
+can help to ensure the topic of the generated text to be aligned with the
+current task. To fully explore the co-occurrence information,we propose a
+graphmax function for task-specific text generation. Using the graph-based
+regularization, graphmax enables the final word choice to be determined by both
+the global knowledge from the LM and the local knowledge from the
+scene-specific corpus. The traditional softmax function is regularized with a
+graph total variation (GTV) term, which incorporates the local knowledge into
+the LM and encourages the model to consider the statistical relationships
+between words in a scene-specific corpus. The proposed graphmax is versatile
+and can be readily plugged into any large pre-trained LM for text generation
+and machine translation. Through extensive experiments, we demonstrate that the
+new GTV-based regularization can improve performances in various natural
+language processing tasks in comparison with existing methods. Moreover,
+through human experiments, we observe that participants can easily distinguish
+the text generated by graphmax or softmax.
+
+
+
+
+
+
+
+ ♻ ☆ Communicative Agents for Software Development
+
+
+ Software engineering is a domain characterized by intricate decision-making
+processes, often relying on nuanced intuition and consultation. Recent
+advancements in deep learning have started to revolutionize software
+engineering practices through elaborate designs implemented at various stages
+of software development. In this paper, we present an innovative paradigm that
+leverages large language models (LLMs) throughout the entire software
+development process, streamlining and unifying key processes through natural
+language communication, thereby eliminating the need for specialized models at
+each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered
+software development company that mirrors the established waterfall model,
+meticulously dividing the development process into four distinct chronological
+stages: designing, coding, testing, and documenting. Each stage engages a team
+of "software agents", such as programmers, code reviewers, and test engineers,
+fostering collaborative dialogue and facilitating a seamless workflow. The chat
+chain acts as a facilitator, breaking down each stage into atomic subtasks.
+This enables dual roles, allowing for proposing and validating solutions
+through context-aware communication, leading to efficient resolution of
+specific subtasks. The instrumental analysis of ChatDev highlights its
+remarkable efficacy in software generation, enabling the completion of the
+entire software development process in under seven minutes at a cost of less
+than one dollar. It not only identifies and alleviates potential
+vulnerabilities but also rectifies potential hallucinations while maintaining
+commendable efficiency and cost-effectiveness. The potential of ChatDev unveils
+fresh possibilities for integrating LLMs into the realm of software
+development. Our code is available at https://github.com/OpenBMB/ChatDev.
+
+
+
+ comment: https://github.com/OpenBMB/ChatDev
+
+
+
+
+
+
+ ♻ ☆ FP8-LM: Training FP8 Large Language Models
+
+
+ In this paper, we explore FP8 low-bit data formats for efficient training of
+large language models (LLMs). Our key insight is that most variables, such as
+gradients and optimizer states, in LLM training can employ low-precision data
+formats without compromising model accuracy and requiring no changes to
+hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision
+framework for training LLMs. This framework offers three levels of FP8
+utilization to streamline mixed-precision and distributed parallel training for
+LLMs. It gradually incorporates 8-bit gradients, optimizer states, and
+distributed learning in an incremental manner. Experiment results show that,
+during the training of GPT-175B model on H100 GPU platform, our FP8
+mixed-precision training framework not only achieved a remarkable 39% reduction
+in real memory usage but also ran 75% faster than the widely adopted BF16
+framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer
+Engine by 37%. This largely reduces the training costs for large foundation
+models. Furthermore, our FP8 mixed-precision training methodology is generic.
+It can be seamlessly applied to other tasks such as LLM instruction tuning and
+reinforcement learning with human feedback, offering savings in fine-tuning
+expenses. Our FP8 low-precision training framework is open-sourced at
+{https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.
+
+
+
+
+
+
+
+ ♻ ☆ Narrowing the Gap between Supervised and Unsupervised Sentence
+ Representation Learning with Large Language Model AAAI24
+
+
+
+
+
+
+
+
+ Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao
+
+
+ Sentence Representation Learning (SRL) is a fundamental task in Natural
+Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings
+(CSE) being the mainstream technique due to its superior performance. An
+intriguing phenomenon in CSE is the significant performance gap between
+supervised and unsupervised methods, with their only difference lying in the
+training data. Previous works attribute this performance gap to differences in
+two representation properties (alignment and uniformity). However, since
+alignment and uniformity only measure the results, they fail to answer "What
+aspects of the training data contribute to the performance gap?" and "How can
+the performance gap be narrowed?", In this paper, we conduct empirical
+experiments to answer these "What" and "How" questions. We first answer the
+"What" question by thoroughly comparing the behavior of supervised and
+unsupervised CSE during their respective training processes. From the
+comparison, we identify the similarity pattern as a key factor to the
+performance gap, and introduce a metric, called Relative Fitting Difficulty
+(RFD), to measure the complexity of the similarity pattern. Then, based on the
+insights gained from the "What" question, we tackle the "How" question by
+increasing the pattern complexity of the training data. We achieve this by
+leveraging the In-Context Learning (ICL) capability of the Large Language Model
+(LLM) to generate data that simulates complex patterns. By utilizing the
+hierarchical patterns in the LLM-generated data, we effectively narrow the gap
+between supervised and unsupervised CSE. We release our codes and appendix at
+https://github.com/BDBC-KG-NLP/NGCSE.
+
+
+
+ comment: Accepted at AAAI24
+
+
+
+
+
+
+ ♻ ☆ Recurrent Neural Language Models as Probabilistic Finite-state Automata
+
+
+ Studying language models (LMs) in terms of well-understood formalisms allows
+us to precisely characterize their abilities and limitations. Previous work has
+investigated the representational capacity of recurrent neural network (RNN)
+LMs in terms of their capacity to recognize unweighted formal languages.
+However, LMs do not describe unweighted formal languages -- rather, they define
+\emph{probability distributions} over strings. In this work, we study what
+classes of such probability distributions RNN LMs can represent, which allows
+us to make more direct statements about their capabilities. We show that simple
+RNNs are equivalent to a subclass of probabilistic finite-state automata, and
+can thus model a strict subset of probability distributions expressible by
+finite-state models. Furthermore, we study the space complexity of representing
+finite-state LMs with RNNs. We show that, to represent an arbitrary
+deterministic finite-state LM with $N$ states over an alphabet $\alphabet$, an
+RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a
+first step towards characterizing the classes of distributions RNN LMs can
+represent and thus help us understand their capabilities and limitations.
+
+
+
+ comment: 9 pages
+
+
+
+
+
+
+ ♻ ☆ Word-Graph2vec: An efficient word embedding approach on word
+ co-occurrence graph using random walk sampling
+
+
+
+
+
+
+
+
+ Wenting Li, Jiahong Xue, Xi Zhang, Huacan Chen, Zeyu Chen, Yuanzhe Cai
+
+
+ Word embedding has become ubiquitous and is widely used in various text
+mining and natural language processing (NLP) tasks, such as information
+retrieval, semantic analysis, and machine translation, among many others.
+Unfortunately, it is prohibitively expensive to train the word embedding in a
+relatively large corpus. We propose a graph-based word embedding algorithm,
+called Word-Graph2vec, which converts the large corpus into a word
+co-occurrence graph, then takes the word sequence samples from this graph by
+randomly traveling and trains the word embedding on this sampling corpus in the
+end. We posit that because of the stable vocabulary, relative idioms, and fixed
+expressions in English, the size and density of the word co-occurrence graph
+change slightly with the increase in the training corpus. So that
+Word-Graph2vec has stable runtime on the large scale data set, and its
+performance advantage becomes more and more obvious with the growth of the
+training corpus. Extensive experiments conducted on real-world datasets show
+that the proposed algorithm outperforms traditional Skip-Gram by four-five
+times in terms of efficiency, while the error generated by the random walk
+sampling is small.
+
+
+
+
+
+
+
+ ♻ ☆ Meta-Referential Games to Learn Compositional Learning Behaviours
+
+
+
+
+
+
+
+
+ Kevin Denamganaï, Sondess Missaoui, James Alfred Walker
+
+
+ Human beings use compositionality to generalise from past experiences to
+novel experiences. We assume a separation of our experiences into fundamental
+atomic components that can be recombined in novel ways to support our ability
+to engage with novel experiences. We frame this as the ability to learn to
+generalise compositionally, and we will refer to behaviours making use of this
+ability as compositional learning behaviours (CLBs). A central problem to
+learning CLBs is the resolution of a binding problem (BP). While it is another
+feat of intelligence that human beings perform with ease, it is not the case
+for state-of-the-art artificial agents. Thus, in order to build artificial
+agents able to collaborate with human beings, we propose to develop a novel
+benchmark to investigate agents' abilities to exhibit CLBs by solving a
+domain-agnostic version of the BP. We take inspiration from the language
+emergence and grounding framework of referential games and propose a
+meta-learning extension of referential games, entitled Meta-Referential Games,
+and use this framework to build our benchmark, the Symbolic Behaviour Benchmark
+(S2B). We provide baseline results and error analysis showing that our
+benchmark is a compelling challenge that we hope will spur the research
+community towards developing more capable artificial agents.
+
+
+ Entity alignment (EA) seeks identical entities in different knowledge graphs,
+which is a long-standing task in the database research. Recent work leverages
+deep learning to embed entities in vector space and align them via nearest
+neighbor search. Although embedding-based EA has gained marked success in
+recent years, it lacks explanations for alignment decisions. In this paper, we
+present the first framework that can generate explanations for understanding
+and repairing embedding-based EA results. Given an EA pair produced by an
+embedding model, we first compare its neighbor entities and relations to build
+a matching subgraph as a local explanation. We then construct an alignment
+dependency graph to understand the pair from an abstract perspective. Finally,
+we repair the pair by resolving three types of alignment conflicts based on
+dependency graphs. Experiments on a variety of EA datasets demonstrate the
+effectiveness, generalization, and robustness of our framework in explaining
+and repairing embedding-based EA results.
+
+
+
+ comment: Accepted in the 40th IEEE International Conference on Data
+ Engineering (ICDE 2024)
+
+
+
+
+
+
+ ♻ ☆ SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment
+ to Cultural Reasoning
+
+
+
+
+
+
+
+
+ Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, Nancy F. Chen
+
+
+ We present SeaEval, a benchmark for multilingual foundation models. In
+addition to characterizing how these models understand and reason with natural
+language, we also investigate how well they comprehend cultural practices,
+nuances, and values. Alongside standard accuracy metrics, we investigate the
+brittleness of foundation models in the dimensions of semantics and
+multilinguality. Our analyses span both open-sourced and closed models, leading
+to empirical results across classic NLP tasks, reasoning, and cultural
+comprehension. Key findings indicate (1) Most models exhibit varied behavior
+when given paraphrased instructions. (2) Many models still suffer from exposure
+bias (e.g., positional bias, majority label bias). (3) For questions rooted in
+factual, scientific, and commonsense knowledge, consistent responses are
+expected across multilingual queries that are semantically equivalent. Yet,
+most models surprisingly demonstrate inconsistent performance on these queries.
+(4) Multilingually-trained models have not attained "balanced multilingual"
+capabilities. Our endeavors underscore the need for more generalizable semantic
+representations and enhanced multilingual contextualization. SeaEval can serve
+as a launchpad for more thorough investigations and evaluations for
+multilingual and multicultural scenarios.
+
+
+
+ comment: 20 pages. More datasets (2 on Cross-Lingual Consistency and 4 on
+ Cultural Understanding) and more supported languages. Code:
+ https://github.com/SeaEval/SeaEval
+
+ Length extrapolation has attracted considerable attention recently since it
+allows transformers to be tested on longer sequences than those used in
+training. Previous research has shown that this property can be attained by
+using carefully designed Relative Positional Encodings (RPEs). While these
+methods perform well on a variety of corpora, the conditions for length
+extrapolation have yet to be investigated. This paper attempts to determine
+what types of RPEs allow for length extrapolation through a thorough
+mathematical and empirical analysis. We discover that a transformer is certain
+to possess this property as long as the series that corresponds to the RPE's
+exponential converges. Two practices are derived from the conditions and
+examined in language modeling tasks on a variety of corpora. As a bonus from
+the conditions, we derive a new Theoretical Receptive Field (TRF) to measure
+the receptive field of RPEs without taking any training steps. Extensive
+experiments are conducted on the Wikitext-103, Books, Github, and WikiBook
+datasets to demonstrate the viability of our discovered conditions. We also
+compare TRF to Empirical Receptive Field (ERF) across different models, showing
+consistently matched trends on the aforementioned datasets. The code is
+available at https://github.com/OpenNLPLab/Rpe.
+
+
+
+ comment: AAAI Camera Ready. Zhen Qin and Yiran Zhong contribute equally to
+ this paper; Yiran Zhong is the corresponding author. The code is available at
+ https://github.com/OpenNLPLab/Rpe
+
+
+
+
+
+
+ ♻ ☆ Split and Rephrase with Large Language Models
+
+
+ The Split and Rephrase task, which consists in splitting complex sentences
+into a sequence of shorter grammatical sentences, while preserving the original
+meaning, can facilitate the processing of complex texts for humans and machines
+alike. In this work, we describe an approach based on large language models,
+which improves over the state of the art by large margins on all the major
+metrics for the task, on publicly available datasets. We also describe results
+from two human evaluations that further establish the significant improvements
+obtained with large language models and the viability of the approach. We
+evaluate different strategies, including fine-tuning pretrained language models
+of varying parameter size, and applying both zero-shot and few-shot in-context
+learning on instruction-tuned language models. Although the latter were
+markedly outperformed by fine-tuned models, they still achieved promising
+results overall. Our results thus demonstrate the strong potential of different
+variants of large language models for the Split and Rephrase task, using
+relatively small amounts of training samples and model parameters overall.
+
+
+
+
+
+
+
+ ♻ ☆ GPT-Fathom: Benchmarking Large Language Models to Decipher the
+ Evolutionary Path towards GPT-4 and Beyond
+
+
+ With the rapid advancement of large language models (LLMs), there is a
+pressing need for a comprehensive evaluation suite to assess their capabilities
+and limitations. Existing LLM leaderboards often reference scores reported in
+other papers without consistent settings and prompts, which may inadvertently
+encourage cherry-picking favored settings and prompts for better results. In
+this work, we introduce GPT-Fathom, an open-source and reproducible LLM
+evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+
+leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across
+7 capability categories, all under aligned settings. Our retrospective study on
+OpenAI's earlier models offers valuable insights into the evolutionary path
+from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3
+progressively improves to GPT-4, including technical details like whether
+adding code data improves LLM's reasoning capability, which aspects of LLM
+capability can be improved by SFT and RLHF, how much is the alignment tax, etc.
+Our analysis sheds light on many of these questions, aiming to improve the
+transparency of advanced LLMs.
+
+
+
+
+
+
+
+ ♻ ☆ Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse
+ Biomedical Tasks
+
+
+ Objective: Most existing fine-tuned biomedical large language models (LLMs)
+focus on enhancing performance in monolingual biomedical question answering and
+conversation tasks. To investigate the effectiveness of the fine-tuned LLMs on
+diverse biomedical NLP tasks in different languages, We present Taiyi, a
+bilingual fine-tuned LLM for diverse biomedical tasks. Materials and Methods:
+We first curated a comprehensive collection of 140 existing biomedical text
+mining datasets (102 English and 38 Chinese datasets) across over 10 task
+types. Subsequently, a two-stage strategy is proposed for supervised
+fine-tuning to optimize the model performance across varied tasks. Results:
+Experimental results on 13 test sets covering named entity recognition,
+relation extraction, text classification, question answering tasks demonstrate
+that Taiyi achieves superior performance compared to general LLMs. The case
+study involving additional biomedical NLP tasks further shows Taiyi's
+considerable potential for bilingual biomedical multi-tasking. Conclusion:
+Leveraging rich high-quality biomedical corpora and developing effective
+fine-tuning strategies can significantly improve the performance of LLMs within
+the biomedical domain. Taiyi shows the bilingual multi-tasking capability
+through supervised fine-tuning. However, those tasks such as information
+extraction that are not generation tasks in nature remain challenging for
+LLM-based generative approaches, and they still underperform the conventional
+discriminative approaches of smaller language models.
+
+
+
+
+
+
+
+ ♻ ☆ ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models
+ with Enhanced Adapter
+
+
+
+
+
+
+
+
+ Zhengqing Yuan, Xinyi Wang, Kun Wang, Lichao Sun, Yanfang Ye
+
+
+ In recent years, advancements in large language models have been remarkable,
+with models such as ChatGPT demonstrating exceptional proficiency in diverse
+linguistic tasks. The pre-training of large models with billions of parameters,
+poses a formidable challenge, primarily due to the scarcity of datasets of a
+commensurate scale for effective training. Nevertheless, innovative strategies
+have emerged, including methods to fine-tune these pre-trained models using
+fewer parameters set, as evidenced by models like MiniGPT-4 and LLaVA. Despite
+their potential in various domains, these models remain limited in their
+understanding of artistic imagery. They have yet to fully grasp the intricate
+nuances of art images or to provide an objective articulation of the emotions
+they evoke, in a manner akin to human perception. This work introduces
+ArtGPT-4, a pioneering large vision-language model tailored to address the
+deficiencies of contemporary models in artistic comprehension. ArtGPT-4
+underwent training on image-text pairs utilizing a Tesla A100 device in a mere
+2 hours, with a dataset comprising approximately 0.52M entries. Impressively,
+the model can render images with an artistic-understanding and convey the
+emotions they inspire, mirroring human interpretation. Additionally, this work
+presents a unique dataset designed to evaluate the efficacy of vision-language
+models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art
+performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the
+established benchmarks introduced in This study, lagging behind professional
+artists' descriptions by a negligible 0.15 points on a 6-point scale. The code
+and the pre-trained model are accessible in
+https://huggingface.co/Tyrannosaurus/ArtGPT-4.
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ♻ ☆ Compositional Generalization for Multi-label Text Classification: A
+ Data-Augmentation Approach AAAI'24
+
+
+
+
+
+
+
+
+ Yuyang Chai, Zhuang Li, Jiahui Liu, Lei Chen, Fei Li, Donghong Ji, Chong Teng
+
+
+ Despite significant advancements in multi-label text classification, the
+ability of existing models to generalize to novel and seldom-encountered
+complex concepts, which are compositions of elementary ones, remains
+underexplored. This research addresses this gap. By creating unique data splits
+across three benchmarks, we assess the compositional generalization ability of
+existing multi-label text classification models. Our results show that these
+models often fail to generalize to compositional concepts encountered
+infrequently during training, leading to inferior performance on tests with
+these new combinations. To address this, we introduce a data augmentation
+method that leverages two innovative text generation models designed to enhance
+the classification models' capacity for compositional generalization. Our
+experiments show that this data augmentation approach significantly improves
+the compositional generalization capabilities of classification models on our
+benchmarks, with both generation models surpassing other text generation
+baselines.
+
+
+
+ comment: Accepted by AAAI'24
+
+
+
+
+
+
+ ♻ ☆ Understanding the Instruction Mixture for Large Language Model
+ Fine-tuning
+
+
+ While instructions fine-tuning of large language models (LLMs) has been
+proven to enhance performance across various applications, the influence of the
+instruction dataset mixture on LLMs has not been thoroughly explored. In this
+study, we classify instructions into three main types: NLP downstream tasks,
+coding, and general chatting, and investigate their impact on LLMs. Our
+findings reveal that specific types of instructions are more beneficial for
+particular uses, while it may cause harms to other aspects, emphasizing the
+importance of meticulously designing the instruction mixture to maximize model
+performance. This study sheds light on the instruction mixture and paves the
+way for future research.
+
+
+
+ comment: Instruction Tuning, Large Language Model, Alignment
+
+
+
+
+
+
+ ♻ ☆ The Good, The Bad, and Why: Unveiling Emotions in Generative AI
+
+
+ Emotion significantly impacts our daily behaviors and interactions. While
+recent generative AI models, such as large language models, have shown
+impressive performance in various tasks, it remains unclear whether they truly
+comprehend emotions. This paper aims to address this gap by incorporating
+psychological theories to gain a holistic understanding of emotions in
+generative AI models. Specifically, we propose three approaches: 1)
+EmotionPrompt to enhance AI model performance, 2) EmotionAttack to impair AI
+model performance, and 3) EmotionDecode to explain the effects of emotional
+stimuli, both benign and malignant. Through extensive experiments involving
+language and multi-modal models on semantic understanding, logical reasoning,
+and generation tasks, we demonstrate that both textual and visual EmotionPrompt
+can boost the performance of AI models while EmotionAttack can hinder it.
+Additionally, EmotionDecode reveals that AI models can comprehend emotional
+stimuli akin to the mechanism of dopamine in the human brain. Our work heralds
+a novel avenue for exploring psychology to enhance our understanding of
+generative AI models. This paper is an extended version of our previous work
+EmotionPrompt (arXiv:2307.11760).
+
+
+
+ comment: Technical report; an extension to EmotionPrompt (arXiv:2307.11760);
+ 34 pages
+
+
+
+
+
+
+ ♻ ☆ One Shot Learning as Instruction Data Prospector for Large Language
+ Models
+
+
+
+
+
+
+
+
+ Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu, Tongliang Liu, Fei Huang, Yongbin Li
+
+
+ Aligning large language models(LLMs) with human is a critical step in
+effectively utilizing their pre-trained capabilities across a wide array of
+language tasks. Current instruction tuning practices often rely on expanding
+dataset size without a clear strategy for ensuring data quality, which can
+inadvertently introduce noise and degrade model performance. To address this
+challenge, we introduce Nuggets, a novel and efficient methodology that employs
+one shot learning to select high-quality instruction data from expansive
+datasets. Nuggets assesses the potential of individual instruction examples to
+act as effective one shot examples, thereby identifying those that can
+significantly enhance diverse task performance. Nuggets utilizes a scoring
+system based on the impact of candidate examples on the perplexity of a diverse
+anchor set, facilitating the selection of the most beneficial data for
+instruction tuning. Through rigorous testing on two benchmarks, including
+MT-Bench and Alpaca-Eval, we demonstrate that instruction tuning with the top
+1% of Nuggets-curated examples substantially outperforms conventional methods
+that use the full dataset. These findings advocate for a data selection
+paradigm that prioritizes quality, offering a more efficient pathway to align
+LLMs with humans.
+
+
+
+
+
+
+
+ ♻ ☆ How to Bridge the Gap between Modalities: A Comprehensive Survey on
+ Multimodal Large Language Model
+
+
+ This review paper explores Multimodal Large Language Models (MLLMs), which
+integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data
+such as text and vision. MLLMs demonstrate capabilities like generating image
+narratives and answering image-based questions, bridging the gap towards
+real-world human-computer interactions and hinting at a potential pathway to
+artificial general intelligence. However, MLLMs still face challenges in
+processing the semantic gap in multimodality, which may lead to erroneous
+generation, posing potential risks to society. Choosing the appropriate
+modality alignment method is crucial, as improper methods might require more
+parameters with limited performance improvement. This paper aims to explore
+modality alignment methods for LLMs and their existing capabilities.
+Implementing modality alignment allows LLMs to address environmental issues and
+enhance accessibility. The study surveys existing modal alignment methods in
+MLLMs into four groups: (1) Multimodal Converters that change data into
+something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs
+perceive different types of data; (3) Tools Assistance for changing data into
+one common format, usually text; and (4) Data-Driven methods that teach LLMs to
+understand specific types of data in a dataset. This field is still in a phase
+of exploration and experimentation, and we will organize and update various
+existing research methods for multimodal information alignment.
+
+
+
+
+
+
+
+ ♻ ☆ Addressing Token Uniformity in Transformers via Singular Value
+ Transformation UAI2022
+
+
+
+
+
+
+
+
+ Hanqi Yan, Lin Gui, Wenjie Li, Yulan He
+
+
+ Token uniformity is commonly observed in transformer-based models, in which
+different tokens share a large proportion of similar information after going
+through stacked multiple self-attention layers in a transformer. In this paper,
+we propose to use the distribution of singular values of outputs of each
+transformer layer to characterise the phenomenon of token uniformity and
+empirically illustrate that a less skewed singular value distribution can
+alleviate the `token uniformity' problem. Base on our observations, we define
+several desirable properties of singular value distributions and propose a
+novel transformation function for updating the singular values. We show that
+apart from alleviating token uniformity, the transformation function should
+preserve the local neighbourhood structure in the original embedding space. Our
+proposed singular value transformation function is applied to a range of
+transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT,
+and improved performance is observed in semantic textual similarity evaluation
+and a range of GLUE tasks. Our source code is available at
+https://github.com/hanqi-qi/tokenUni.git.
+
+
+
+ comment: UAI2022 Main Conference, Spotlight, combined with supplementary files
+
+
+
+
+
+
+ ♻ ☆ Position Bias Mitigation: A Knowledge-Aware Graph Model for Emotion
+ Cause Extraction ACL2021
+
+
+
+
+
+
+
+
+ Hanqi Yan, Lin Gui, Gabriele Pergola, Yulan He
+
+
+ The Emotion Cause Extraction (ECE)} task aims to identify clauses which
+contain emotion-evoking information for a particular emotion expressed in text.
+We observe that a widely-used ECE dataset exhibits a bias that the majority of
+annotated cause clauses are either directly before their associated emotion
+clauses or are the emotion clauses themselves. Existing models for ECE tend to
+explore such relative position information and suffer from the dataset bias. To
+investigate the degree of reliance of existing ECE models on clause relative
+positions, we propose a novel strategy to generate adversarial examples in
+which the relative position information is no longer the indicative feature of
+cause clauses. We test the performance of existing models on such adversarial
+examples and observe a significant performance drop. To address the dataset
+bias, we propose a novel graph-based method to explicitly model the emotion
+triggering paths by leveraging the commonsense knowledge to enhance the
+semantic dependencies between a candidate clause and an emotion clause.
+Experimental results show that our proposed approach performs on par with the
+existing state-of-the-art methods on the original ECE dataset, and is more
+robust against adversarial attacks compared to existing models.
+
+
+
+ comment: ACL2021 Main Conference, Oral paper
+
+
+
+
+
+
+ ♻ ☆ LLMR: Real-time Prompting of Interactive Worlds using Large Language
+ Models
+
+
+
+
+
+
+
+
+ Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores Fernandez, Jaron Lanier
+
+
+ We present Large Language Model for Mixed Reality (LLMR), a framework for the
+real-time creation and modification of interactive Mixed Reality experiences
+using LLMs. LLMR leverages novel strategies to tackle difficult cases where
+ideal training data is scarce, or where the design goal requires the synthesis
+of internal dynamics, intuitive analysis, or advanced interactivity. Our
+framework relies on text interaction and the Unity game engine. By
+incorporating techniques for scene understanding, task planning,
+self-debugging, and memory management, LLMR outperforms the standard GPT-4 by
+4x in average error rate. We demonstrate LLMR's cross-platform interoperability
+with several example worlds, and evaluate it on a variety of creation and
+modification tasks to show that it can produce and edit diverse objects, tools,
+and scenes. Finally, we conducted a usability study (N=11) with a diverse set
+that revealed participants had positive experiences with the system and would
+use it again.
+
+
+
+ comment: 60 pages, 18 figures; Expanded discussion of experiments and the
+ influence of various modules
+
+
+
+
+
+
+ ♻ ☆ GPT-4 Technical Report
+
+
+
+
+
+
+
+
+ OpenAI, :, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mo Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph
+
+
+ We report the development of GPT-4, a large-scale, multimodal model which can
+accept image and text inputs and produce text outputs. While less capable than
+humans in many real-world scenarios, GPT-4 exhibits human-level performance on
+various professional and academic benchmarks, including passing a simulated bar
+exam with a score around the top 10% of test takers. GPT-4 is a
+Transformer-based model pre-trained to predict the next token in a document.
+The post-training alignment process results in improved performance on measures
+of factuality and adherence to desired behavior. A core component of this
+project was developing infrastructure and optimization methods that behave
+predictably across a wide range of scales. This allowed us to accurately
+predict some aspects of GPT-4's performance based on models trained with no
+more than 1/1,000th the compute of GPT-4.
+
+
+
+ comment: 100 pages; updated authors list
+
+
+
+
+
+
+ ♻ ☆ RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting
+
+
+
+
+
+
+
+
+ Lei Shu, Liangchen Luo, Jayakumar Hoskere, Yun Zhu, Yinxiao Liu, Simon Tong, Jindong Chen, Lei Meng
+
+
+ Large Language Models (LLMs) have demonstrated impressive capabilities in
+creative tasks such as storytelling and E-mail generation. However, as LLMs are
+primarily trained on final text results rather than intermediate revisions, it
+might be challenging for them to perform text rewriting tasks. Most studies in
+the rewriting tasks focus on a particular transformation type within the
+boundaries of single sentences. In this work, we develop new strategies for
+instruction tuning and reinforcement learning to better align LLMs for
+cross-sentence rewriting tasks using diverse wording and structures expressed
+through natural languages including 1) generating rewriting instruction data
+from Wiki edits and public corpus through instruction generation and
+chain-of-thought prompting; 2) collecting comparison data for reward model
+training through a new ranking function. To facilitate this research, we
+introduce OpenRewriteEval, a novel benchmark covers a wide variety of rewriting
+types expressed through natural language instructions. Our results show
+significant improvements over a variety of baselines. The public repository is
+available on GitHub under Google Research
+(https://github.com/google-research/google-research/tree/master/rewritelm).
+
+
+
+
+
+
+
+ ♻ ☆ Human-Centric Autonomous Systems With LLMs for User Command Reasoning WACV
+
+
+
+
+
+
+
+
+ Yi Yang, Qingwen Zhang, Ci Li, Daniel Simões Marta, Nazre Batool, John Folkesson
+
+
+ The evolution of autonomous driving has made remarkable advancements in
+recent years, evolving into a tangible reality. However, a human-centric
+large-scale adoption hinges on meeting a variety of multifaceted requirements.
+To ensure that the autonomous system meets the user's intent, it is essential
+to accurately discern and interpret user commands, especially in complex or
+emergency situations. To this end, we propose to leverage the reasoning
+capabilities of Large Language Models (LLMs) to infer system requirements from
+in-cabin users' commands. Through a series of experiments that include
+different LLM models and prompt designs, we explore the few-shot multivariate
+binary classification accuracy of system requirements from natural language
+textual commands. We confirm the general ability of LLMs to understand and
+reason about prompts but underline that their effectiveness is conditioned on
+the quality of both the LLM model and the design of appropriate sequential
+prompts. Code and models are public with the link
+\url{https://github.com/KTH-RPL/DriveCmd_LLM}.
+
+
+
+ comment: In Proceedings of the IEEE/CVF Winter Conference on Applications of
+ Computer Vision (WACV) Workshops, 2024
+
+
+
+
+
+
+ ♻ ☆ Characterizing Information Seeking Events in Health-Related Social
+ Discourse AAAI-2024
+
+
+
+
+
+
+
+
+ Omar Sharif, Madhusudan Basak, Tanzia Parvin, Ava Scharfstein, Alphonso Bradham, Jacob T. Borodovsky, Sarah E. Lord, Sarah M. Preum
+
+
+ Social media sites have become a popular platform for individuals to seek and
+share health information. Despite the progress in natural language processing
+for social media mining, a gap remains in analyzing health-related texts on
+social discourse in the context of events. Event-driven analysis can offer
+insights into different facets of healthcare at an individual and collective
+level, including treatment options, misconceptions, knowledge gaps, etc. This
+paper presents a paradigm to characterize health-related information-seeking in
+social discourse through the lens of events. Events here are board categories
+defined with domain experts that capture the trajectory of the
+treatment/medication. To illustrate the value of this approach, we analyze
+Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical
+global health concern. To the best of our knowledge, this is the first attempt
+to define event categories for characterizing information-seeking in OUD social
+discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel
+treatment information-seeking event dataset to analyze online discourse on an
+event-based framework. This dataset contains Reddit posts on
+information-seeking events related to recovery from OUD, where each post is
+annotated based on the type of events. We also establish a strong performance
+benchmark (77.4% F1 score) for the task by employing several machine learning
+and deep learning classifiers. Finally, we thoroughly investigate the
+performance and errors of ChatGPT on this task, providing valuable insights
+into the LLM's capabilities and ongoing characterization efforts.
+
+
+ Existing score-distilling text-to-3D generation techniques, despite their
+considerable promise, often encounter the view inconsistency problem. One of
+the most notable issues is the Janus problem, where the most canonical view of
+an object (\textit{e.g}., face or head) appears in other views. In this work,
+we explore existing frameworks for score-distilling text-to-3D generation and
+identify the main causes of the view inconsistency problem -- the embedded bias
+of 2D diffusion models. Based on these findings, we propose two approaches to
+debias the score-distillation frameworks for view-consistent text-to-3D
+generation. Our first approach, called score debiasing, involves cutting off
+the score estimated by 2D diffusion models and gradually increasing the
+truncation value throughout the optimization process. Our second approach,
+called prompt debiasing, identifies conflicting words between user prompts and
+view prompts using a language model, and adjusts the discrepancy between view
+prompts and the viewing direction of an object. Our experimental results show
+that our methods improve the realism of the generated 3D objects by
+significantly reducing artifacts and achieve a good trade-off between
+faithfulness to the 2D diffusion models and 3D consistency with little
+overhead. Our project page is available
+at~\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.
+
+
+
+
+
+
+
+ ♻ ☆ DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
+ Models NeurIPS 2023
+
+
+
+
+
+
+
+
+ Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
+
+
+ Generative Pre-trained Transformer (GPT) models have exhibited exciting
+progress in their capabilities, capturing the interest of practitioners and the
+public alike. Yet, while the literature on the trustworthiness of GPT models
+remains limited, practitioners have proposed employing capable GPT models for
+sensitive applications such as healthcare and finance -- where mistakes can be
+costly. To this end, this work proposes a comprehensive trustworthiness
+evaluation for large language models with a focus on GPT-4 and GPT-3.5,
+considering diverse perspectives -- including toxicity, stereotype bias,
+adversarial robustness, out-of-distribution robustness, robustness on
+adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
+evaluations, we discover previously unpublished vulnerabilities to
+trustworthiness threats. For instance, we find that GPT models can be easily
+misled to generate toxic and biased outputs and leak private information in
+both training data and conversation history. We also find that although GPT-4
+is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
+vulnerable given jailbreaking system or user prompts, potentially because GPT-4
+follows (misleading) instructions more precisely. Our work illustrates a
+comprehensive trustworthiness evaluation of GPT models and sheds light on the
+trustworthiness gaps. Our benchmark is publicly available at
+https://decodingtrust.github.io/; our dataset can be previewed at
+https://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of
+this work is at https://openreview.net/pdf?id=kaHpo8OZw2.
+
+
+
+ comment: NeurIPS 2023 Outstanding Paper (Datasets and Benchmarks Track)
+
+
+
+
+
+
+ ♻ ☆ Robust Contrastive Language-Image Pre-training against Data Poisoning
+ and Backdoor Attacks
+
+
+ Contrastive vision-language representation learning has achieved
+state-of-the-art performance for zero-shot classification, by learning from
+millions of image-caption pairs crawled from the internet. However, the massive
+data that powers large multimodal models such as CLIP, makes them extremely
+vulnerable to various types of targeted data poisoning and backdoor attacks.
+Despite this vulnerability, robust contrastive vision-language pre-training
+against such attacks has remained unaddressed. In this work, we propose ROCLIP,
+the first effective method for robust pre-training multimodal vision-language
+models against targeted data poisoning and backdoor attacks. ROCLIP effectively
+breaks the association between poisoned image-caption pairs by considering a
+relatively large and varying pool of random captions, and matching every image
+with the text that is most similar to it in the pool instead of its own
+caption, every few epochs.It also leverages image and text augmentations to
+further strengthen the defense and improve the performance of the model. Our
+extensive experiments show that ROCLIP renders state-of-the-art targeted data
+poisoning and backdoor attacks ineffective during pre-training CLIP models. In
+particular, ROCLIP decreases the success rate for targeted data poisoning
+attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while
+improving the model's linear probe performance by 10% and maintains a similar
+zero shot performance compared to CLIP. By increasing the frequency of
+matching, ROCLIP is able to defend strong attacks, which add up to 1% poisoned
+examples to the data, and successfully maintain a low attack success rate of
+12.5%, while trading off the performance on some tasks.
+
+
+
+
+
+
+
+
+ Jianghang Lin, Yunhang Shen, Bingquan Wang, Shaohui Lin, Ke Li, Liujuan Cao
+
+
+ Despite weakly supervised object detection (WSOD) being a promising step
+toward evading strong instance-level annotations, its capability is confined to
+closed-set categories within a single training dataset. In this paper, we
+propose a novel weakly supervised open-vocabulary object detection framework,
+namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize
+diverse datasets with only image-level annotations. To achieve this, we explore
+three vital strategies, including dataset-level feature adaptation, image-level
+salient object localization, and region-level vision-language alignment. First,
+we perform data-aware feature extraction to produce an input-conditional
+coefficient, which is leveraged into dataset attribute prototypes to identify
+dataset bias and help achieve cross-dataset generalization. Second, a
+customized location-oriented weakly supervised region proposal network is
+proposed to utilize high-level semantic layouts from the category-agnostic
+segment anything model to distinguish object boundaries. Lastly, we introduce a
+proposal-concept synchronized multiple-instance network, i.e., object mining
+and refinement with visual-semantic alignment, to discover objects matched to
+the text embeddings of concepts. Extensive experiments on Pascal VOC and MS
+COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art
+compared with previous WSOD methods in both close-set object localization and
+detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary
+learning to achieve on-par or even better performance than well-established
+fully-supervised open-vocabulary object detection (FSOVOD).
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+ Amodal perception, the ability to comprehend complete object structures from
+partial visibility, is a fundamental skill, even for infants. Its significance
+extends to applications like autonomous driving, where a clear understanding of
+heavily occluded objects is essential. However, modern detection and tracking
+algorithms often overlook this critical capability, perhaps due to the
+prevalence of modal annotations in most datasets. To address the scarcity of
+amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse
+categories in thousands of video sequences. Our dataset includes amodal and
+modal bounding boxes for visible and occluded objects, including objects that
+are partially out-of-frame. To enhance amodal tracking with object permanence,
+we leverage a lightweight plug-in module, the amodal expander, to transform
+standard, modal trackers into amodal ones through fine-tuning on a few hundred
+video sequences with data augmentation. We achieve a 3.3\% and 1.6\%
+improvement on the detection and tracking of occluded objects on TAO-Amodal.
+When evaluated on people, our method produces dramatic improvements of 2x
+compared to state-of-the-art modal baselines.
+
+
+ Denoising Probabilistic Models (DPMs) represent an emerging domain of
+generative models that excel in generating diverse and high-quality images.
+However, most current training methods for DPMs often neglect the correlation
+between timesteps, limiting the model's performance in generating images
+effectively. Notably, we theoretically point out that this issue can be caused
+by the cumulative estimation gap between the predicted and the actual
+trajectory. To minimize that gap, we propose a novel \textit{sequence-aware}
+loss that aims to reduce the estimation gap to enhance the sampling quality.
+Furthermore, we theoretically show that our proposed loss function is a tighter
+upper bound of the estimation loss in comparison with the conventional loss in
+DPMs. Experimental results on several benchmark datasets including CIFAR10,
+CelebA, and CelebA-HQ consistently show a remarkable improvement of our
+proposed method regarding the image generalization quality measured by FID and
+Inception Score compared to several DPM baselines. Our code and pre-trained
+checkpoints are available at \url{https://github.com/viettmab/SA-DPM}.
+
+
+
+
+
+
+
+ ☆ The Endoscapes Dataset for Surgical Scene Segmentation, Object
+ Detection, and Critical View of Safety Assessment: Official Splits and
+ Benchmark
+
+
+
+
+
+
+
+
+ Aditya Murali, Deepak Alapatt, Pietro Mascagni, Armine Vardazaryan, Alain Garcia, Nariaki Okamoto, Guido Costamagna, Didier Mutter, Jacques Marescaux, Bernard Dallemagne, Nicolas Padoy
+
+
+ This technical report provides a detailed overview of Endoscapes, a dataset
+of laparoscopic cholecystectomy (LC) videos with highly intricate annotations
+targeted at automated assessment of the Critical View of Safety (CVS).
+Endoscapes comprises 201 LC videos with frames annotated sparsely but regularly
+with segmentation masks, bounding boxes, and CVS assessment by three different
+clinical experts. Altogether, there are 11090 frames annotated with CVS and
+1933 frames annotated with tool and anatomy bounding boxes from the 201 videos,
+as well as an additional 422 frames from 50 of the 201 videos annotated with
+tool and anatomy segmentation masks. In this report, we provide detailed
+dataset statistics (size, class distribution, dataset splits, etc.) and a
+comprehensive performance benchmark for instance segmentation, object
+detection, and CVS prediction. The dataset and model checkpoints are publically
+available at https://github.com/CAMMA-public/Endoscapes.
+
+
+
+ comment: 7 pages; 3 figures
+
+
+
+
+
+
+ ☆ SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete
+ Diffusion Process NeurIPS 2023
+
+
+ In this paper, we explore a principal way to enhance the quality of object
+masks produced by different segmentation models. We propose a model-agnostic
+solution called SegRefiner, which offers a novel perspective on this problem by
+interpreting segmentation refinement as a data generation process. As a result,
+the refinement process can be smoothly implemented through a series of
+denoising diffusion steps. Specifically, SegRefiner takes coarse masks as
+inputs and refines them using a discrete diffusion process. By predicting the
+label and corresponding states-transition probabilities for each pixel,
+SegRefiner progressively refines the noisy masks in a conditional denoising
+manner. To assess the effectiveness of SegRefiner, we conduct comprehensive
+experiments on various segmentation tasks, including semantic segmentation,
+instance segmentation, and dichotomous image segmentation. The results
+demonstrate the superiority of our SegRefiner from multiple aspects. Firstly,
+it consistently improves both the segmentation metrics and boundary metrics
+across different types of coarse masks. Secondly, it outperforms previous
+model-agnostic refinement methods by a significant margin. Lastly, it exhibits
+a strong capability to capture extremely fine details when refining
+high-resolution images. The source code and trained models are available at
+https://github.com/MengyuWang826/SegRefiner.
+
+
+ The ability of large language models (LLMs) to process visual inputs has
+given rise to general-purpose vision systems, unifying various vision-language
+(VL) tasks by instruction tuning. However, due to the enormous diversity in
+input-output formats in the vision domain, existing general-purpose models fail
+to successfully integrate segmentation and multi-image inputs with coarse-level
+tasks into a single framework. In this work, we introduce VistaLLM, a powerful
+visual system that addresses coarse- and fine-grained VL tasks over single and
+multiple input images using a unified framework. VistaLLM utilizes an
+instruction-guided image tokenizer that filters global embeddings using task
+descriptions to extract compressed and refined features from numerous images.
+Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to
+represent binary segmentation masks as sequences, significantly improving over
+previously used uniform sampling. To bolster the desired capability of
+VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning
+dataset with 6.8M samples. We also address the lack of multi-image grounding
+datasets by introducing a novel task, AttCoSeg (Attribute-level
+Co-Segmentation), which boosts the model's reasoning and grounding capability
+over multiple input images. Extensive experiments on a wide range of V- and VL
+tasks demonstrate the effectiveness of VistaLLM by achieving consistent
+state-of-the-art performance over strong baselines across all downstream tasks.
+Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
+
+
+
+ comment: 24 pages including references and supplementary
+
+
+
+
+
+
+ ☆ Scene-Conditional 3D Object Stylization and Composition
+
+
+
+
+
+
+
+
+ Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht
+
+
+ Recently, 3D generative models have made impressive progress, enabling the
+generation of almost arbitrary 3D assets from text or image inputs. However,
+these approaches generate objects in isolation without any consideration for
+the scene where they will eventually be placed. In this paper, we propose a
+framework that allows for the stylization of an existing 3D asset to fit into a
+given 2D scene, and additionally produce a photorealistic composition as if the
+asset was placed within the environment. This not only opens up a new level of
+control for object stylization, for example, the same assets can be stylized to
+reflect changes in the environment, such as summer to winter or fantasy versus
+futuristic settings-but also makes the object-scene composition more
+controllable. We achieve this by combining modeling and optimizing the object's
+texture and environmental lighting through differentiable ray tracing with
+image priors from pre-trained text-to-image diffusion models. We demonstrate
+that our method is applicable to a wide variety of indoor and outdoor scenes
+and arbitrary objects.
+
+
+
+
+
+
+
+ ☆ LASA: Instance Reconstruction from Real Scans using A Large-scale
+ Aligned Shape Annotation Dataset
+
+
+
+
+
+
+
+
+ Haolin Liu, Chongjie Ye, Yinyu Nie, Yingfan He, Xiaoguang Han
+
+
+ Instance shape reconstruction from a 3D scene involves recovering the full
+geometries of multiple objects at the semantic instance level. Many methods
+leverage data-driven learning due to the intricacies of scene complexity and
+significant indoor occlusions. Training these methods often requires a
+large-scale, high-quality dataset with aligned and paired shape annotations
+with real-world scans. Existing datasets are either synthetic or misaligned,
+restricting the performance of data-driven methods on real data. To this end,
+we introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising
+10,412 high-quality CAD annotations aligned with 920 real-world scene scans
+from ArkitScenes, created manually by professional artists. On this top, we
+propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo)
+method. It is empowered by a hybrid feature aggregation design to fuse
+multi-modal inputs and recover high-fidelity object geometries. Besides, we
+present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate
+that our shape annotations provide scene occupancy clues that can further
+improve 3D object detection. Supported by LASA, extensive experiments show that
+our methods achieve state-of-the-art performance in both instance-level scene
+reconstruction and 3D object detection tasks.
+
+
+ The quality of the prompts provided to text-to-image diffusion models
+determines how faithful the generated content is to the user's intent, often
+requiring `prompt engineering'. To harness visual concepts from target images
+without prompt engineering, current approaches largely rely on embedding
+inversion by optimizing and then mapping them to pseudo-tokens. However,
+working with such high-dimensional vector representations is challenging
+because they lack semantics and interpretability, and only allow simple vector
+operations when using them. Instead, this work focuses on inverting the
+diffusion model to obtain interpretable language prompts directly. The
+challenge of doing this lies in the fact that the resulting optimization
+problem is fundamentally discrete and the space of prompts is exponentially
+large; this makes using standard optimization techniques, such as stochastic
+gradient descent, difficult. To this end, we utilize a delayed projection
+scheme to optimize for prompts representative of the vocabulary space in the
+model. Further, we leverage the findings that different timesteps of the
+diffusion process cater to different levels of detail in an image. The later,
+noisy, timesteps of the forward diffusion process correspond to the semantic
+information, and therefore, prompt inversion in this range provides tokens
+representative of the image semantics. We show that our approach can identify
+semantically interpretable and meaningful prompts for a target image which can
+be used to synthesize diverse images with similar content. We further
+illustrate the application of the optimized prompts in evolutionary image
+generation and concept removal.
+
+
+
+
+
+
+
+ ☆ Mixture of Cluster-conditional LoRA Experts for Vision-language
+ Instruction Tuning
+
+
+
+
+
+
+
+
+ Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
+
+
+ Instruction tuning of the Large Vision-language Models (LVLMs) has
+revolutionized the development of versatile models with zero-shot
+generalization across a wide range of downstream vision-language tasks.
+However, diversity of training tasks of different sources and formats would
+lead to inevitable task conflicts, where different tasks conflicts for the same
+set of model parameters, resulting in sub-optimal instruction-following
+abilities. To address that, we propose the Mixture of Cluster-conditional LoRA
+Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to
+activate the task-customized model parameters based on the instruction
+clusters. A separate universal expert is further incorporated to improve the
+generalization capabilities of MoCLE for novel instructions. Extensive
+experiments on 10 zero-shot tasks demonstrate the effectiveness of MoCLE.
+
+
+
+
+
+
+
+ ☆ CLIP-DINOiser: Teaching CLIP a few DINO tricks
+
+
+
+
+
+
+
+
+ Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez
+
+
+ The popular CLIP model displays impressive zero-shot capabilities thanks to
+its seamless interaction with arbitrary text prompts. However, its lack of
+spatial awareness makes it unsuitable for dense computer vision tasks, e.g.,
+semantic segmentation, without an additional fine-tuning step that often uses
+annotations and can potentially suppress its original open-vocabulary
+properties. Meanwhile, self-supervised representation methods have demonstrated
+good localization properties without human-made annotations nor explicit
+supervision. In this work, we take the best of both worlds and propose a
+zero-shot open-vocabulary semantic segmentation method, which does not require
+any annotations. We propose to locally improve dense MaskCLIP features,
+computed with a simple modification of CLIP's last pooling layer, by
+integrating localization priors extracted from self-supervised features. By
+doing so, we greatly improve the performance of MaskCLIP and produce smooth
+outputs. Moreover, we show that the used self-supervised feature properties can
+directly be learnt from CLIP features therefore allowing us to obtain the best
+results with a single pass through CLIP model. Our method CLIP-DINOiser needs
+only a single forward pass of CLIP and two light convolutional layers at
+inference, no extra supervision nor extra memory and reaches state-of-the-art
+results on challenging and fine-grained benchmarks such as COCO, Pascal
+Context, Cityscapes and ADE20k. The code to reproduce our results is available
+at https://github.com/wysoczanska/clip_dinoiser.
+
+
+ Semi-supervised action segmentation aims to perform frame-wise classification
+in long untrimmed videos, where only a fraction of videos in the training set
+have labels. Recent studies have shown the potential of contrastive learning in
+unsupervised representation learning using unlabelled data. However, learning
+the representation of each frame by unsupervised contrastive learning for
+action segmentation remains an open and challenging problem. In this paper, we
+propose a novel Semantic-guided Multi-level Contrast scheme with a
+Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise
+representations for semi-supervised action segmentation. Specifically, for
+representation learning, SMC is firstly used to explore intra- and
+inter-information variations in a unified and contrastive way, based on dynamic
+clustering process of the original input, encoded semantic and temporal
+features. Then, the NCA module, which is responsible for enforcing spatial
+consistency between neighbourhoods centered at different frames to alleviate
+over-segmentation issues, works alongside SMC for semi-supervised learning. Our
+SMC outperforms the other state-of-the-art methods on three benchmarks,
+offering improvements of up to 17.8% and 12.6% in terms of edit distance and
+accuracy, respectively. Additionally, the NCA unit results in significant
+better segmentation performance against the others in the presence of only 5%
+labelled videos. We also demonstrate the effectiveness of the proposed method
+on our Parkinson's Disease Mouse Behaviour (PDMB) dataset. The code and
+datasets will be made publicly available.
+
+
+
+
+
+
+
+ ☆ Scalable Geometric Fracture Assembly via Co-creation Space among
+ Assemblers
+
+
+ Geometric fracture assembly presents a challenging practical task in
+archaeology and 3D computer vision. Previous methods have focused solely on
+assembling fragments based on semantic information, which has limited the
+quantity of objects that can be effectively assembled. Therefore, there is a
+need to develop a scalable framework for geometric fracture assembly without
+relying on semantic information. To improve the effectiveness of assembling
+geometric fractures without semantic information, we propose a co-creation
+space comprising several assemblers capable of gradually and unambiguously
+assembling fractures. Additionally, we introduce a novel loss function, i.e.,
+the geometric-based collision loss, to address collision issues during the
+fracture assembly process and enhance the results. Our framework exhibits
+better performance on both PartNet and Breaking Bad datasets compared to
+existing state-of-the-art frameworks. Extensive experiments and quantitative
+comparisons demonstrate the effectiveness of our proposed framework, which
+features linear computational complexity, enhanced abstraction, and improved
+generalization. Our code is publicly available at
+https://github.com/Ruiyuan-Zhang/CCS.
+
+
+
+
+
+
+
+ ☆ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable
+ Generalizable 3D Reconstruction
+
+
+
+
+
+
+
+
+ David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
+
+
+ We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D
+radiance fields parameterized by 3D Gaussian primitives from pairs of images.
+Our model features real-time and memory-efficient rendering for scalable
+training as well as fast 3D reconstruction at inference time. To overcome local
+minima inherent to sparse and locally supported representations, we predict a
+dense probability distribution over 3D and sample Gaussian means from that
+probability distribution. We make this sampling operation differentiable via a
+reparameterization trick, allowing us to back-propagate gradients through the
+Gaussian splatting representation. We benchmark our method on wide-baseline
+novel view synthesis on the real-world RealEstate10k and ACID datasets, where
+we outperform state-of-the-art light field transformers and accelerate
+rendering by 2.5 orders of magnitude while reconstructing an interpretable and
+editable 3D radiance field.
+
+
+ This study investigates the application of single and two-stage 2D-object
+detection algorithms like You Only Look Once (YOLO), Real-Time DEtection
+TRansformer (RT-DETR) algorithm for automated object detection to enhance road
+safety for autonomous driving on Austrian roads. The YOLO algorithm is a
+state-of-the-art real-time object detection system known for its efficiency and
+accuracy. In the context of driving, its potential to rapidly identify and
+track objects is crucial for advanced driver assistance systems (ADAS) and
+autonomous vehicles. The research focuses on the unique challenges posed by the
+road conditions and traffic scenarios in Austria. The country's diverse
+landscape, varying weather conditions, and specific traffic regulations
+necessitate a tailored approach for reliable object detection. The study
+utilizes a selective dataset comprising images and videos captured on Austrian
+roads, encompassing urban, rural, and alpine environments.
+
+
+
+ comment: draft
+
+
+
+
+
+
+ ☆ Intrinsic Image Diffusion for Single-view Material Estimation
+
+
+
+
+
+
+
+
+ Peter Kocsis, Vincent Sitzmann, Matthias Nießner
+
+
+ We present Intrinsic Image Diffusion, a generative model for appearance
+decomposition of indoor scenes. Given a single input view, we sample multiple
+possible material explanations represented as albedo, roughness, and metallic
+maps. Appearance decomposition poses a considerable challenge in computer
+vision due to the inherent ambiguity between lighting and material properties
+and the lack of real datasets. To address this issue, we advocate for a
+probabilistic formulation, where instead of attempting to directly predict the
+true material properties, we employ a conditional generative model to sample
+from the solution space. Furthermore, we show that utilizing the strong learned
+prior of recent diffusion models trained on large-scale real-world images can
+be adapted to material estimation and highly improves the generalization to
+real images. Our method produces significantly sharper, more consistent, and
+more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on
+PSNR and by $45\%$ better FID score on albedo prediction. We demonstrate the
+effectiveness of our approach through experiments on both synthetic and
+real-world datasets.
+
+
+
+
+
+
+
+
+ Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo, Xinxing Xu, Rick Siow Mong Goh, Yong Liu
+
+
+ Albeit progress has been made in Composed Image Retrieval (CIR), we
+empirically find that a certain percentage of failure retrieval results are not
+consistent with their relative captions. To address this issue, this work
+provides a Visual Question Answering (VQA) perspective to boost the performance
+of CIR. The resulting VQA4CIR is a post-processing approach and can be directly
+plugged into existing CIR methods. Given the top-C retrieved images by a CIR
+method, VQA4CIR aims to decrease the adverse effect of the failure retrieval
+results being inconsistent with the relative caption. To find the retrieved
+images inconsistent with the relative caption, we resort to the "QA generation
+to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning
+LLM (e.g., LLaMA) to generate several pairs of questions and answers from each
+relative caption. We then fine-tune LVLM (e.g., LLaVA) to obtain the VQA model.
+By feeding the retrieved image and question to the VQA model, one can find the
+images inconsistent with relative caption when the answer by VQA is
+inconsistent with the answer in the QA pair. Consequently, the CIR performance
+can be boosted by modifying the ranks of inconsistently retrieved images.
+Experimental results show that our proposed method outperforms state-of-the-art
+CIR methods on the CIRR and Fashion-IQ datasets.
+
+
+ Federated learning with noisy labels (F-LNL) aims at seeking an optimal
+server model via collaborative distributed learning by aggregating multiple
+client models trained with local noisy or clean samples. On the basis of a
+federated learning framework, recent advances primarily adopt label noise
+filtering to separate clean samples from noisy ones on each client, thereby
+mitigating the negative impact of label noise. However, these prior methods do
+not learn noise filters by exploiting knowledge across all clients, leading to
+sub-optimal and inferior noise filtering performance and thus damaging training
+stability. In this paper, we present FedDiv to tackle the challenges of F-LNL.
+Specifically, we propose a global noise filter called Federated Noise Filter
+for effectively identifying samples with noisy labels on every client, thereby
+raising stability during local training sessions. Without sacrificing data
+privacy, this is achieved by modeling the global distribution of label noise
+across all clients. Then, in an effort to make the global model achieve higher
+performance, we introduce a Predictive Consistency based Sampler to identify
+more credible local data for local model training, thus preventing noise
+memorization and further boosting the training stability. Extensive experiments
+on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that \texttt{FedDiv}
+achieves superior performance over state-of-the-art F-LNL methods under
+different label noise settings for both IID and non-IID data partitions. Source
+code is publicly available at https://github.com/lijichang/FLNL-FedDiv.
+
+
+
+ comment: To appear in AAAI-2024
+
+
+
+
+
+
+ ☆ ST(OR)2: Spatio-Temporal Object Level Reasoning for Activity Recognition
+ in the Operating Room
+
+
+
+
+
+
+
+
+ Idris Hamoud, Muhammad Abdullah Jamal, Vinkle Srivastav, Didier Mutter, Nicolas Padoy, Omid Mohareri
+
+
+ Surgical robotics holds much promise for improving patient safety and
+clinician experience in the Operating Room (OR). However, it also comes with
+new challenges, requiring strong team coordination and effective OR management.
+Automatic detection of surgical activities is a key requirement for developing
+AI-based intelligent tools to tackle these challenges. The current
+state-of-the-art surgical activity recognition methods however operate on
+image-based representations and depend on large-scale labeled datasets whose
+collection is time-consuming and resource-expensive. This work proposes a new
+sample-efficient and object-based approach for surgical activity recognition in
+the OR. Our method focuses on the geometric arrangements between clinicians and
+surgical devices, thus utilizing the significant object interaction dynamics in
+the OR. We conduct experiments in a low-data regime study for long video
+activity recognition. We also benchmark our method againstother object-centric
+approaches on clip-level action classification and show superior performance.
+
+
+
+
+
+
+
+ ☆ MDD-UNet: Domain Adaptation for Medical Image Segmentation with
+ Theoretical Guarantees, a Proof of Concept
+
+
+ The current state-of-the art techniques for image segmentation are often
+based on U-Net architectures, a U-shaped encoder-decoder networks with skip
+connections. Despite the powerful performance, the architecture often does not
+perform well when used on data which has different characteristics than the
+data it was trained on. Many techniques for improving performance in the
+presence of domain shift have been developed, however typically only have loose
+connections to the theory of domain adaption. In this work, we propose an
+unsupervised domain adaptation framework for U-Nets with theoretical guarantees
+based on the Margin Disparity Discrepancy [1] called the MDD-UNet. We evaluate
+the proposed technique on the task of hippocampus segmentation, and find that
+the MDD-UNet is able to learn features which are domain-invariant with no
+knowledge about the labels in the target domain. The MDD-UNet improves
+performance over the standard U-Net on 11 out of 12 combinations of datasets.
+This work serves as a proof of concept by demonstrating an improvement on the
+U-Net in it's standard form without modern enhancements, which opens up a new
+avenue of studying domain adaptation for models with very large hypothesis
+spaces from both methodological and practical perspectives. Code is available
+at https://github.com/asbjrnmunk/mdd-unet.
+
+
+
+ comment: Published at NLDL 2024
+
+
+
+
+
+
+ ☆ GeomVerse: A Systematic Evaluation of Large Models for Geometric
+ Reasoning
+
+
+
+
+
+
+
+
+ Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, Radu Soricut
+
+
+ Large language models have shown impressive results for multi-hop
+mathematical reasoning when the input question is only textual. Many
+mathematical reasoning problems, however, contain both text and image. With the
+ever-increasing adoption of vision language models (VLMs), understanding their
+reasoning abilities for such problems is crucial. In this paper, we evaluate
+the reasoning capabilities of VLMs along various axes through the lens of
+geometry problems. We procedurally create a synthetic dataset of geometry
+questions with controllable difficulty levels along multiple axes, thus
+enabling a systematic evaluation. The empirical results obtained using our
+benchmark for state-of-the-art VLMs indicate that these models are not as
+capable in subjects like geometry (and, by generalization, other topics
+requiring similar reasoning) as suggested by previous benchmarks. This is made
+especially clear by the construction of our benchmark at various depth levels,
+since solving higher-depth problems requires long chains of reasoning rather
+than additional memorized knowledge. We release the dataset for further
+research in this area.
+
+
+
+
+
+
+
+ ☆ Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model AAAI 2024
+
+
+ Recently, diffusion-based image generation methods are credited for their
+remarkable text-to-image generation capabilities, while still facing challenges
+in accurately generating multilingual scene text images. To tackle this
+problem, we propose Diff-Text, which is a training-free scene text generation
+framework for any language. Our model outputs a photo-realistic image given a
+text of any language along with a textual description of a scene. The model
+leverages rendered sketch images as priors, thus arousing the potential
+multilingual-generation ability of the pre-trained Stable Diffusion. Based on
+the observation from the influence of the cross-attention map on object
+placement in generated images, we propose a localized attention constraint into
+the cross-attention layer to address the unreasonable positioning problem of
+scene text. Additionally, we introduce contrastive image-level prompts to
+further refine the position of the textual region and achieve more accurate
+scene text generation. Experiments demonstrate that our method outperforms the
+existing method in both the accuracy of text recognition and the naturalness of
+foreground-background blending.
+
+
+
+ comment: Accepted to AAAI 2024. Code:
+ https://github.com/ecnuljzhang/brush-your-text
+
+ We introduce HuTuMotion, an innovative approach for generating natural human
+motions that navigates latent motion diffusion models by leveraging few-shot
+human feedback. Unlike existing approaches that sample latent variables from a
+standard normal prior distribution, our method adapts the prior distribution to
+better suit the characteristics of the data, as indicated by human feedback,
+thus enhancing the quality of motion generation. Furthermore, our findings
+reveal that utilizing few-shot feedback can yield performance levels on par
+with those attained through extensive human feedback. This discovery emphasizes
+the potential and efficiency of incorporating few-shot human-guided
+optimization within latent diffusion models for personalized and style-aware
+human motion generation applications. The experimental results show the
+significantly superior performance of our method over existing state-of-the-art
+approaches.
+
+
+
+ comment: Accepted by AAAI 2024 Main Track
+
+
+
+
+
+
+ ☆ Self-Supervised Detection of Perfect and Partial Input-Dependent
+ Symmetries
+
+
+ Group equivariance ensures consistent responses to group transformations of
+the input, leading to more robust models and enhanced generalization
+capabilities. However, this property can lead to overly constrained models if
+the symmetries considered in the group differ from those observed in data.
+While common methods address this by determining the appropriate level of
+symmetry at the dataset level, they are limited to supervised settings and
+ignore scenarios in which multiple levels of symmetry co-exist in the same
+dataset. For instance, pictures of cars and planes exhibit different levels of
+rotation, yet both are included in the CIFAR-10 dataset. In this paper, we
+propose a method able to detect the level of symmetry of each input without the
+need for labels. To this end, we derive a sufficient and necessary condition to
+learn the distribution of symmetries in the data. Using the learned
+distribution, we generate pseudo-labels that allow us to learn the levels of
+symmetry of each input in a self-supervised manner. We validate the
+effectiveness of our approach on synthetic datasets with different per-class
+levels of symmetries e.g. MNISTMultiple, in which digits are uniformly rotated
+within a class-dependent interval. We demonstrate that our method can be used
+for practical applications such as the generation of standardized datasets in
+which the symmetries are not present, as well as the detection of
+out-of-distribution symmetries during inference. By doing so, both the
+generalization and robustness of non-equivariant models can be improved. Our
+code is publicly available at https://github.com/aurban0/ssl-sym.
+
+
+ Earth vision research typically focuses on extracting geospatial object
+locations and categories but neglects the exploration of relations between
+objects and comprehensive reasoning. Based on city planning needs, we develop a
+multi-modal multi-task VQA dataset (EarthVQA) to advance relational
+reasoning-based judging, counting, and comprehensive analysis. The EarthVQA
+dataset contains 6000 images, corresponding semantic masks, and 208,593 QA
+pairs with urban and rural governance requirements embedded. As objects are the
+basis for complex relational reasoning, we propose a Semantic OBject Awareness
+framework (SOBA) to advance VQA in an object-centric way. To preserve refined
+spatial locations and semantics, SOBA leverages a segmentation network for
+object semantics generation. The object-guided attention aggregates object
+interior features via pseudo masks, and bidirectional cross-attention further
+models object external relations hierarchically. To optimize object counting,
+we propose a numerical difference loss that dynamically adds difference
+penalties, unifying the classification and regression tasks. Experimental
+results show that SOBA outperforms both advanced general and remote sensing
+methods. We believe this dataset and framework provide a strong benchmark for
+Earth vision's complex analysis. The project page is at
+https://Junjue-Wang.github.io/homepage/EarthVQA.
+
+
+ Referring Image Segmentation (RIS) is a challenging task that requires an
+algorithm to segment objects referred by free-form language expressions.
+Despite significant progress in recent years, most state-of-the-art (SOTA)
+methods still suffer from considerable language-image modality gap at the pixel
+and word level. These methods generally 1) rely on sentence-level language
+features for language-image alignment and 2) lack explicit training supervision
+for fine-grained visual grounding. Consequently, they exhibit weak object-level
+correspondence between visual and language features. Without well-grounded
+features, prior methods struggle to understand complex expressions that require
+strong reasoning over relationships among multiple objects, especially when
+dealing with rarely used or ambiguous clauses. To tackle this challenge, we
+introduce a novel Mask Grounding auxiliary task that significantly improves
+visual grounding within language features, by explicitly teaching the model to
+learn fine-grained correspondence between masked textual tokens and their
+matching visual objects. Mask Grounding can be directly used on prior RIS
+methods and consistently bring improvements. Furthermore, to holistically
+address the modality gap, we also design a cross-modal alignment loss and an
+accompanying alignment module. These additions work synergistically with Mask
+Grounding. With all these techniques, our comprehensive approach culminates in
+MagNet Mask-grounded Network), an architecture that significantly outperforms
+prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating
+our method's effectiveness in addressing current limitations of RIS algorithms.
+Our code and pre-trained weights will be released.
+
+
+
+
+
+
+
+ ☆ Teeth Localization and Lesion Segmentation in CBCT Images using
+ SpatialConfiguration-Net and U-Net
+
+
+
+
+
+
+
+
+ Arnela Hadzic, Barbara Kirnbauer, Darko Stern, Martin Urschler
+
+
+ The localization of teeth and segmentation of periapical lesions in cone-beam
+computed tomography (CBCT) images are crucial tasks for clinical diagnosis and
+treatment planning, which are often time-consuming and require a high level of
+expertise. However, automating these tasks is challenging due to variations in
+shape, size, and orientation of lesions, as well as similar topologies among
+teeth. Moreover, the small volumes occupied by lesions in CBCT images pose a
+class imbalance problem that needs to be addressed. In this study, we propose a
+deep learning-based method utilizing two convolutional neural networks: the
+SpatialConfiguration-Net (SCN) and a modified version of the U-Net. The SCN
+accurately predicts the coordinates of all teeth present in an image, enabling
+precise cropping of teeth volumes that are then fed into the U-Net which
+detects lesions via segmentation. To address class imbalance, we compare the
+performance of three reweighting loss functions. After evaluation on 144 CBCT
+images, our method achieves a 97.3% accuracy for teeth localization, along with
+a promising sensitivity and specificity of 0.97 and 0.88, respectively, for
+subsequent lesion detection.
+
+
+
+
+
+
+
+ ☆ All for One, and One for All: UrbanSyn Dataset, the third Musketeer of
+ Synthetic Driving Scenes
+
+
+
+
+
+
+
+
+ Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A. Iglesias-Guitian, Antonio M. López
+
+
+ We introduce UrbanSyn, a photorealistic dataset acquired through
+semi-procedurally generated synthetic urban driving scenarios. Developed using
+high-quality geometry and materials, UrbanSyn provides pixel-level ground
+truth, including depth, semantic segmentation, and instance segmentation with
+object bounding boxes and occlusion degree. It complements GTAV and Synscapes
+datasets to form what we coin as the 'Three Musketeers'. We demonstrate the
+value of the Three Musketeers in unsupervised domain adaptation for image
+semantic segmentation. Results on real-world datasets, Cityscapes, Mapillary
+Vistas, and BDD100K, establish new benchmarks, largely attributed to UrbanSyn.
+We make UrbanSyn openly and freely accessible (www.urbansyn.org).
+
+
+
+ comment: The UrbanSyn Dataset is available in http://urbansyn.org/
+
+
+
+
+
+
+ ☆ Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video
+ Moment Retrieval AAAI 2024
+
+
+ Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed
+videos corresponding to a given language query by constructing cross-modal
+alignment strategies. However, these existing strategies are often sub-optimal
+since they ignore the modality imbalance problem, \textit{i.e.}, the semantic
+richness inherent in videos far exceeds that of a given limited-length
+sentence. Therefore, in pursuit of better alignment, a natural idea is
+enhancing the video modality to filter out query-irrelevant semantics, and
+enhancing the text modality to capture more segment-relevant knowledge. In this
+paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework
+for more balanced alignment through enhancing features at two levels. First, we
+enhance the video modality at the frame-word level through word reconstruction.
+This strategy emphasizes the portions associated with query words in
+frame-level features while suppressing irrelevant parts. Therefore, the
+enhanced video contains less redundant semantics and is more balanced with the
+textual modality. Second, we enhance the textual modality at the
+segment-sentence level by learning complementary knowledge from context
+sentences and ground-truth segments. With the knowledge added to the query, the
+textual modality thus maintains more meaningful semantics and is more balanced
+with the video modality. By implementing two levels of MESM, the semantic
+information from both modalities is more balanced to align, thereby bridging
+the modality gap. Experiments on three widely used benchmarks, including the
+out-of-distribution settings, show that the proposed framework achieves a new
+start-of-the-art performance with notable generalization ability (e.g., 4.42%
+and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code
+will be available at https://github.com/lntzm/MESM.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ SoftCTM: Cell detection by soft instance segmentation and consideration
+ of cell-tissue interaction
+
+
+
+
+
+
+
+
+ Lydia A. Schoenpflug, Viktor H. Koelzer
+
+
+ Detecting and classifying cells in histopathology H\&E stained whole-slide
+images is a core task in computational pathology, as it provides valuable
+insight into the tumor microenvironment. In this work we investigate the impact
+of ground truth formats on the models performance. Additionally, cell-tissue
+interactions are considered by providing tissue segmentation predictions as
+input to the cell detection model. We find that a "soft", probability-map
+instance segmentation ground truth leads to best model performance. Combined
+with cell-tissue interaction and test-time augmentation our Soft
+Cell-Tissue-Model (SoftCTM) achieves 0.7172 mean F1-Score on the Overlapped
+Cell On Tissue (OCELOT) test set, achieving the third best overall score in the
+OCELOT 2023 Challenge. The source code for our approach is made publicly
+available at https://github.com/lely475/ocelot23algo.
+
+
+ 3D perception is a critical problem in autonomous driving. Recently, the
+Bird-Eye-View (BEV) approach has attracted extensive attention, due to low-cost
+deployment and desirable vision detection capacity. However, the existing
+models ignore a realistic scenario during the driving procedure, i.e., one or
+more view cameras may be failed, which largely deteriorates the performance. To
+tackle this problem, we propose a generic Masked BEV (M-BEV) perception
+framework, which can effectively improve robustness to this challenging
+scenario, by random masking and reconstructing camera views in the end-to-end
+training. More specifically, we develop a novel Masked View Reconstruction
+(MVR) module for M-BEV. It mimics various missing cases by randomly masking
+features of different camera views, then leverages the original features of
+these views as self-supervision, and reconstructs the masked ones with the
+distinct spatio-temporal context across views. Via such a plug-and-play MVR,
+our M-BEV is capable of learning the missing views from the resting ones, and
+thus well generalized for robust view recovery and accurate perception in the
+testing. We perform extensive experiments on the popular NuScenes benchmark,
+where our framework can significantly boost 3D perception performance of the
+state-of-the-art models on various missing view cases, e.g., for the absence of
+back view, our M-BEV promotes the PETRv2 model with 10.3% mAP gain.
+
+
+ In this paper, we propose an novel methodology aimed at simulating the
+learning phenomenon of nystagmus through the application of differential
+blurring on datasets. Nystagmus is a biological phenomenon that influences
+human vision throughout life, notably by diminishing head shake from infancy to
+adulthood. Leveraging this concept, we address the issue of waste
+classification, a pressing global concern. The proposed framework comprises two
+modules, with the second module closely resembling the original Vision
+Transformer, a state of the art model model in classification tasks. The
+primary motivation behind our approach is to enhance the model's precision and
+adaptability, mirroring the real world conditions that the human visual system
+undergoes. This novel methodology surpasses the standard Vision Transformer
+model in waste classification tasks, exhibiting an improvement with a margin of
+2%. This improvement underscores the potential of our methodology in improving
+model precision by drawing inspiration from human vision perception. Further
+research in the proposed methodology could yield greater performance results,
+and can extrapolated to other global tasks.
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ☆ FontDiffuser: One-Shot Font Generation via Denoising Diffusion with
+ Multi-Scale Content Aggregation and Style Contrastive Learning AAAI 2024
+
+
+
+
+
+
+
+
+ Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, Lianwen Jin
+
+
+ Automatic font generation is an imitation task, which aims to create a font
+library that mimics the style of reference images while preserving the content
+from source images. Although existing font generation methods have achieved
+satisfactory performance, they still struggle with complex characters and large
+style variations. To address these issues, we propose FontDiffuser, a
+diffusion-based image-to-image one-shot font generation method, which
+innovatively models the font imitation task as a noise-to-denoise paradigm. In
+our method, we introduce a Multi-scale Content Aggregation (MCA) block, which
+effectively combines global and local content cues across different scales,
+leading to enhanced preservation of intricate strokes of complex characters.
+Moreover, to better manage the large variations in style transfer, we propose a
+Style Contrastive Refinement (SCR) module, which is a novel structure for style
+representation learning. It utilizes a style extractor to disentangle styles
+from images, subsequently supervising the diffusion model via a meticulously
+designed style contrastive loss. Extensive experiments demonstrate
+FontDiffuser's state-of-the-art performance in generating diverse characters
+and styles. It consistently excels on complex characters and large style
+changes compared to previous methods. The code is available at
+https://github.com/yeungchenwa/FontDiffuser.
+
+
+ In the era of digital medicine, medical imaging serves as a widespread
+technique for early disease detection, with a substantial volume of images
+being generated and stored daily in electronic patient records. X-ray
+angiography imaging is a standard and one of the most common methods for
+rapidly diagnosing coronary artery diseases. The notable achievements of recent
+deep learning algorithms align with the increased use of electronic health
+records and diagnostic imaging. Deep neural networks, leveraging abundant data,
+advanced algorithms, and powerful computational capabilities, prove highly
+effective in the analysis and interpretation of images. In this context, Object
+detection methods have become a promising approach, particularly through
+convolutional neural networks (CNN), streamlining medical image analysis by
+eliminating manual feature extraction. This allows for direct feature
+extraction from images, ensuring high accuracy in results. Therefore, in our
+paper, we utilized the object detection method on X-ray angiography images to
+precisely identify the location of coronary artery stenosis. As a result, this
+model enables automatic and real-time detection of stenosis locations,
+assisting in the crucial and sensitive decision-making process for healthcare
+professionals.
+
+
+ Single-domain generalization (S-DG) aims to generalize a model to unseen
+environments with a single-source domain. However, most S-DG approaches have
+been conducted in the field of classification. When these approaches are
+applied to object detection, the semantic features of some objects can be
+damaged, which can lead to imprecise object localization and misclassification.
+To address these problems, we propose an object-aware domain generalization
+(OA-DG) method for single-domain generalization in object detection. Our method
+consists of data augmentation and training strategy, which are called OA-Mix
+and OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level
+transformation and object-aware mixing strategy. OA-Loss enables models to
+learn domain-invariant representations for objects and backgrounds from the
+original and OA-Mixed images. Our proposed method outperforms state-of-the-art
+works on standard benchmarks. Our code is available at
+https://github.com/WoojuLee24/OA-DG.
+
+
+
+ comment: Accepted by AAAI-24. The first two authors contributed equally
+
+
+
+
+
+
+ ☆ ZS-SRT: An Efficient Zero-Shot Super-Resolution Training Method for
+ Neural Radiance Fields
+
+
+ Neural Radiance Fields (NeRF) have achieved great success in the task of
+synthesizing novel views that preserve the same resolution as the training
+views. However, it is challenging for NeRF to synthesize high-quality
+high-resolution novel views with low-resolution training data. To solve this
+problem, we propose a zero-shot super-resolution training framework for NeRF.
+This framework aims to guide the NeRF model to synthesize high-resolution novel
+views via single-scene internal learning rather than requiring any external
+high-resolution training data. Our approach consists of two stages. First, we
+learn a scene-specific degradation mapping by performing internal learning on a
+pretrained low-resolution coarse NeRF. Second, we optimize a super-resolution
+fine NeRF by conducting inverse rendering with our mapping function so as to
+backpropagate the gradients from low-resolution 2D space into the
+super-resolution 3D sampling space. Then, we further introduce a temporal
+ensemble strategy in the inference phase to compensate for the scene estimation
+errors. Our method is featured on two points: (1) it does not consume
+high-resolution views or additional scene data to train super-resolution NeRF;
+(2) it can speed up the training process by adopting a coarse-to-fine strategy.
+By conducting extensive experiments on public datasets, we have qualitatively
+and quantitatively demonstrated the effectiveness of our method.
+
+
+
+
+
+
+
+ ☆ I-CEE: Tailoring Explanations of Image Classifications Models to User
+ Expertise
+
+
+ Effectively explaining decisions of black-box machine learning models is
+critical to responsible deployment of AI systems that rely on them. Recognizing
+their importance, the field of explainable AI (XAI) provides several techniques
+to generate these explanations. Yet, there is relatively little emphasis on the
+user (the explainee) in this growing body of work and most XAI techniques
+generate "one-size-fits-all" explanations. To bridge this gap and achieve a
+step closer towards human-centered XAI, we present I-CEE, a framework that
+provides Image Classification Explanations tailored to User Expertise. Informed
+by existing work, I-CEE explains the decisions of image classification models
+by providing the user with an informative subset of training data (i.e.,
+example images), corresponding local explanations, and model decisions.
+However, unlike prior work, I-CEE models the informativeness of the example
+images to depend on user expertise, resulting in different examples for
+different users. We posit that by tailoring the example set to user expertise,
+I-CEE can better facilitate users' understanding and simulatability of the
+model. To evaluate our approach, we conduct detailed experiments in both
+simulation and with human participants (N = 100) on multiple datasets.
+Experiments with simulated users show that I-CEE improves users' ability to
+accurately predict the model's decisions (simulatability) compared to
+baselines, providing promising preliminary results. Experiments with human
+participants demonstrate that our method significantly improves user
+simulatability accuracy, highlighting the importance of human-centered XAI
+
+
+
+
+
+
+
+ ☆ Domain Generalization in LiDAR Semantic Segmentation Leveraged by
+ Density Discriminative Feature Embedding
+
+
+
+
+
+
+
+
+ Jaeyeul Kim, Jungwan Woo, Jeonghoon Kim, Sunghoon Im
+
+
+ While significant progress has been achieved in LiDAR-based perception,
+domain generalization continues to present challenges, often resulting in
+reduced performance when encountering unfamiliar datasets due to domain
+discrepancies. One of the primary hurdles stems from the variability of LiDAR
+sensors, leading to inconsistencies in point cloud density distribution. Such
+inconsistencies can undermine the effectiveness of perception models. We
+address this challenge by introducing a new approach that acknowledges a
+fundamental characteristic of LiDAR: the variation in point density due to the
+distance from the LiDAR to the scene, and the number of beams relative to the
+field of view. Understanding this, we view each LiDAR's point cloud at various
+distances as having distinct density distributions, which can be consistent
+across different LiDAR models. With this insight, we propose the Density
+Discriminative Feature Embedding (DDFE) module, crafted to specifically extract
+features related to density while ensuring domain invariance across different
+LiDAR sensors. In addition, we introduce a straightforward but effective
+density augmentation technique, designed to broaden the density spectrum and
+enhance the capabilities of the DDFE. The proposed DDFE stands out as a
+versatile and lightweight domain generalization module. It can be seamlessly
+integrated into various 3D backbone networks, consistently outperforming
+existing state-of-the-art domain generalization approaches. We commit to
+releasing the source code publicly to foster community collaboration and
+advancement.
+
+
+ Reconstructing a dynamic human with loose clothing is an important but
+difficult task. To address this challenge, we propose a method named DLCA-Recon
+to create human avatars from monocular videos. The distance from loose clothing
+to the underlying body rapidly changes in every frame when the human freely
+moves and acts. Previous methods lack effective geometric initialization and
+constraints for guiding the optimization of deformation to explain this
+dramatic change, resulting in the discontinuous and incomplete reconstruction
+surface. To model the deformation more accurately, we propose to initialize an
+estimated 3D clothed human in the canonical space, as it is easier for
+deformation fields to learn from the clothed human than from SMPL. With both
+representations of explicit mesh and implicit SDF, we utilize the physical
+connection information between consecutive frames and propose a dynamic
+deformation field (DDF) to optimize deformation fields. DDF accounts for
+contributive forces on loose clothing to enhance the interpretability of
+deformations and effectively capture the free movement of loose clothing.
+Moreover, we propagate SMPL skinning weights to each individual and refine pose
+and skinning weights during the optimization to improve skinning
+transformation. Based on more reasonable initialization and DDF, we can
+simulate real-world physics more accurately. Extensive experiments on public
+and our own datasets validate that our method can produce superior results for
+humans with loose clothing compared to the SOTA methods.
+
+
+
+
+
+
+
+ ☆ GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion
+ Prediction
+
+
+
+
+
+
+
+
+ Haodong Yan, Zhiming Hu, Syn Schmitt, Andreas Bulling
+
+
+ Human motion prediction is important for virtual reality (VR) applications,
+e.g., for realistic avatar animation. Existing methods have synthesised body
+motion only from observed past motion, despite the fact that human gaze is
+known to correlate strongly with body movements and is readily available in
+recent VR headsets. We present GazeMoDiff -- a novel gaze-guided denoising
+diffusion model to generate stochastic human motions. Our method first uses a
+graph attention network to learn the spatio-temporal correlations between eye
+gaze and human movements and to fuse them into cross-modal gaze-motion
+features. These cross-modal features are injected into a noise prediction
+network via a cross-attention mechanism and progressively denoised to generate
+realistic human full-body motions. Experimental results on the MoGaze and GIMO
+datasets demonstrate that our method outperforms the state-of-the-art methods
+by a large margin in terms of average displacement error (15.03% on MoGaze and
+9.20% on GIMO). We further conducted an online user study to compare our method
+with state-of-the-art methods and the responses from 23 participants validate
+that the motions generated by our method are more realistic than those from
+other methods. Taken together, our work makes a first important step towards
+gaze-guided stochastic human motion prediction and guides future work on this
+important topic in VR research.
+
+
+
+
+
+
+
+ ☆ Learning Subject-Aware Cropping by Outpainting Professional Photos AAAI 24
+
+
+
+
+
+
+
+
+ James Hong, Lu Yuan, Michaël Gharbi, Matthew Fisher, Kayvon Fatahalian
+
+
+ How to frame (or crop) a photo often depends on the image subject and its
+context; e.g., a human portrait. Recent works have defined the subject-aware
+image cropping task as a nuanced and practical version of image cropping. We
+propose a weakly-supervised approach (GenCrop) to learn what makes a
+high-quality, subject-aware crop from professional stock images. Unlike
+supervised prior work, GenCrop requires no new manual annotations beyond the
+existing stock image collection. The key challenge in learning from this data,
+however, is that the images are already cropped and we do not know what regions
+were removed. Our insight is combine a library of stock images with a modern,
+pre-trained text-to-image diffusion model. The stock image collection provides
+diversity and its images serve as pseudo-labels for a good crop, while the
+text-image diffusion model is used to out-paint (i.e., outward inpainting)
+realistic uncropped images. Using this procedure, we are able to automatically
+generate a large dataset of cropped-uncropped training pairs to train a
+cropping model. Despite being weakly-supervised, GenCrop is competitive with
+state-of-the-art supervised methods and significantly better than comparable
+weakly-supervised baselines on quantitative and qualitative evaluation metrics.
+
+
+
+ comment: AAAI 24. Extended version with supplemental materials
+
+
+
+
+
+
+ ☆ PICNN: A Pathway towards Interpretable Convolutional Neural Networks
+
+
+ Convolutional Neural Networks (CNNs) have exhibited great performance in
+discriminative feature learning for complex visual tasks. Besides
+discrimination power, interpretability is another important yet under-explored
+property for CNNs. One difficulty in the CNN interpretability is that filters
+and image classes are entangled. In this paper, we introduce a novel pathway to
+alleviate the entanglement between filters and image classes. The proposed
+pathway groups the filters in a late conv-layer of CNN into class-specific
+clusters. Clusters and classes are in a one-to-one relationship. Specifically,
+we use the Bernoulli sampling to generate the filter-cluster assignment matrix
+from a learnable filter-class correspondence matrix. To enable end-to-end
+optimization, we develop a novel reparameterization trick for handling the
+non-differentiable Bernoulli sampling. We evaluate the effectiveness of our
+method on ten widely used network architectures (including nine CNNs and a ViT)
+and five benchmark datasets. Experimental results have demonstrated that our
+method PICNN (the combination of standard CNNs with our proposed pathway)
+exhibits greater interpretability than standard CNNs while achieving higher or
+comparable discrimination power.
+
+
+
+
+
+
+
+ ☆ MPI Planar Correction of Pulse Based ToF Cameras
+
+
+ Time-of-Flight (ToF) cameras are becoming popular in a wide span of areas
+ranging from consumer-grade electronic devices to safety-critical industrial
+robots. This is mainly due to their high frame rate, relative good precision
+and the lowered costs. Although ToF cameras are in continuous development,
+especially pulse-based variants, they still face different problems, including
+spurious noise over the points or multipath inference (MPI). The latter can
+cause deformed surfaces to manifest themselves on curved surfaces instead of
+planar ones, making standard spatial data preprocessing, such as plane
+extraction, difficult. In this paper, we focus on the MPI reduction problem
+using Feature Pyramid Networks (FPN) which allow the mitigation of this type of
+artifact for pulse-based ToF cameras. With our end-to-end network, we managed
+to attenuate the MPI effect on planar surfaces using a learning-based method on
+real ToF data. Both the custom dataset used for our model training as well as
+the code is available on the author's Github homepage.
+
+
+
+
+
+
+
+ ☆ Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body
+ Poses using an Eye-body Coordination Model
+
+
+
+
+
+
+
+
+ Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling
+
+
+ While generating realistic body movements, e.g., for avatars in virtual
+reality, is widely studied in computer vision and graphics, the generation of
+eye movements that exhibit realistic coordination with the body remains
+under-explored. We first report a comprehensive analysis of the coordination of
+human eye and full-body movements during everyday activities based on data from
+the MoGaze and GIMO datasets. We show that eye gaze has strong correlations
+with head directions and also full-body motions and there exists a noticeable
+time delay between body and eye movements. Inspired by the analyses, we then
+present Pose2Gaze -- a novel eye-body coordination model that first uses a
+convolutional neural network and a spatio-temporal graph convolutional neural
+network to extract features from head directions and full-body poses
+respectively and then applies a convolutional neural network to generate
+realistic eye movements. We compare our method with state-of-the-art methods
+that predict eye gaze only from head movements for three different generation
+tasks and demonstrate that Pose2Gaze significantly outperforms these baselines
+on both datasets with an average improvement of 26.4% and 21.6% in mean angular
+error, respectively. Our findings underline the significant potential of
+cross-modal human gaze behaviour analysis and modelling.
+
+
+
+
+
+
+
+ ☆ Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint
+ Method
+
+
+
+
+
+
+
+
+ Jiachun Pan, Hanshu Yan, Jun Hao Liew, Jiashi Feng, Vincent Y. F. Tan
+
+
+ Training-free guided sampling in diffusion models leverages off-the-shelf
+pre-trained networks, such as an aesthetic evaluation model, to guide the
+generation process. Current training-free guided sampling algorithms obtain the
+guidance energy function based on a one-step estimate of the clean image.
+However, since the off-the-shelf pre-trained networks are trained on clean
+images, the one-step estimation procedure of the clean image may be inaccurate,
+especially in the early stages of the generation process in diffusion models.
+This causes the guidance in the early time steps to be inaccurate. To overcome
+this problem, we propose Symplectic Adjoint Guidance (SAG), which calculates
+the gradient guidance in two inner stages. Firstly, SAG estimates the clean
+image via $n$ function calls, where $n$ serves as a flexible hyperparameter
+that can be tailored to meet specific image quality requirements. Secondly, SAG
+uses the symplectic adjoint method to obtain the gradients accurately and
+efficiently in terms of the memory requirements. Extensive experiments
+demonstrate that SAG generates images with higher qualities compared to the
+baselines in both guided image and video generation tasks.
+
+
+
+
+
+
+
+
+ Siamul Karim Khan, Patrick Tinsley, Mahsa Mitcheff, Patrick Flynn, Kevin W. Bowyer, Adam Czajka
+
+
+ Synthesis of same-identity biometric iris images, both for existing and
+non-existing identities while preserving the identity across a wide range of
+pupil sizes, is complex due to intricate iris muscle constriction mechanism,
+requiring a precise model of iris non-linear texture deformations to be
+embedded into the synthesis pipeline. This paper presents the first method of
+fully data-driven, identity-preserving, pupil size-varying s ynthesis of iris
+images. This approach is capable of synthesizing images of irises with
+different pupil sizes representing non-existing identities as well as
+non-linearly deforming the texture of iris images of existing subjects given
+the segmentation mask of the target iris image. Iris recognition experiments
+suggest that the proposed deformation model not only preserves the identity
+when changing the pupil size but offers better similarity between same-identity
+iris samples with significant differences in pupil size, compared to
+state-of-the-art linear and non-linear (bio-mechanical-based) iris deformation
+models. Two immediate applications of the proposed approach are: (a) synthesis
+of, or enhancement of the existing biometric datasets for iris recognition,
+mimicking those acquired with iris sensors, and (b) helping forensic human
+experts in examining iris image pairs with significant differences in pupil
+dilation. Source codes and weights of the models are made available with the
+paper.
+
+
+ Laparoscopic surgery offers minimally invasive procedures with better patient
+outcomes, but smoke presence challenges visibility and safety. Existing
+learning-based methods demand large datasets and high computational resources.
+We propose the Progressive Frequency-Aware Network (PFAN), a lightweight GAN
+framework for laparoscopic image desmoking, combining the strengths of CNN and
+Transformer for progressive information extraction in the frequency domain.
+PFAN features CNN-based Multi-scale Bottleneck-Inverting (MBI) Blocks for
+capturing local high-frequency information and Locally-Enhanced Axial Attention
+Transformers (LAT) for efficiently handling global low-frequency information.
+PFAN efficiently desmokes laparoscopic images even with limited training data.
+Our method outperforms state-of-the-art approaches in PSNR, SSIM, CIEDE2000,
+and visual quality on the Cholec80 dataset and retains only 629K parameters.
+Our code and models are made publicly available at:
+https://github.com/jlzcode/PFAN.
+
+
+
+
+
+
+
+ ☆ Diffusing More Objects for Semi-Supervised Domain Adaptation with Less
+ Labeling NeurIPS 2023
+
+
+
+
+
+
+
+
+ Leander van den Heuvel, Gertjan Burghouts, David W. Zhang, Gwenn Englebienne, Sabina B. van Rooij
+
+
+ For object detection, it is possible to view the prediction of bounding boxes
+as a reverse diffusion process. Using a diffusion model, the random bounding
+boxes are iteratively refined in a denoising step, conditioned on the image. We
+propose a stochastic accumulator function that starts each run with random
+bounding boxes and combines the slightly different predictions. We empirically
+verify that this improves detection performance. The improved detections are
+leveraged on unlabelled images as weighted pseudo-labels for semi-supervised
+learning. We evaluate the method on a challenging out-of-domain test set. Our
+method brings significant improvements and is on par with human-selected
+pseudo-labels, while not requiring any human involvement.
+
+
+
+ comment: 4 pages, Workshop on DiffusionModels, NeurIPS 2023
+
+
+
+
+
+
+ ☆ Optimizing Diffusion Noise Can Serve As Universal Motion Priors
+
+
+ We propose Diffusion Noise Optimization (DNO), a new method that effectively
+leverages existing motion diffusion models as motion priors for a wide range of
+motion-related tasks. Instead of training a task-specific diffusion model for
+each new task, DNO operates by optimizing the diffusion latent noise of an
+existing pre-trained text-to-motion model. Given the corresponding latent noise
+of a human motion, it propagates the gradient from the target criteria defined
+on the motion space through the whole denoising process to update the diffusion
+latent noise. As a result, DNO supports any use cases where criteria can be
+defined as a function of motion. In particular, we show that, for motion
+editing and control, DNO outperforms existing methods in both achieving the
+objective and preserving the motion content. DNO accommodates a diverse range
+of editing modes, including changing trajectory, pose, joint locations, or
+avoiding newly added obstacles. In addition, DNO is effective in motion
+denoising and completion, producing smooth and realistic motion from noisy and
+partial inputs. DNO achieves these results at inference time without the need
+for model retraining, offering great versatility for any defined reward or loss
+function on the motion representation.
+
+
+
+
+
+
+
+ ☆ Continual Learning: Forget-free Winning Subnetworks for Video
+ Representations
+
+
+
+
+
+
+
+
+ Haeyong Kang, Jaehong Yoon, Sung Ju Hwang, Chang D. Yoo
+
+
+ Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which
+highlights the presence of competitive subnetworks within dense networks for
+continual learning tasks, we introduce Winning Subnetworks (WSN). This approach
+utilizes reused weights in dense networks to enhance learning in Task
+Incremental Learning (TIL) scenarios. To mitigate overfitting in Few-Shot Class
+Incremental Learning (FSCIL), we have developed WSN variants referred to as the
+Soft subnetwork (SoftNet). Furthermore, addressing WSN's limitation of sparse
+reused weights in Video Incremental Learning (VIL), we propose the Fourier
+Subneural Operator (FSO). The FSO, operating in Fourier space, adaptively and
+compactly encodes videos, discovering reusable subnetworks with diverse
+bandwidths. We have applied FSO's Fourier representations to various continual
+learning contexts, including VIL, TIL, and FSCIL. Our extensive experiments
+across these scenarios demonstrate FSO's remarkable efficacy in continual
+learning, significantly enhancing task performance at various convolutional
+representational levels: it boosts performance in the higher layers for TIL and
+FSCIL and the lower layers for VIL.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2303.14962,
+ arXiv:2306.11305
+
+
+
+
+
+
+ ☆ Expressive Forecasting of 3D Whole-body Human Motions
+
+
+
+
+
+
+
+
+ Pengxiang Ding, Qiongjie Cui, Min Zhang, Mengyuan Liu, Haofan Wang, Donglin Wang
+
+
+ Human motion forecasting, with the goal of estimating future human behavior
+over a period of time, is a fundamental task in many real-world applications.
+However, existing works typically concentrate on predicting the major joints of
+the human body without considering the delicate movements of the human hands.
+In practical applications, hand gesture plays an important role in human
+communication with the real world, and expresses the primary intention of human
+beings. In this work, we are the first to formulate a whole-body human pose
+forecasting task, which jointly predicts the future body and hand activities.
+Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI)
+framework that aims to predict both coarse (body joints) and fine-grained
+(gestures) activities collaboratively, enabling expressive and
+cross-facilitated forecasting of 3D whole-body human motions. Specifically, our
+model involves two key constituents: cross-context alignment (XCA) and
+cross-context interaction (XCI). Considering the heterogeneous information
+within the whole-body, XCA aims to align the latent features of various human
+components, while XCI focuses on effectively capturing the context interaction
+among the human components. We conduct extensive experiments on a
+newly-introduced large-scale benchmark and achieve state-of-the-art
+performance. The code is public for research purposes at
+https://github.com/Dingpx/EAI.
+
+
+
+
+
+
+
+ ☆ Context Disentangling and Prototype Inheriting for Robust Visual
+ Grounding
+
+
+
+
+
+
+
+
+ Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, Zechao Li
+
+
+ Visual grounding (VG) aims to locate a specific target in an image based on a
+given language query. The discriminative information from context is important
+for distinguishing the target from other objects, particularly for the targets
+that have the same category as others. However, most previous methods
+underestimate such information. Moreover, they are usually designed for the
+standard scene (without any novel object), which limits their generalization to
+the open-vocabulary scene. In this paper, we propose a novel framework with
+context disentangling and prototype inheriting for robust visual grounding to
+handle both scenes. Specifically, the context disentangling disentangles the
+referent and context features, which achieves better discrimination between
+them. The prototype inheriting inherits the prototypes discovered from the
+disentangled visual features by a prototype bank to fully utilize the seen
+data, especially for the open-vocabulary scene. The fused features, obtained by
+leveraging Hadamard product on disentangled linguistic and visual features of
+prototypes to avoid sharp adjusting the importance between the two types of
+features, are then attached with a special token and feed to a vision
+Transformer encoder for bounding box regression. Extensive experiments are
+conducted on both standard and open-vocabulary scenes. The performance
+comparisons indicate that our method outperforms the state-of-the-art methods
+in both scenarios. {The code is available at
+https://github.com/WayneTomas/TransCP.
+
+
+ Data mixing augmentation has been widely applied to improve the
+generalization ability of deep neural networks. Recently, offline data mixing
+augmentation, e.g. handcrafted and saliency information-based mixup, has been
+gradually replaced by automatic mixing approaches. Through minimizing two
+sub-tasks, namely, mixed sample generation and mixup classification in an
+end-to-end way, AutoMix significantly improves accuracy on image classification
+tasks. However, as the optimization objective is consistent for the two
+sub-tasks, this approach is prone to generating consistent instead of diverse
+mixed samples, which results in overfitting for target task training. In this
+paper, we propose AdAutomixup, an adversarial automatic mixup augmentation
+approach that generates challenging samples to train a robust classifier for
+image classification, by alternatively optimizing the classifier and the mixup
+sample generator. AdAutomixup comprises two modules, a mixed example generator,
+and a target classifier. The mixed sample generator aims to produce hard mixed
+examples to challenge the target classifier while the target classifier`s aim
+is to learn robust features from hard mixed examples to improve generalization.
+To prevent the collapse of the inherent meanings of images, we further
+introduce an exponential moving average (EMA) teacher and cosine similarity to
+train AdAutomixup in an end-to-end way. Extensive experiments on seven image
+benchmarks consistently prove that our approach outperforms the state of the
+art in various classification scenarios.
+
+
+
+
+
+
+
+
+ Yuang Liu, Jing Wang, Qiang Zhou, Fan Wang, Jun Wang, Wei Zhang
+
+
+ Numerous self-supervised learning paradigms, such as contrastive learning and
+masked image modeling, have been proposed to acquire powerful and general
+representations from unlabeled data. However, these models are commonly
+pretrained within their specific framework alone, failing to consider the
+complementary nature of visual representations. To tackle this issue, we
+introduce Comprehensive Distillation with Multiple Self-supervised Teachers
+(DMT) for pretrained model compression, which leverages the strengths of
+multiple off-the-shelf self-supervised models. Our experimental results on
+prominent benchmark datasets exhibit that the proposed method significantly
+surpasses state-of-the-art competitors while retaining favorable efficiency
+metrics. On classification tasks, our DMT framework utilizing three different
+self-supervised ViT-Base teachers enhances the performance of both small/tiny
+models and the base model itself. For dense tasks, DMT elevates the AP/mIoU of
+standard SSL models on MS-COCO and ADE20K datasets by 4.0%.
+
+
+
+ comment: ICASSP 2024
+
+
+
+
+
+
+ ☆ Transformer Network for Multi-Person Tracking and Re-Identification in
+ Unconstrained Environment
+
+
+
+
+
+
+
+
+ Hamza Mukhtar, Muhammad Usman Ghani Khan
+
+
+ Multi-object tracking (MOT) has profound applications in a variety of fields,
+including surveillance, sports analytics, self-driving, and cooperative
+robotics. Despite considerable advancements, existing MOT methodologies tend to
+falter when faced with non-uniform movements, occlusions, and
+appearance-reappearance scenarios of the objects. Recognizing this inadequacy,
+we put forward an integrated MOT method that not only marries object detection
+and identity linkage within a singular, end-to-end trainable framework but also
+equips the model with the ability to maintain object identity links over long
+periods of time. Our proposed model, named STMMOT, is built around four key
+modules: 1) candidate proposal generation, which generates object proposals via
+a vision-transformer encoder-decoder architecture that detects the object from
+each frame in the video; 2) scale variant pyramid, a progressive pyramid
+structure to learn the self-scale and cross-scale similarities in multi-scale
+feature maps; 3) spatio-temporal memory encoder, extracting the essential
+information from the memory associated with each object under tracking; and 4)
+spatio-temporal memory decoder, simultaneously resolving the tasks of object
+detection and identity association for MOT. Our system leverages a robust
+spatio-temporal memory module that retains extensive historical observations
+and effectively encodes them using an attention-based aggregator. The
+uniqueness of STMMOT lies in representing objects as dynamic query embeddings
+that are updated continuously, which enables the prediction of object states
+with attention mechanisms and eradicates the need for post-processing.
+
+
+
+
+
+
+
+ ☆ IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text
+ Recognition
+
+
+ Nowadays, scene text recognition has attracted more and more attention due to
+its diverse applications. Most state-of-the-art methods adopt an
+encoder-decoder framework with the attention mechanism, autoregressively
+generating text from left to right. Despite the convincing performance, this
+sequential decoding strategy constrains inference speed. Conversely,
+non-autoregressive models provide faster, simultaneous predictions but often
+sacrifice accuracy. Although utilizing an explicit language model can improve
+performance, it burdens the computational load. Besides, separating linguistic
+knowledge from vision information may harm the final prediction. In this paper,
+we propose an alternative solution, using a parallel and iterative decoder that
+adopts an easy-first decoding strategy. Furthermore, we regard text recognition
+as an image-based conditional text generation task and utilize the discrete
+diffusion strategy, ensuring exhaustive exploration of bidirectional contextual
+information. Extensive experiments demonstrate that the proposed approach
+achieves superior results on the benchmark datasets, including both Chinese and
+English text images.
+
+
+
+
+
+
+
+ ☆ EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State
+ Estimation and 3D Dense Mapping
+
+
+ Event cameras are bio-inspired, motion-activated sensors that demonstrate
+substantial potential in handling challenging situations, such as motion blur
+and high-dynamic range. In this paper, we proposed EVI-SAM to tackle the
+problem of 6 DoF pose tracking and 3D reconstruction using monocular event
+camera. A novel event-based hybrid tracking framework is designed to estimate
+the pose, leveraging the robustness of feature matching and the precision of
+direct alignment. Specifically, we develop an event-based 2D-2D alignment to
+construct the photometric constraint, and tightly integrate it with the
+event-based reprojection constraint. The mapping module recovers the dense and
+colorful depth of the scene through the image-guided event-based mapping
+method. Subsequently, the appearance, texture, and surface mesh of the 3D scene
+can be reconstructed by fusing the dense depth map from multiple viewpoints
+using truncated signed distance function (TSDF) fusion. To the best of our
+knowledge, this is the first non-learning work to realize event-based dense
+mapping. Numerical evaluations are performed on both publicly available and
+self-collected datasets, which qualitatively and quantitatively demonstrate the
+superior performance of our method. Our EVI-SAM effectively balances accuracy
+and robustness while maintaining computational efficiency, showcasing superior
+pose tracking and dense mapping performance in challenging scenarios. Video
+Demo: https://youtu.be/Nn40U4e5Si8.
+
+
+
+
+
+
+
+ ☆ Text-Conditioned Resampler For Long Form Video Understanding
+
+
+
+
+
+
+
+
+ Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
+
+
+ Videos are highly redundant data source and it is often enough to identify a
+few key moments to solve any given task. In this paper, we present a
+text-conditioned video resampler (TCR) module that uses a pre-trained and
+frozen visual encoder and large language model (LLM) to process long video
+sequences for a task. TCR localises relevant visual features from the video
+given a text condition and provides them to a LLM to generate a text response.
+Due to its lightweight design and use of cross-attention, TCR can process more
+than 100 frames at a time allowing the model to use much longer chunks of video
+than earlier works. We make the following contributions: (i) we design a
+transformer-based sampling architecture that can process long videos
+conditioned on a task, together with a training method that enables it to
+bridge pre-trained visual and language models; (ii) we empirically validate its
+efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art
+on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks
+which require longer video contexts and that can thus be used effectively for
+further evaluation of long-range video models.
+
+
+
+
+
+
+
+ ☆ 3D-LFM: Lifting Foundation Model
+
+
+
+
+
+
+
+
+ Mosam Dabhi, Laszlo A. Jeni, Simon Lucey
+
+
+ The lifting of 3D structure and camera from 2D landmarks is at the
+cornerstone of the entire discipline of computer vision. Traditional methods
+have been confined to specific rigid objects, such as those in
+Perspective-n-Point (PnP) problems, but deep learning has expanded our
+capability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL)
+with resilience to noise, occlusions, and perspective distortions. All these
+techniques, however, have been limited by the fundamental need to establish
+correspondences across the 3D training data -- significantly limiting their
+utility to applications where one has an abundance of "in-correspondence" 3D
+data. Our approach harnesses the inherent permutation equivariance of
+transformers to manage varying number of points per 3D data instance,
+withstands occlusions, and generalizes to unseen categories. We demonstrate
+state of the art performance across 2D-3D lifting task benchmarks. Since our
+approach can be trained across such a broad class of structures we refer to it
+simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.
+
+
+
+ comment: Project page is available at https://3dlfm.github.io
+
+
+
+
+
+
+ ☆ Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case
+ Study on Urban Areas
+
+
+
+
+
+
+
+
+ Alperen Enes Bayar, Ufuk Uyan, Elif Toprak, Cao Yuheng, Tang Juncheng, Ahmet Alp Kindiroglu
+
+
+ Urban environments are characterized by complex structures and diverse
+features, making accurate segmentation of point cloud data a challenging task.
+This paper presents a comprehensive study on the application of RandLA-Net, a
+state-of-the-art neural network architecture, for the 3D segmentation of
+large-scale point cloud data in urban areas. The study focuses on three major
+Chinese cities, namely Chengdu, Jiaoda, and Shenzhen, leveraging their unique
+characteristics to enhance segmentation performance.
+ To address the limited availability of labeled data for these specific urban
+areas, we employed transfer learning techniques. We transferred the learned
+weights from the Sensat Urban and Toronto 3D datasets to initialize our
+RandLA-Net model. Additionally, we performed class remapping to adapt the model
+to the target urban areas, ensuring accurate segmentation results.
+ The experimental results demonstrate the effectiveness of the proposed
+approach achieving over 80\% F1 score for each areas in 3D point cloud
+segmentation. The transfer learning strategy proves to be crucial in overcoming
+data scarcity issues, providing a robust solution for urban point cloud
+analysis. The findings contribute to the advancement of point cloud
+segmentation methods, especially in the context of rapidly evolving Chinese
+urban areas.
+
+
+ One of the ultimate goals of representation learning is to achieve
+compactness within a class and well-separability between classes. Many
+outstanding metric-based and prototype-based methods following the
+Expectation-Maximization paradigm, have been proposed for this objective.
+However, they inevitably introduce biases into the learning process,
+particularly with long-tail distributed training data. In this paper, we reveal
+that the class prototype is not necessarily to be derived from training
+features and propose a novel perspective to use pre-defined class anchors
+serving as feature centroid to unidirectionally guide feature learning.
+However, the pre-defined anchors may have a large semantic distance from the
+pixel features, which prevents them from being directly applied. To address
+this issue and generate feature centroid independent from feature learning, a
+simple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR
+ensures the interclass separability of semantic anchors in the semantic space
+by employing a classifier-aware auxiliary cross-entropy loss during training
+via disentanglement learning. By pulling the learned features to these semantic
+anchors, several advantages can be attained: 1) the intra-class compactness and
+naturally inter-class separability, 2) induced bias or errors from feature
+learning can be avoided, and 3) robustness to the long-tailed problem. The
+proposed SAR can be used in a plug-and-play manner in the existing models.
+Extensive experiments demonstrate that the SAR performs better than previous
+sophisticated prototype-based methods. The implementation is available at
+https://github.com/geyanqi/SAR.
+
+
+
+
+
+
+
+ ☆ Point Cloud Part Editing: Segmentation, Generation, Assembly, and
+ Selection AAAI 2024
+
+
+
+
+
+
+
+
+ Kaiyi Zhang, Yang Chen, Ximing Yang, Weizhong Zhang, Cheng Jin
+
+
+ Ideal part editing should guarantee the diversity of edited parts, the
+fidelity to the remaining parts, and the quality of the results. However,
+previous methods do not disentangle each part completely, which means the
+edited parts will affect the others, resulting in poor diversity and fidelity.
+In addition, some methods lack constraints between parts, which need manual
+selections of edited results to ensure quality. Therefore, we propose a
+four-stage process for point cloud part editing: Segmentation, Generation,
+Assembly, and Selection. Based on this process, we introduce SGAS, a model for
+part editing that employs two strategies: feature disentanglement and
+constraint. By independently fitting part-level feature distributions, we
+realize the feature disentanglement. By explicitly modeling the transformation
+from object-level distribution to part-level distributions, we realize the
+feature constraint. Considerable experiments on different datasets demonstrate
+the efficiency and effectiveness of SGAS on point cloud part editing. In
+addition, SGAS can be pruned to realize unsupervised part-aware point cloud
+generation and achieves state-of-the-art results.
+
+
+
+ comment: 9 pages, 7 figures, AAAI 2024
+
+
+
+
+
+
+ ☆ Topo-MLP : A Simplicial Network Without Message Passing
+
+
+ Due to their ability to model meaningful higher order relations among a set
+of entities, higher order network models have emerged recently as a powerful
+alternative for graph-based network models which are only capable of modeling
+binary relationships. Message passing paradigm is still dominantly used to
+learn representations even for higher order network models. While powerful,
+message passing can have disadvantages during inference, particularly when the
+higher order connectivity information is missing or corrupted. To overcome such
+limitations, we propose Topo-MLP, a purely MLP-based simplicial neural network
+algorithm to learn the representation of elements in a simplicial complex
+without explicitly relying on message passing. Our framework utilizes a novel
+Higher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates
+the simplicial structure into representation learning. Our proposed model's
+simplicity makes it faster during inference. Moreover, we show that our model
+is robust when faced with missing or corrupted connectivity structure.
+
+
+ 3D-aware Generative Adversarial Networks (3D-GANs) currently exhibit
+artifacts in their 3D geometrical modeling, such as mesh imperfections and
+holes. These shortcomings are primarily attributed to the limited availability
+of annotated 3D data, leading to a constrained "valid latent area" for
+satisfactory modeling. To address this, we present a Self-Supervised Learning
+(SSL) technique tailored as an auxiliary loss for any 3D-GAN, designed to
+improve its 3D geometrical modeling capabilities. Our approach pioneers an
+inversion technique for 3D-GANs, integrating an encoder that performs adaptive
+spatially-varying range operations. Utilizing this inversion, we introduce the
+Cyclic Generative Constraint (CGC), aiming to densify the valid latent space.
+The CGC operates via augmented local latent vectors that maintain the same
+geometric form, and it imposes constraints on the cycle path outputs,
+specifically the generator-encoder-generator sequence. This SSL methodology
+seamlessly integrates with the inherent GAN loss, ensuring the integrity of
+pre-existing 3D-GAN architectures without necessitating alterations. We
+validate our approach with comprehensive experiments across various datasets
+and architectures, underscoring its efficacy. Our project website:
+https://3dgan-ssl.github.io
+
+
+
+ comment: 13 pages, 12 figures, 6 tables
+
+
+
+
+
+
+ ☆ GCNext: Towards the Unity of Graph Convolutions for Human Motion
+ Prediction AAAI
+
+
+ The past few years has witnessed the dominance of Graph Convolutional
+Networks (GCNs) over human motion prediction.Various styles of graph
+convolutions have been proposed, with each one meticulously designed and
+incorporated into a carefully-crafted network architecture. This paper breaks
+the limits of existing knowledge by proposing Universal Graph Convolution
+(UniGC), a novel graph convolution concept that re-conceptualizes different
+graph convolutions as its special cases. Leveraging UniGC on network-level, we
+propose GCNext, a novel GCN-building paradigm that dynamically determines the
+best-fitting graph convolutions both sample-wise and layer-wise. GCNext offers
+multiple use cases, including training a new GCN from scratch or refining a
+preexisting GCN. Experiments on Human3.6M, AMASS, and 3DPW datasets show that,
+by incorporating unique module-to-network designs, GCNext yields up to 9x lower
+computational cost than existing GCN methods, on top of achieving
+state-of-the-art performance.
+
+
+
+ comment: to be published in the 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI-24)
+
+
+
+
+
+
+ ☆ Active contours driven by local and global intensity fitting energy with
+ application to SAR image segmentation and its fast solvers
+
+
+ In this paper, we propose a novel variational active contour model based on
+Aubert-Aujol (AA) denoising model, which hybrides geodesic active contour (GAC)
+model with active contours without edges (ACWE) model and can be used to
+segment images corrupted by multiplicative gamma noise. We transform the
+proposed model into classic ROF model by adding a proximity term. Inspired by a
+fast denosing algorithm proposed by Jia-Zhao recently, we propose two fast
+fixed point algorithms to solve SAR image segmentation question. Experimental
+results for real SAR images show that the proposed image segmentation model can
+efficiently stop the contours at weak or blurred edges, and can automatically
+detect the exterior and interior boundaries of images with multiplicative gamma
+noise. The proposed fast fixed point algorithms are robustness to
+initialization contour, and can further reduce about 15% of the time needed for
+algorithm proposed by Goldstein-Osher.
+
+
+
+ comment: 20 pages,28 figures. arXiv admin note: substantial text overlap with
+ arXiv:2312.08376, arXiv:2312.09365
+
+
+
+
+
+
+
+ Chaojian Li, Bichen Wu, Peter Vajda, Yingyan, Lin
+
+
+ Neural Radiance Field (NeRF) has emerged as a leading technique for novel
+view synthesis, owing to its impressive photorealistic reconstruction and
+rendering capability. Nevertheless, achieving real-time NeRF rendering in
+large-scale scenes has presented challenges, often leading to the adoption of
+either intricate baked mesh representations with a substantial number of
+triangles or resource-intensive ray marching in baked representations. We
+challenge these conventions, observing that high-quality geometry, represented
+by meshes with substantial triangles, is not necessary for achieving
+photorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF
+representation that includes a low-quality mesh, a view-dependent displacement
+map, and a compressed NeRF model. This design effectively harnesses the
+capabilities of existing graphics hardware, thus enabling real-time NeRF
+rendering on edge devices. Leveraging a highly-optimized WebGL-based rendering
+framework, our proposed MixRT attains real-time rendering speeds on edge
+devices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),
+better rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360
+datasets), and a smaller storage size (less than 80% compared to
+state-of-the-art methods).
+
+
+
+ comment: Accepted by 3DV'24. Project Page: https://licj15.github.io/MixRT/
+
+
+
+
+
+
+ ☆ Regulating Intermediate 3D Features for Vision-Centric Autonomous
+ Driving AAAI 2024
+
+
+
+
+
+
+
+
+ Junkai Xu, Liang Peng, Haoran Cheng, Linxuan Xia, Qi Zhou, Dan Deng, Wei Qian, Wenxiao Wang, Deng Cai
+
+
+ Multi-camera perception tasks have gained significant attention in the field
+of autonomous driving. However, existing frameworks based on Lift-Splat-Shoot
+(LSS) in the multi-camera setting cannot produce suitable dense 3D features due
+to the projection nature and uncontrollable densification process. To resolve
+this problem, we propose to regulate intermediate dense 3D features with the
+help of volume rendering. Specifically, we employ volume rendering to process
+the dense 3D features to obtain corresponding 2D features (e.g., depth maps,
+semantic maps), which are supervised by associated labels in the training. This
+manner regulates the generation of dense 3D features on the feature level,
+providing appropriate dense and unified features for multiple perception tasks.
+Therefore, our approach is termed Vampire, stands for "Volume rendering As
+Multi-camera Perception Intermediate feature REgulator". Experimental results
+on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates
+fine-grained and appropriate extraction of dense 3D features, and is
+competitive with existing SOTA methods across diverse downstream perception
+tasks like 3D occupancy prediction, LiDAR segmentation and 3D objection
+detection, while utilizing moderate GPU resources. We provide a video
+demonstration in the supplementary materials and Codes are available at
+github.com/cskkxjk/Vampire.
+
+
+ 3D occupancy prediction is an emerging task that aims to estimate the
+occupancy states and semantics of 3D scenes using multi-view images. However,
+image-based scene perception encounters significant challenges in achieving
+accurate prediction due to the absence of geometric priors. In this paper, we
+address this issue by exploring cross-modal knowledge distillation in this
+task, i.e., we leverage a stronger multi-modal model to guide the visual model
+during training. In practice, we observe that directly applying features or
+logits alignment, proposed and widely used in bird's-eyeview (BEV) perception,
+does not yield satisfactory results. To overcome this problem, we introduce
+RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
+By employing differentiable volume rendering, we generate depth and semantic
+maps in perspective views and propose two novel consistency criteria between
+the rendered outputs of teacher and student models. Specifically, the depth
+consistency loss aligns the termination distributions of the rendered rays,
+while the semantic consistency loss mimics the intra-segment similarity guided
+by vision foundation models (VLMs). Experimental results on the nuScenes
+dataset demonstrate the effectiveness of our proposed method in improving
+various 3D occupancy prediction approaches, e.g., our proposed methodology
+enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D
+benchmark.
+
+
+
+
+
+
+
+
+ Yufei Cai, Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hu Han, Wangmeng Zuo
+
+
+ Customized text-to-image generation, which aims to learn user-specified
+concepts with a few images, has drawn significant attention recently. However,
+existing methods usually suffer from overfitting issues and entangle the
+subject-unrelated information (e.g., background and pose) with the learned
+concept, limiting the potential to compose concept into new scenes. To address
+these issues, we propose the DETEX, a novel approach that learns the
+disentangled concept embedding for flexible customized text-to-image
+generation. Unlike conventional methods that learn a single concept embedding
+from the given images, our DETEX represents each image using multiple word
+embeddings during training, i.e., a learnable image-shared subject embedding
+and several image-specific subject-unrelated embeddings. To decouple irrelevant
+attributes (i.e., background and pose) from the subject embedding, we further
+present several attribute mappers that encode each image as several
+image-specific subject-unrelated embeddings. To encourage these unrelated
+embeddings to capture the irrelevant information, we incorporate them with
+corresponding attribute words and propose a joint training strategy to
+facilitate the disentanglement. During inference, we only use the subject
+embedding for image generation, while selectively using image-specific
+embeddings to retain image-specified attributes. Extensive experiments
+demonstrate that the subject embedding obtained by our method can faithfully
+represent the target concept, while showing superior editability compared to
+the state-of-the-art methods. Our code will be made published available.
+
+
+
+ comment: 16 pages, 16 figures
+
+
+
+
+
+
+ ☆ A Dual-way Enhanced Framework from Text Matching Point of View for
+ Multimodal Entity Linking
+
+
+ Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with
+multimodal information to entity in Knowledge Graph (KG) such as Wikipedia,
+which plays a key role in many applications. However, existing methods suffer
+from shortcomings, including modality impurity such as noise in raw image and
+ambiguous textual entity representation, which puts obstacles to MEL. We
+formulate multimodal entity linking as a neural text matching problem where
+each multimodal information (text and image) is treated as a query, and the
+model learns the mapping from each query to the relevant entity from candidate
+entities. This paper introduces a dual-way enhanced (DWE) framework for MEL:
+(1) our model refines queries with multimodal data and addresses semantic gaps
+using cross-modal enhancers between text and image information. Besides, DWE
+innovatively leverages fine-grained image attributes, including facial
+characteristic and scene feature, to enhance and refine visual features. (2)By
+using Wikipedia descriptions, DWE enriches entity semantics and obtains more
+comprehensive textual representation, which reduces between textual
+representation and the entities in KG. Extensive experiments on three public
+benchmarks demonstrate that our method achieves state-of-the-art (SOTA)
+performance, indicating the superiority of our model. The code is released on
+https://github.com/season1blue/DWE
+
+
+
+
+
+
+
+ ☆ Advancements and Challenges in Arabic Optical Character Recognition: A
+ Comprehensive Survey
+
+
+ Optical character recognition (OCR) is a vital process that involves the
+extraction of handwritten or printed text from scanned or printed images,
+converting it into a format that can be understood and processed by machines.
+This enables further data processing activities such as searching and editing.
+The automatic extraction of text through OCR plays a crucial role in digitizing
+documents, enhancing productivity, improving accessibility, and preserving
+historical records. This paper seeks to offer an exhaustive review of
+contemporary applications, methodologies, and challenges associated with Arabic
+Optical Character Recognition (OCR). A thorough analysis is conducted on
+prevailing techniques utilized throughout the OCR process, with a dedicated
+effort to discern the most efficacious approaches that demonstrate enhanced
+outcomes. To ensure a thorough evaluation, a meticulous keyword-search
+methodology is adopted, encompassing a comprehensive analysis of articles
+relevant to Arabic OCR, including both backward and forward citation reviews.
+In addition to presenting cutting-edge techniques and methods, this paper
+critically identifies research gaps within the realm of Arabic OCR. By
+highlighting these gaps, we shed light on potential areas for future
+exploration and development, thereby guiding researchers toward promising
+avenues in the field of Arabic OCR. The outcomes of this study provide valuable
+insights for researchers, practitioners, and stakeholders involved in Arabic
+OCR, ultimately fostering advancements in the field and facilitating the
+creation of more accurate and efficient OCR systems for the Arabic language.
+
+
+
+
+
+
+
+ ☆ Gemini: A Family of Highly Capable Multimodal Models
+
+
+
+
+
+
+
+
+ Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter, Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yujing Zhang, Ravi Addanki, Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown, Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaly Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, Hanzhao Lin, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo-yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yong Cheng, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, Thais Kagohara, Jay Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, Jinwei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek Jindal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park, Elnaz Davoodi, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Manaal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor, Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov, Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C. Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar, Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polozov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey, Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Severyn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Sabaer Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li, Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan, Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Summer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi Lahoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev, Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Komal Jalan, Dinghua Li, Ginger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan, Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals
+
+
+ This report introduces a new family of multimodal models, Gemini, that
+exhibit remarkable capabilities across image, audio, video, and text
+understanding. The Gemini family consists of Ultra, Pro, and Nano sizes,
+suitable for applications ranging from complex reasoning tasks to on-device
+memory-constrained use-cases. Evaluation on a broad range of benchmarks shows
+that our most-capable Gemini Ultra model advances the state of the art in 30 of
+32 of these benchmarks - notably being the first model to achieve human-expert
+performance on the well-studied exam benchmark MMLU, and improving the state of
+the art in every one of the 20 multimodal benchmarks we examined. We believe
+that the new capabilities of Gemini models in cross-modal reasoning and
+language understanding will enable a wide variety of use cases and we discuss
+our approach toward deploying them responsibly to users.
+
+
+
+
+
+
+
+ ☆ An effective image copy-move forgery detection using entropy image
+
+
+ Image forensics has become increasingly important in our daily lives. As a
+fundamental type of forgeries, Copy-Move Forgery Detection (CMFD) has received
+significant attention in the academic community. Keypoint-based algorithms,
+particularly those based on SIFT, have achieved good results in CMFD. However,
+the most of keypoint detection algorithms often fail to generate sufficient
+matches when tampered patches are present in smooth areas. To tackle this
+problem, we introduce entropy images to determine the coordinates and scales of
+keypoints, resulting significantly increasing the number of keypoints.
+Furthermore, we develop an entropy level clustering algorithm to avoid
+increased matching complexity caused by non-ideal distribution of grayscale
+values in keypoints. Experimental results demonstrate that our algorithm
+achieves a good balance between performance and time efficiency.
+
+
+
+
+
+
+
+ ☆ Learning Object State Changes in Videos: An Open-World Perspective SC
+
+
+ Object State Changes (OSCs) are pivotal for video understanding. While humans
+can effortlessly generalize OSC understanding from familiar to unknown objects,
+current approaches are confined to a closed vocabulary. Addressing this gap, we
+introduce a novel open-world formulation for the video OSC problem. The goal is
+to temporally localize the three stages of an OSC -- the object's initial
+state, its transitioning state, and its end state -- whether or not the object
+has been observed during training. Towards this end, we develop VidOSC, a
+holistic learning approach that: (1) leverages text and vision-language models
+for supervisory signals to obviate manually labeling OSC training data, and (2)
+abstracts fine-grained shared state representations from objects to enhance
+generalization. Furthermore, we present HowToChange, the first open-world
+benchmark for video OSC localization, which offers an order of magnitude
+increase in the label space and annotation volume compared to the best existing
+benchmark. Experimental results demonstrate the efficacy of our approach, in
+both traditional closed-world and open-world scenarios.
+
+
+
+
+
+
+
+ ☆ Towards SAMBA: Segment Anything Model for Brain Tumor Segmentation in
+ Sub-Sharan African Populations
+
+
+
+
+
+
+
+
+ Mohannad Barakat, Noha Magdy, Jjuuko George William, Ethel Phiri, Raymond Confidence, Dong Zhang, Udunna C Anazodo
+
+
+ Gliomas, the most prevalent primary brain tumors, require precise
+segmentation for diagnosis and treatment planning. However, this task poses
+significant challenges, particularly in the African population, were limited
+access to high-quality imaging data hampers algorithm performance. In this
+study, we propose an innovative approach combining the Segment Anything Model
+(SAM) and a voting network for multi-modal glioma segmentation. By fine-tuning
+SAM with bounding box-guided prompts (SAMBA), we adapt the model to the
+complexities of African datasets. Our ensemble strategy, utilizing multiple
+modalities and views, produces a robust consensus segmentation, addressing
+intra-tumoral heterogeneity. Although the low quality of scans presents
+difficulties, our methodology has the potential to profoundly impact clinical
+practice in resource-limited settings such as Africa, improving treatment
+decisions and advancing neuro-oncology research. Furthermore, successful
+application to other brain tumor types and lesions in the future holds promise
+for a broader transformation in neurological imaging, improving healthcare
+outcomes across all settings. This study was conducted on the Brain Tumor
+Segmentation (BraTS) Challenge Africa (BraTS-Africa) dataset, which provides a
+valuable resource for addressing challenges specific to resource-limited
+settings, particularly the African population, and facilitating the development
+of effective and more generalizable segmentation algorithms. To illustrate our
+approach's potential, our experiments on the BraTS-Africa dataset yielded
+compelling results, with SAM attaining a Dice coefficient of 86.6 for binary
+segmentation and 60.4 for multi-class segmentation.
+
+
+ By lifting the pre-trained 2D diffusion models into Neural Radiance Fields
+(NeRFs), text-to-3D generation methods have made great progress. Many
+state-of-the-art approaches usually apply score distillation sampling (SDS) to
+optimize the NeRF representations, which supervises the NeRF optimization with
+pre-trained text-conditioned 2D diffusion models such as Imagen. However, the
+supervision signal provided by such pre-trained diffusion models only depends
+on text prompts and does not constrain the multi-view consistency. To inject
+the cross-view consistency into diffusion priors, some recent works finetune
+the 2D diffusion model with multi-view data, but still lack fine-grained view
+coherence. To tackle this challenge, we incorporate multi-view image conditions
+into the supervision signal of NeRF optimization, which explicitly enforces
+fine-grained view consistency. With such stronger supervision, our proposed
+text-to-3D method effectively mitigates the generation of floaters (due to
+excessive densities) and completely empty spaces (due to insufficient
+densities). Our quantitative evaluations on the T$^3$Bench dataset demonstrate
+that our method achieves state-of-the-art performance over existing text-to-3D
+methods. We will make the code publicly available.
+
+
+
+
+
+
+
+
+ Emily Kaczmarek, Olivier X. Miguel, Alexa C. Bowie, Robin Ducharme, Alysha L. J. Dingwall-Harvey, Steven Hawken, Christine M. Armour, Mark C. Walker, Kevin Dick
+
+
+ Deep neural networks have been widely adopted in numerous domains due to
+their high performance and accessibility to developers and application-specific
+end-users. Fundamental to image-based applications is the development of
+Convolutional Neural Networks (CNNs), which possess the ability to
+automatically extract features from data. However, comprehending these complex
+models and their learned representations, which typically comprise millions of
+parameters and numerous layers, remains a challenge for both developers and
+end-users. This challenge arises due to the absence of interpretable and
+transparent tools to make sense of black-box models. There exists a growing
+body of Explainable Artificial Intelligence (XAI) literature, including a
+collection of methods denoted Class Activation Maps (CAMs), that seek to
+demystify what representations the model learns from the data, how it informs a
+given prediction, and why it, at times, performs poorly in certain tasks. We
+propose a novel XAI visualization method denoted CAManim that seeks to
+simultaneously broaden and focus end-user understanding of CNN predictions by
+animating the CAM-based network activation maps through all layers, effectively
+depicting from end-to-end how a model progressively arrives at the final layer
+activation. Herein, we demonstrate that CAManim works with any CAM-based method
+and various CNN architectures. Beyond qualitative model assessments, we
+additionally propose a novel quantitative assessment that expands upon the
+Remove and Debias (ROAD) metric, pairing the qualitative end-to-end network
+visual explanations assessment with our novel quantitative "yellow brick ROAD"
+assessment (ybROAD). This builds upon prior research to address the increasing
+demand for interpretable, robust, and transparent model assessment methodology,
+ultimately improving an end-user's trust in a given model's predictions.
+
+
+
+
+
+
+
+
+ Alyssa R. Amod, Alexandra Smith, Pearly Joubert, Confidence Raymond, Dong Zhang, Udunna C. Anazodo, Dodzi Motchon, Tinashe E. M. Mutsvangwa, Sébastien Quetin
+
+
+ A critical challenge for tumour segmentation models is the ability to adapt
+to diverse clinical settings, particularly when applied to poor-quality
+neuroimaging data. The uncertainty surrounding this adaptation stems from the
+lack of representative datasets, leaving top-performing models without exposure
+to common artifacts found in MRI data throughout Sub-Saharan Africa (SSA). We
+replicated a framework that secured the 2nd position in the 2022 BraTS
+competition to investigate the impact of dataset composition on model
+performance and pursued four distinct approaches through training a model with:
+1) BraTS-Africa data only (train_SSA, N=60), 2) BraTS-Adult Glioma data only
+(train_GLI, N=1251), 3) both datasets together (train_ALL, N=1311), and 4)
+through further training the train_GLI model with BraTS-Africa data
+(train_ftSSA). Notably, training on a smaller low-quality dataset alone
+(train_SSA) yielded subpar results, and training on a larger high-quality
+dataset alone (train_GLI) struggled to delineate oedematous tissue in the
+low-quality validation set. The most promising approach (train_ftSSA) involved
+pre-training a model on high-quality neuroimages and then fine-tuning it on the
+smaller, low-quality dataset. This approach outperformed the others, ranking
+second in the MICCAI BraTS Africa global challenge external testing phase.
+These findings underscore the significance of larger sample sizes and broad
+exposure to data in improving segmentation performance. Furthermore, we
+demonstrated that there is potential for improving such models by fine-tuning
+them with a wider range of data locally.
+
+
+
+ comment: 14 pages, 5 figures, 3 tables
+
+
+
+
+
+
+ ☆ ADMM-MM Algorithm for General Tensor Decomposition
+
+
+ In this paper, we propose a new unified optimization algorithm for general
+tensor decomposition which is formulated as an inverse problem for low-rank
+tensors in the general linear observation models. The proposed algorithm
+supports three basic loss functions ($\ell_2$-loss, $\ell_1$-loss and KL
+divergence) and various low-rank tensor decomposition models (CP, Tucker, TT,
+and TR decompositions). We derive the optimization algorithm based on
+hierarchical combination of the alternating direction method of multiplier
+(ADMM) and majorization-minimization (MM). We show that wide-range applications
+can be solved by the proposed algorithm, and can be easily extended to any
+established tensor decomposition models in a {plug-and-play} manner.
+
+
+
+
+
+
+
+ ☆ Convolutional Channel-wise Competitive Learning for the Forward-Forward
+ Algorithm AAAI 2024
+
+
+ The Forward-Forward (FF) Algorithm has been recently proposed to alleviate
+the issues of backpropagation (BP) commonly used to train deep neural networks.
+However, its current formulation exhibits limitations such as the generation of
+negative data, slower convergence, and inadequate performance on complex tasks.
+In this paper, we take the main ideas of FF and improve them by leveraging
+channel-wise competitive learning in the context of convolutional neural
+networks for image classification tasks. A layer-wise loss function is
+introduced that promotes competitive learning and eliminates the need for
+negative data construction. To enhance both the learning of compositional
+features and feature space partitioning, a channel-wise feature separator and
+extractor block is proposed that complements the competitive learning process.
+Our method outperforms recent FF-based models on image classification tasks,
+achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST,
+Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the
+performance gap between FF learning and BP methods, indicating the potential of
+our proposed approach to learn useful representations in a layer-wise modular
+fashion, enabling more efficient and flexible learning.
+
+
+
+ comment: To be published in AAAI 2024, 11 pages, 7 figures
+
+
+
+
+
+
+
+ Bumsoo Kim, Taeho Choi, Jaewoo Kang, Hyunwoo J. Kim
+
+
+ Recent advances in deep neural networks have achieved significant progress in
+detecting individual objects from an image. However, object detection is not
+sufficient to fully understand a visual scene. Towards a deeper visual
+understanding, the interactions between objects, especially humans and objects
+are essential. Most prior works have obtained this information with a bottom-up
+approach, where the objects are first detected and the interactions are
+predicted sequentially by pairing the objects. This is a major bottleneck in
+HOI detection inference time. To tackle this problem, we propose UnionDet, a
+one-stage meta-architecture for HOI detection powered by a novel union-level
+detector that eliminates this additional inference stage by directly capturing
+the region of interaction. Our one-stage detector for human-object interaction
+shows a significant reduction in interaction prediction time 4x~14x while
+outperforming state-of-the-art methods on two public datasets: V-COCO and
+HICO-DET.
+
+
+
+
+
+
+
+
+ Bumsoo Kim, Yeonsik Jo, Jinhyung Kim, Seung Hwan Kim
+
+
+ Contrastive Language-Image Pretraining has emerged as a prominent approach
+for training vision and text encoders with uncurated image-text pairs from the
+web. To enhance data-efficiency, recent efforts have introduced additional
+supervision terms that involve random-augmented views of the image. However,
+since the image augmentation process is unaware of its text counterpart, this
+procedure could cause various degrees of image-text misalignments during
+training. Prior methods either disregarded this discrepancy or introduced
+external models to mitigate the impact of misalignments during training. In
+contrast, we propose a novel metric learning approach that capitalizes on these
+misalignments as an additional training source, which we term "Misalign,
+Contrast then Distill (MCD)". Unlike previous methods that treat augmented
+images and their text counterparts as simple positive pairs, MCD predicts the
+continuous scales of misalignment caused by the augmentation. Our extensive
+experimental results show that our proposed MCD achieves state-of-the-art
+transferability in multiple classification and retrieval downstream datasets.
+
+
+
+
+
+
+
+
+ Bumsoo Kim, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim
+
+
+ Recent advances in vision language pretraining (VLP) have been largely
+attributed to the large-scale data collected from the web. However, uncurated
+dataset contains weakly correlated image-text pairs, causing data inefficiency.
+To address the issue, knowledge distillation have been explored at the expense
+of extra image and text momentum encoders to generate teaching signals for
+misaligned image-text pairs. In this paper, our goal is to resolve the
+misalignment problem with an efficient distillation framework. To this end, we
+propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with
+Self-distilled Encoders. ECLIPSE features a distinctive distillation
+architecture wherein a shared text encoder is utilized between an online image
+encoder and a momentum image encoder. This strategic design choice enables the
+distillation to operate within a unified projected space of text embedding,
+resulting in better performance. Based on the unified text embedding space,
+ECLIPSE compensates for the additional computational cost of the momentum image
+encoder by expediting the online image encoder. Through our extensive
+experiments, we validate that there is a sweet spot between expedition and
+distillation where the partial view from the expedited online image encoder
+interacts complementarily with the momentum teacher. As a result, ECLIPSE
+outperforms its counterparts while achieving substantial acceleration in
+inference speed.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ Diagnosis Of Takotsubo Syndrome By Robust Feature Selection From The
+ Complex Latent Space Of DL-based Segmentation Network
+
+
+
+
+
+
+
+
+ Fahim Ahmed Zaman, Wahidul Alam, Tarun Kanti Roy, Amanda Chang, Kan Liu, Xiaodong Wu
+
+
+ Researchers have shown significant correlations among segmented objects in
+various medical imaging modalities and disease related pathologies. Several
+studies showed that using hand crafted features for disease prediction neglects
+the immense possibility to use latent features from deep learning (DL) models
+which may reduce the overall accuracy of differential diagnosis. However,
+directly using classification or segmentation models on medical to learn latent
+features opt out robust feature selection and may lead to overfitting. To fill
+this gap, we propose a novel feature selection technique using the latent space
+of a segmentation model that can aid diagnosis. We evaluated our method in
+differentiating a rare cardiac disease: Takotsubo Syndrome (TTS) from the ST
+elevation myocardial infarction (STEMI) using echocardiogram videos (echo). TTS
+can mimic clinical features of STEMI in echo and extremely hard to distinguish.
+Our approach shows promising results in differential diagnosis of TTS with 82%
+diagnosis accuracy beating the previous state-of-the-art (SOTA) approach.
+Moreover, the robust feature selection technique using LASSO algorithm shows
+great potential in reducing the redundant features and creates a robust
+pipeline for short- and long-term disease prognoses in the downstream analysis.
+
+
+
+ comment: 5 pages, 3 figures, conference
+
+
+
+
+
+
+ ☆ Surf-CDM: Score-Based Surface Cold-Diffusion Model For Medical Image
+ Segmentation
+
+
+
+
+
+
+
+
+ Fahim Ahmed Zaman, Mathews Jacob, Amanda Chang, Kan Liu, Milan Sonka, Xiaodong Wu
+
+
+ Diffusion models have shown impressive performance for image generation,
+often times outperforming other generative models. Since their introduction,
+researchers have extended the powerful noise-to-image denoising pipeline to
+discriminative tasks, including image segmentation. In this work we propose a
+conditional score-based generative modeling framework for medical image
+segmentation which relies on a parametric surface representation for the
+segmentation masks. The surface re-parameterization allows the direct
+application of standard diffusion theory, as opposed to when the mask is
+represented as a binary mask. Moreover, we adapted an extended variant of the
+diffusion technique known as the "cold-diffusion" where the diffusion model can
+be constructed with deterministic perturbations instead of Gaussian noise,
+which facilitates significantly faster convergence in the reverse diffusion. We
+evaluated our method on the segmentation of the left ventricle from 65
+transthoracic echocardiogram videos (2230 echo image frames) and compared its
+performance to the most popular and widely used image segmentation models. Our
+proposed model not only outperformed the compared methods in terms of
+segmentation accuracy, but also showed potential in estimating segmentation
+uncertainties for further downstream analyses due to its inherent generative
+nature.
+
+
+
+ comment: 5 pages, 5 figures, conference
+
+
+
+
+
+
+ ☆ IS-DARTS: Stabilizing DARTS through Precise Measurement on Candidate
+ Importance AAAI2024
+
+
+ Among existing Neural Architecture Search methods, DARTS is known for its
+efficiency and simplicity. This approach applies continuous relaxation of
+network representation to construct a weight-sharing supernet and enables the
+identification of excellent subnets in just a few GPU days. However,
+performance collapse in DARTS results in deteriorating architectures filled
+with parameter-free operations and remains a great challenge to the robustness.
+To resolve this problem, we reveal that the fundamental reason is the biased
+estimation of the candidate importance in the search space through theoretical
+and experimental analysis, and more precisely select operations via
+information-based measurements. Furthermore, we demonstrate that the excessive
+concern over the supernet and inefficient utilization of data in bi-level
+optimization also account for suboptimal results. We adopt a more realistic
+objective focusing on the performance of subnets and simplify it with the help
+of the information-based measurements. Finally, we explain theoretically why
+progressively shrinking the width of the supernet is necessary and reduce the
+approximation error of optimal weights in DARTS. Our proposed method, named
+IS-DARTS, comprehensively improves DARTS and resolves the aforementioned
+problems. Extensive experiments on NAS-Bench-201 and DARTS-based search space
+demonstrate the effectiveness of IS-DARTS.
+
+
+
+ comment: accepted by AAAI2024, paper + supplementary, 11 pages
+
+ In this work, we present a novel self-supervised method for Low Dose Computed
+Tomography (LDCT) reconstruction. Reducing the radiation dose to patients
+during a CT scan is a crucial challenge since the quality of the reconstruction
+highly degrades because of low photons or limited measurements. Supervised deep
+learning methods have shown the ability to remove noise in images but require
+accurate ground truth which can be obtained only by performing additional
+high-radiation CT scans. Therefore, we propose a novel self-supervised
+framework for LDCT, in which ground truth is not required for training the
+convolutional neural network (CNN). Based on the Noise2Inverse (N2I) method, we
+enforce in the training loss the equivariant property of rotation
+transformation, which is induced by the CT imaging system, to improve the
+quality of the CT image in a lower dose. Numerical and experimental results
+show that the reconstruction accuracy of N2I with sparse views is degrading
+while the proposed rotational augmented Noise2Inverse (RAN2I) method keeps
+better image quality over a different range of sampling angles. Finally, the
+quantitative results demonstrate that RAN2I achieves higher image quality
+compared to N2I, and experimental results of RAN2I on real projection data show
+comparable performance to supervised learning.
+
+
+
+ comment: 14 pages, 12 figures, accepted manuscript in IEEE Transactions on
+ Radiation and Plasma Medical Sciences
+
+
+
+
+
+
+ ☆ RealCraft: Attention Control as A Solution for Zero-shot Long Video
+ Editing
+
+
+ Although large-scale text-to-image generative models have shown promising
+performance in synthesizing high-quality images, directly applying these models
+to image editing remains a significant challenge. This challenge is further
+amplified in video editing due to the additional dimension of time. Especially
+for editing real videos as it necessitates maintaining a stable semantic layout
+across the frames while executing localized edits precisely without disrupting
+the existing backgrounds. In this paper, we propose \textit{RealCraft}, an
+attention-control-based method for zero-shot editing in real videos. By
+employing the object-centric manipulation of cross-attention between prompts
+and frames and spatial-temporal attention within the frames, we achieve precise
+shape-wise editing along with enhanced consistency. Our model can be used
+directly with Stable Diffusion and operates without the need for additional
+localized information. We showcase our zero-shot attention-control-based method
+across a range of videos, demonstrating localized, high-fidelity, shape-precise
+and time-consistent editing in videos of various lengths, up to 64 frames.
+
+
+
+
+
+
+
+ ☆ MotionScript: Natural Language Descriptions for Expressive 3D Human
+ Motions
+
+
+
+
+
+
+
+
+ Payam Jome Yazdian, Eric Liu, Li Cheng, Angelica Lim
+
+
+ This paper proposes MotionScript, a motion-to-text conversion algorithm and
+natural language representation for human body motions. MotionScript aims to
+describe movements in greater detail and with more accuracy than previous
+natural language approaches. Many motion datasets describe relatively objective
+and simple actions with little variation on the way they are expressed (e.g.
+sitting, walking, dribbling a ball). But for expressive actions that contain a
+diversity of movements in the class (e.g. being sad, dancing), or for actions
+outside the domain of standard motion capture datasets (e.g. stylistic walking,
+sign-language), more specific and granular natural language descriptions are
+needed. Our proposed MotionScript descriptions differ from existing natural
+language representations in that it provides direct descriptions in natural
+language instead of simple action labels or high-level human captions. To the
+best of our knowledge, this is the first attempt at translating 3D motions to
+natural language descriptions without requiring training data. Our experiments
+show that when MotionScript representations are used in a text-to-motion neural
+task, body movements are more accurately reconstructed, and large language
+models can be used to generate unseen complex motions.
+
+
+
+
+
+
+
+ ☆ Hierarchical Vision Transformers for Context-Aware Prostate Cancer
+ Grading in Whole Slide Images NeurIPS 2023
+
+
+
+
+
+
+
+
+ Clément Grisi, Geert Litjens, Jeroen van der Laak
+
+
+ Vision Transformers (ViTs) have ushered in a new era in computer vision,
+showcasing unparalleled performance in many challenging tasks. However, their
+practical deployment in computational pathology has largely been constrained by
+the sheer size of whole slide images (WSIs), which result in lengthy input
+sequences. Transformers faced a similar limitation when applied to long
+documents, and Hierarchical Transformers were introduced to circumvent it.
+Given the analogous challenge with WSIs and their inherent hierarchical
+structure, Hierarchical Vision Transformers (H-ViTs) emerge as a promising
+solution in computational pathology. This work delves into the capabilities of
+H-ViTs, evaluating their efficiency for prostate cancer grading in WSIs. Our
+results show that they achieve competitive performance against existing
+state-of-the-art solutions.
+
+
+
+ comment: Accepted at Medical Imaging meets NeurIPS 2023 workshop
+
+
+
+
+
+
+
+ B. A. Schreiber, J. Denholm, F. Jaeckle, M. J. Arends, K. M. Branson, C. -B. Schönlieb, E. J. Soilleux
+
+
+ We present an innovative method for rapidly segmenting hematoxylin and eosin
+(H&E)-stained tissue in whole-slide images (WSIs) that eliminates a wide range
+of undesirable artefacts such as pen marks and scanning artefacts. Our method
+involves taking a single-channel representation of a lowmagnification RGB
+overview of the WSI in which the pixel values are bimodally distributed such
+that H&E-stained tissue is easily distinguished from both background and a wide
+variety of artefacts. We demonstrate our method on 30 WSIs prepared from a wide
+range of institutions and WSI digital scanners, each containing substantial
+artefacts, and compare it to segmentations provided by Otsu thresholding and
+Histolab tissue segmentation and pen filtering tools. We found that our method
+segmented the tissue and fully removed all artefacts in 29 out of 30 WSIs,
+whereas Otsu thresholding failed to remove any artefacts, and the Histolab pen
+filtering tools only partially removed the pen marks. The beauty of our
+approach lies in its simplicity: manipulating RGB colour space and using Otsu
+thresholding allows for the segmentation of H&E-stained tissue and the rapid
+removal of artefacts without the need for machine learning or parameter tuning.
+
+
+
+ comment: 7 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ Augmentation-Aware Self-Supervision for Data-Efficient GAN Training NeurIPS 2023
+
+
+ Training generative adversarial networks (GANs) with limited data is
+challenging because the discriminator is prone to overfitting. Previously
+proposed differentiable augmentation demonstrates improved data efficiency of
+training GANs. However, the augmentation implicitly introduces undesired
+invariance to augmentation for the discriminator since it ignores the change of
+semantics in the label space caused by data transformation, which may limit the
+representation learning ability of the discriminator and ultimately affect the
+generative modeling performance of the generator. To mitigate the negative
+impact of invariance while inheriting the benefits of data augmentation, we
+propose a novel augmentation-aware self-supervised discriminator that predicts
+the augmentation parameter of the augmented data. Particularly, the prediction
+targets of real data and generated data are required to be distinguished since
+they are different during training. We further encourage the generator to
+adversarially learn from the self-supervised discriminator by generating
+augmentation-predictable real and not fake data. This formulation connects the
+learning objective of the generator and the arithmetic $-$ harmonic mean
+divergence under certain assumptions. We compare our method with
+state-of-the-art (SOTA) methods using the class-conditional BigGAN and
+unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100,
+FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate
+significant improvements of our method over SOTA methods in training
+data-efficient GANs.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Learning from Mistakes: Self-Regularizing Hierarchical Representations
+ in Point Cloud Semantic Segmentation
+
+
+ Recent advances in autonomous robotic technologies have highlighted the
+growing need for precise environmental analysis. LiDAR semantic segmentation
+has gained attention to accomplish fine-grained scene understanding by acting
+directly on raw content provided by sensors. Recent solutions showed how
+different learning techniques can be used to improve the performance of the
+model, without any architectural or dataset change. Following this trend, we
+present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK)
+derived from a standard model. First, classes are clustered into macro groups
+according to mutual prediction errors; then, the learning process is
+regularized by: (1) aligning class-conditional prototypical feature
+representation for both fine and coarse classes, (2) weighting instances with a
+per-class fairness index. Our LEAK approach is very general and can be
+seamlessly applied on top of any segmentation architecture; indeed,
+experimental results showed that it enables state-of-the-art performances on
+different architectures, datasets and tasks, while ensuring more balanced
+class-wise results and faster convergence.
+
+
+
+
+
+
+
+ ♻ ☆ MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete
+ Representations
+
+
+ In this work, we present MoConVQ, a novel unified framework for physics-based
+motion control leveraging scalable discrete representations. Building upon
+vector quantized variational autoencoders (VQ-VAE) and model-based
+reinforcement learning, our approach effectively learns motion embeddings from
+a large, unstructured dataset spanning tens of hours of motion examples. The
+resultant motion representation not only captures diverse motion skills but
+also offers a robust and intuitive interface for various applications. We
+demonstrate the versatility of MoConVQ through several applications: universal
+tracking control from various motion sources, interactive character control
+with latent motion representations using supervised learning, physics-based
+motion generation from natural language descriptions using the GPT framework,
+and, most interestingly, seamless integration with large language models (LLMs)
+with in-context learning to tackle complex and abstract tasks.
+
+
+
+ comment: Project page: MoConVQ.github.io
+
+
+
+
+
+
+ ♻ ☆ Vertical Federated Alzheimer's Detection on Multimodal Data
+
+
+ In the era of rapidly advancing medical technologies, the segmentation of
+medical data has become inevitable, necessitating the development of privacy
+preserving machine learning algorithms that can train on distributed data.
+Consolidating sensitive medical data is not always an option particularly due
+to the stringent privacy regulations imposed by the Health Insurance
+Portability and Accountability Act (HIPAA). In this paper, we introduce a HIPAA
+compliant framework that can train from distributed data. We then propose a
+multimodal vertical federated model for Alzheimer's Disease (AD) detection, a
+serious neurodegenerative condition that can cause dementia, severely impairing
+brain function and hindering simple tasks, especially without preventative
+care. This vertical federated model offers a distributed architecture that
+enables collaborative learning across diverse sources of medical data while
+respecting privacy constraints imposed by HIPAA. It is also able to leverage
+multiple modalities of data, enhancing the robustness and accuracy of AD
+detection. Our proposed model not only contributes to the advancement of
+federated learning techniques but also holds promise for overcoming the hurdles
+posed by data segmentation in medical research. By using vertical federated
+learning, this research strives to provide a framework that enables healthcare
+institutions to harness the collective intelligence embedded in their
+distributed datasets without compromising patient privacy.
+
+
+
+
+
+
+
+
+ Dennis Hein, Staffan Holmin, Timothy Szczykutowicz, Jonathan S Maltz, Mats Danielsson, Ge Wang, Mats Persson
+
+
+ Diffusion and Poisson flow models have shown impressive performance in a wide
+range of generative tasks, including low-dose CT image denoising. However, one
+limitation in general, and for clinical applications in particular, is slow
+sampling. Due to their iterative nature, the number of function evaluations
+(NFE) required is usually on the order of $10-10^3$, both for conditional and
+unconditional generation. In this paper, we present posterior sampling Poisson
+flow generative models (PPFM), a novel image denoising technique for low-dose
+and photon-counting CT that produces excellent image quality whilst keeping
+NFE=1. Updating the training and sampling processes of Poisson flow generative
+models (PFGM)++, we learn a conditional generator which defines a trajectory
+between the prior noise distribution and the posterior distribution of
+interest. We additionally hijack and regularize the sampling process to achieve
+NFE=1. Our results shed light on the benefits of the PFGM++ framework compared
+to diffusion models. In addition, PPFM is shown to perform favorably compared
+to current state-of-the-art diffusion-style models with NFE=1, consistency
+models, as well as popular deep learning and non-deep learning-based image
+denoising techniques, on clinical low-dose CT images and clinical images from a
+prototype photon-counting CT system.
+
+
+
+
+
+
+
+ ♻ ☆ Poincaré ResNet
+
+
+
+
+
+
+
+
+ Max van Spengler, Erwin Berkhout, Pascal Mettes
+
+
+ This paper introduces an end-to-end residual network that operates entirely
+on the Poincar\'e ball model of hyperbolic space. Hyperbolic learning has
+recently shown great potential for visual understanding, but is currently only
+performed in the penultimate layer(s) of deep networks. All visual
+representations are still learned through standard Euclidean networks. In this
+paper we investigate how to learn hyperbolic representations of visual data
+directly from the pixel-level. We propose Poincar\'e ResNet, a hyperbolic
+counterpart of the celebrated residual network, starting from Poincar\'e 2D
+convolutions up to Poincar\'e residual connections. We identify three
+roadblocks for training convolutional networks entirely in hyperbolic space and
+propose a solution for each: (i) Current hyperbolic network initializations
+collapse to the origin, limiting their applicability in deeper networks. We
+provide an identity-based initialization that preserves norms over many layers.
+(ii) Residual networks rely heavily on batch normalization, which comes with
+expensive Fr\'echet mean calculations in hyperbolic space. We introduce
+Poincar\'e midpoint batch normalization as a faster and equally effective
+alternative. (iii) Due to the many intermediate operations in Poincar\'e
+layers, we lastly find that the computation graphs of deep learning libraries
+blow up, limiting our ability to train on deep hyperbolic networks. We provide
+manual backward derivations of core hyperbolic operations to maintain
+manageable computation graphs.
+
+
+
+ comment: International Conference on Computer Vision 2023
+
+
+
+
+
+
+ ♻ ☆ Color-NeuS: Reconstructing Neural Implicit Surfaces with Color
+
+
+
+
+
+
+
+
+ Licheng Zhong, Lixin Yang, Kailin Li, Haoyu Zhen, Mei Han, Cewu Lu
+
+
+ The reconstruction of object surfaces from multi-view images or monocular
+video is a fundamental issue in computer vision. However, much of the recent
+research concentrates on reconstructing geometry through implicit or explicit
+methods. In this paper, we shift our focus towards reconstructing mesh in
+conjunction with color. We remove the view-dependent color from neural volume
+rendering while retaining volume rendering performance through a relighting
+network. Mesh is extracted from the signed distance function (SDF) network for
+the surface, and color for each surface vertex is drawn from the global color
+network. To evaluate our approach, we conceived a in hand object scanning task
+featuring numerous occlusions and dramatic shifts in lighting conditions. We've
+gathered several videos for this task, and the results surpass those of any
+existing methods capable of reconstructing mesh alongside color. Additionally,
+our method's performance was assessed using public datasets, including DTU,
+BlendedMVS, and OmniObject3D. The results indicated that our method performs
+well across all these datasets. Project page:
+https://colmar-zlicheng.github.io/color_neus.
+
+
+ Biometrics is indispensable in this modern digital era for secure automated
+human authentication in various fields of machine learning and pattern
+recognition. Hand geometry is a promising physiological biometric trait with
+ample deployed application areas for identity verification. Due to the
+intricate anatomic foundation of the thumb and substantial inter-finger posture
+variation, satisfactory performances cannot be achieved while the thumb is
+included in the contact-free environment. To overcome the hindrances associated
+with the thumb, four finger-based (excluding the thumb) biometric approaches
+have been devised. In this chapter, a four-finger based biometric method has
+been presented. Again, selection of salient features is essential to reduce the
+feature dimensionality by eliminating the insignificant features. Weights are
+assigned according to the discriminative efficiency of the features to
+emphasize on the essential features. Two different strategies namely, the
+global and local feature selection methods are adopted based on the adaptive
+forward-selection and backward-elimination (FoBa) algorithm. The identification
+performances are evaluated using the weighted k-nearest neighbor (wk-NN) and
+random forest (RF) classifiers. The experiments are conducted using the
+selected feature subsets over the 300 subjects of the Bosphorus hand database.
+The best identification accuracy of 98.67%, and equal error rate (EER) of 4.6%
+have been achieved using the subset of 25 features which are selected by the
+rank-based local FoBa algorithm.
+
+
+
+ comment: 34 pages. The Biometric Computing: Recognition and Registration, 2019
+
+
+
+
+
+
+ ♻ ☆ Image Captioning with Multi-Context Synthetic Data AAAI 2024
+
+
+ Image captioning requires numerous annotated image-text pairs, resulting in
+substantial annotation costs. Recently, large models (e.g. diffusion models and
+large language models) have excelled in producing high-quality images and text.
+This potential can be harnessed to create synthetic image-text pairs for
+training captioning models. Synthetic data can improve cost and time efficiency
+in data collection, allow for customization to specific domains, bootstrap
+generalization capability for zero-shot performance, and circumvent privacy
+concerns associated with real-world data. However, existing methods struggle to
+attain satisfactory performance solely through synthetic data. We identify the
+issue as generated images from simple descriptions mostly capture a solitary
+perspective with limited context, failing to align with the intricate scenes
+prevalent in real-world imagery. To tackle this, we present an innovative
+pipeline that introduces multi-context data generation. Beginning with an
+initial text corpus, our approach employs a large language model to extract
+multiple sentences portraying the same scene from diverse viewpoints. These
+sentences are then condensed into a single sentence with multiple contexts.
+Subsequently, we generate intricate images using the condensed captions through
+diffusion models. Our model is exclusively trained on synthetic image-text
+pairs crafted through this process. The effectiveness of our pipeline is
+validated through experimental results in both the in-domain and cross-domain
+settings, where it achieves state-of-the-art performance on well-known datasets
+such as MSCOCO, Flickr30k, and NoCaps.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ SEPT: Towards Efficient Scene Representation Learning for Motion
+ Prediction
+
+
+
+
+
+
+
+
+ Zhiqian Lan, Yuxuan Jiang, Yao Mu, Chen Chen, Shengbo Eben Li
+
+
+ Motion prediction is crucial for autonomous vehicles to operate safely in
+complex traffic environments. Extracting effective spatiotemporal relationships
+among traffic elements is key to accurate forecasting. Inspired by the
+successful practice of pretrained large language models, this paper presents
+SEPT, a modeling framework that leverages self-supervised learning to develop
+powerful spatiotemporal understanding for complex traffic scenes. Specifically,
+our approach involves three masking-reconstruction modeling tasks on scene
+inputs including agents' trajectories and road network, pretraining the scene
+encoder to capture kinematics within trajectory, spatial structure of road
+network, and interactions among roads and agents. The pretrained encoder is
+then finetuned on the downstream forecasting task. Extensive experiments
+demonstrate that SEPT, without elaborate architectural design or manual feature
+engineering, achieves state-of-the-art performance on the Argoverse 1 and
+Argoverse 2 motion forecasting benchmarks, outperforming previous methods on
+all main metrics by a large margin.
+
+
+
+
+
+
+
+ ♻ ☆ Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions? CVPR 2023
+
+
+
+
+
+
+
+
+ Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason
+
+
+ Data augmentation via back-translation is common when pretraining
+Vision-and-Language Navigation (VLN) models, even though the generated
+instructions are noisy. But: does that noise matter? We find that nonsensical
+or irrelevant language instructions during pretraining can have little effect
+on downstream performance for both HAMT and VLN-BERT on R2R, and is still
+better than only using clean, human data. To underscore these results, we
+concoct an efficient augmentation method, Unigram + Object, which generates
+nonsensical instructions that nonetheless improve downstream performance. Our
+findings suggest that what matters for VLN R2R pretraining is the quantity of
+visual trajectories, not the quality of instructions.
+
+
+
+ comment: Accepted by O-DRUM @ CVPR 2023
+
+
+
+
+
+
+ ♻ ☆ VidToMe: Video Token Merging for Zero-Shot Video Editing
+
+
+ Diffusion models have made significant advances in generating high-quality
+images, but their application to video generation has remained challenging due
+to the complexity of temporal motion. Zero-shot video editing offers a solution
+by utilizing pre-trained image diffusion models to translate source videos into
+new ones. Nevertheless, existing methods struggle to maintain strict temporal
+consistency and efficient memory consumption. In this work, we propose a novel
+approach to enhance temporal consistency in generated videos by merging
+self-attention tokens across frames. By aligning and compressing temporally
+redundant tokens across frames, our method improves temporal coherence and
+reduces memory consumption in self-attention computations. The merging strategy
+matches and aligns tokens according to the temporal correspondence between
+frames, facilitating natural temporal consistency in generated video frames. To
+manage the complexity of video processing, we divide videos into chunks and
+develop intra-chunk local token merging and inter-chunk global token merging,
+ensuring both short-term video continuity and long-term content consistency.
+Our video editing approach seamlessly extends the advancements in image editing
+to video editing, rendering favorable results in temporal consistency over
+state-of-the-art methods.
+
+
+ For image restoration, methods leveraging priors from generative models have
+been proposed and demonstrated a promising capacity to robustly restore
+photorealistic and high-quality results. However, these methods are susceptible
+to semantic ambiguity, particularly with images that have obviously correct
+semantics such as facial images. In this paper, we propose a semantic-aware
+latent space exploration method for image restoration (SAIR). By explicitly
+modeling semantics information from a given reference image, SAIR is able to
+reliably restore severely degraded images not only to high-resolution and
+highly realistic looks but also to correct semantics. Quantitative and
+qualitative experiments collectively demonstrate the superior performance of
+the proposed SAIR. Our code is available at https://github.com/Liamkuo/SAIR.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Mind the Gap: Federated Learning Broadens Domain Generalization in
+ Diagnostic AI Models
+
+
+
+
+
+
+
+
+ Soroosh Tayebi Arasteh, Christiane Kuhl, Marwin-Jonathan Saehn, Peter Isfort, Daniel Truhn, Sven Nebelung
+
+
+ Developing robust artificial intelligence (AI) models that generalize well to
+unseen datasets is challenging and usually requires large and variable
+datasets, preferably from multiple institutions. In federated learning (FL), a
+model is trained collaboratively at numerous sites that hold local datasets
+without exchanging them. So far, the impact of training strategy, i.e., local
+versus collaborative, on the diagnostic on-domain and off-domain performance of
+AI models interpreting chest radiographs has not been assessed. Consequently,
+using 610,000 chest radiographs from five institutions across the globe, we
+assessed diagnostic performance as a function of training strategy (i.e., local
+vs. collaborative), network architecture (i.e., convolutional vs.
+transformer-based), generalization performance (i.e., on-domain vs.
+off-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia,
+atelectasis, consolidation, pneumothorax, and no abnormality), dataset size
+(i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large
+datasets not only showed minimal performance gains with FL but, in some
+instances, even exhibited decreases. In contrast, smaller datasets revealed
+marked improvements. Thus, on-domain performance was mainly driven by training
+data size. However, off-domain performance leaned more on training diversity.
+When trained collaboratively across diverse external institutions, AI models
+consistently surpassed models trained locally for off-domain tasks, emphasizing
+FL's potential in leveraging data diversity. In conclusion, FL can bolster
+diagnostic privacy, reproducibility, and off-domain reliability of AI models
+and, potentially, optimize healthcare outcomes.
+
+
+
+ comment: Published in Nature Scientific Reports
+
+ Machine learning (ML) models trained on data from potentially untrusted
+sources are vulnerable to poisoning. A small, maliciously crafted subset of the
+training inputs can cause the model to learn a "backdoor" task (e.g.,
+misclassify inputs with a certain feature) in addition to its main task. Recent
+research proposed many hypothetical backdoor attacks whose efficacy heavily
+depends on the configuration and training hyperparameters of the target model.
+ Given the variety of potential backdoor attacks, ML engineers who are not
+security experts have no way to measure how vulnerable their current training
+pipelines are, nor do they have a practical way to compare training
+configurations so as to pick the more resistant ones. Deploying a defense
+requires evaluating and choosing from among dozens of research papers and
+re-engineering the training pipeline.
+ In this paper, we aim to provide ML engineers with pragmatic tools to audit
+the backdoor resistance of their training pipelines and to compare different
+training configurations, to help choose one that best balances accuracy and
+security.
+ First, we propose a universal, attack-agnostic resistance metric based on the
+minimum number of training inputs that must be compromised before the model
+learns any backdoor.
+ Second, we design, implement, and evaluate Mithridates a multi-stage approach
+that integrates backdoor resistance into the training-configuration search. ML
+developers already rely on hyperparameter search to find configurations that
+maximize the model's accuracy. Mithridates extends this standard tool to
+balance accuracy and resistance without disruptive changes to the training
+pipeline. We show that hyperparameters found by Mithridates increase resistance
+to multiple types of backdoor attacks by 3-5x with only a slight impact on
+accuracy. We also discuss extensions to AutoML and federated learning.
+
+
+
+
+
+
+
+ ♻ ☆ COSMOS: Cross-Modality Unsupervised Domain Adaptation for 3D Medical
+ Image Segmentation based on Target-aware Domain Translation and Iterative
+ Self-Training MICCAI 2021
+
+
+
+
+
+
+
+
+ Hyungseob Shin, Hyeongyu Kim, Sewon Kim, Yohan Jun, Taejoon Eo, Dosik Hwang
+
+
+ Recent advances in deep learning-based medical image segmentation studies
+achieve nearly human-level performance when in fully supervised condition.
+However, acquiring pixel-level expert annotations is extremely expensive and
+laborious in medical imaging fields. Unsupervised domain adaptation can
+alleviate this problem, which makes it possible to use annotated data in one
+imaging modality to train a network that can successfully perform segmentation
+on target imaging modality with no labels. In this work, we propose a
+self-training based unsupervised domain adaptation framework for 3D medical
+image segmentation named COSMOS and validate it with automatic segmentation of
+Vestibular Schwannoma (VS) and cochlea on high-resolution T2 Magnetic Resonance
+Images (MRI). Our target-aware contrast conversion network translates source
+domain annotated T1 MRI to pseudo T2 MRI to enable segmentation training on
+target domain, while preserving important anatomical features of interest in
+the converted images. Iterative self-training is followed to incorporate
+unlabeled data to training and incrementally improve the quality of
+pseudo-labels, thereby leading to improved performance of segmentation. COSMOS
+won the 1\textsuperscript{st} place in the Cross-Modality Domain Adaptation
+(crossMoDA) challenge held in conjunction with the 24th International
+Conference on Medical Image Computing and Computer Assisted Intervention
+(MICCAI 2021). It achieves mean Dice score and Average Symmetric Surface
+Distance of 0.871(0.063) and 0.437(0.270) for VS, and 0.842(0.020) and
+0.152(0.030) for cochlea.
+
+
+ Uveitis demands the precise diagnosis of anterior chamber inflammation (ACI)
+for optimal treatment. However, current diagnostic methods only rely on a
+limited single-modal disease perspective, which leads to poor performance. In
+this paper, we investigate a promising yet challenging way to fuse multimodal
+data for ACI diagnosis. Notably, existing fusion paradigms focus on empowering
+implicit modality interactions (i.e., self-attention and its variants), but
+neglect to inject explicit modality interactions, especially from clinical
+knowledge and imaging property. To this end, we propose a jointly Explicit and
+implicit Cross-Modal Interaction Network (EiCI-Net) for Anterior Chamber
+Inflammation Diagnosis that uses anterior segment optical coherence tomography
+(AS-OCT) images, slit-lamp images, and clinical data jointly. Specifically, we
+first develop CNN-Based Encoders and Tabular Processing Module (TPM) to extract
+efficient feature representations in different modalities. Then, we devise an
+Explicit Cross-Modal Interaction Module (ECIM) to generate attention maps as a
+kind of explicit clinical knowledge based on the tabular feature maps, then
+integrated them into the slit-lamp feature maps, allowing the CNN-Based Encoder
+to focus on more effective informativeness of the slit-lamp images. After that,
+the Implicit Cross-Modal Interaction Module (ICIM), a transformer-based
+network, further implicitly enhances modality interactions. Finally, we
+construct a considerable real-world dataset from our collaborative hospital and
+conduct sufficient experiments to demonstrate the superior performance of our
+proposed EiCI-Net compared with the state-of-the-art classification methods in
+various metrics.
+
+
+
+
+
+
+
+ ♻ ☆ Embedded Feature Similarity Optimization with Specific Parameter
+ Initialization for 2D/3D Medical Image Registration ICASSP 2024
+
+
+ We present a novel deep learning-based framework: Embedded Feature Similarity
+Optimization with Specific Parameter Initialization (SOPI) for 2D/3D medical
+image registration which is a most challenging problem due to the difficulty
+such as dimensional mismatch, heavy computation load and lack of golden
+evaluation standard. The framework we design includes a parameter specification
+module to efficiently choose initialization pose parameter and a
+fine-registration module to align images. The proposed framework takes
+extracting multi-scale features into consideration using a novel composite
+connection encoder with special training techniques. We compare the method with
+both learning-based methods and optimization-based methods on a in-house
+CT/X-ray dataset as well as simulated data to further evaluate performance. Our
+experiments demonstrate that the method in this paper has improved the
+registration performance, and thereby outperforms the existing methods in terms
+of accuracy and running time. We also show the potential of the proposed method
+as an initial pose estimator. The code is available at
+https://github.com/m1nhengChen/SOPI
+
+
+ The study explores the synergistic combination of Synthetic Aperture Radar
+(SAR) and Visible-Near Infrared-Short Wave Infrared (VNIR-SWIR) imageries for
+land use/land cover (LULC) classification. Image fusion, employing Bayesian
+fusion, merges SAR texture bands with VNIR-SWIR imageries. The research aims to
+investigate the impact of this fusion on LULC classification. Despite the
+popularity of random forests for supervised classification, their limitations,
+such as suboptimal performance with fewer features and accuracy stagnation, are
+addressed. To overcome these issues, ensembles of random forests (RFE) are
+created, introducing random rotations using the Forest-RC algorithm. Three
+rotation approaches: principal component analysis (PCA), sparse random rotation
+(SRP) matrix, and complete random rotation (CRP) matrix are employed.
+Sentinel-1 SAR data and Sentinel-2 VNIR-SWIR data from the IIT-Kanpur region
+constitute the training datasets, including SAR, SAR with texture, VNIR-SWIR,
+VNIR-SWIR with texture, and fused VNIR-SWIR with texture. The study evaluates
+classifier efficacy, explores the impact of SAR and VNIR-SWIR fusion on
+classification, and significantly enhances the execution speed of Bayesian
+fusion code. The SRP-based RFE outperforms other ensembles for the first two
+datasets, yielding average overall kappa values of 61.80% and 68.18%, while the
+CRP-based RFE excels for the last three datasets with average overall kappa
+values of 95.99%, 96.93%, and 96.30%. The fourth dataset achieves the highest
+overall kappa of 96.93%. Furthermore, incorporating texture with SAR bands
+results in a maximum overall kappa increment of 10.00%, while adding texture to
+VNIR-SWIR bands yields a maximum increment of approximately 3.45%.
+
+
+
+ comment: Thesis for Master of Technology. Created: July 2018. Total pages 124
+
+
+
+
+
+
+ ♻ ☆ X2-Softmax: Margin Adaptive Loss Function for Face Recognition
+
+
+ Learning the discriminative features of different faces is an important task
+in face recognition. By extracting face features in neural networks, it becomes
+easy to measure the similarity of different face images, which makes face
+recognition possible. To enhance the neural network's face feature
+separability, incorporating an angular margin during training is common
+practice. State-of-the-art loss functions CosFace and ArcFace apply fixed
+margins between weights of classes to enhance the inter-class separation of
+face features. Since the distribution of samples in the training set is
+imbalanced, similarities between different identities are unequal. Therefore,
+using an inappropriately fixed angular margin may lead to the problem that the
+model is difficult to converge or the face features are not discriminative
+enough. It is more in line with our intuition that the margins are angular
+adaptive, which could increase with the angles between classes growing. In this
+paper, we propose a new angular margin loss named X2-Softmax. X2-Softmax loss
+has adaptive angular margins, which provide the margin that increases with the
+angle between different classes growing. The angular adaptive margin ensures
+model flexibility and effectively improves the effect of face recognition. We
+have trained the neural network with X2-Softmax loss on the MS1Mv3 dataset and
+tested it on several evaluation benchmarks to demonstrate the effectiveness and
+superiority of our loss function.
+
+
+ Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a
+given set of base categories on which the detection model is trained. Recent
+OVOD methods focus on adapting the image-level pre-trained vision-language
+models (VLMs), such as CLIP, to a region-level object detection task via, eg.,
+region-level knowledge distillation, regional prompt learning, or region-text
+pre-training, to expand the detection vocabulary. These methods have
+demonstrated remarkable performance in recognizing regional visual concepts,
+but they are weak in exploiting the VLMs' powerful global scene understanding
+ability learned from the billion-scale image-level text descriptions. This
+limits their capability in detecting hard objects of small, blurred, or
+occluded appearance from novel/base categories, whose detection heavily relies
+on contextual information. To address this, we propose a novel approach, namely
+Simple Image-level Classification for Context-Aware Detection Scoring
+(SIC-CADS), to leverage the superior global knowledge yielded from CLIP for
+complementing the current OVOD models from a global perspective. The core of
+SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the
+object co-occurrence-based contextual information from CLIP to recognize all
+possible object categories in the scene. These image-level MLR scores can then
+be utilized to refine the instance-level detection scores of the current OVOD
+models in detecting those hard objects. This is verified by extensive empirical
+results on two popular benchmarks, OV-LVIS and OV-COCO, which show that
+SIC-CADS achieves significant and consistent improvement when combined with
+different types of OVOD models. Further, SIC-CADS also improves the
+cross-dataset generalization ability on Objects365 and OpenImages. The code is
+available at https://github.com/mala-lab/SIC-CADS.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Image Captioners Are Scalable Vision Learners Too NeurIPS 2023
+
+
+
+
+
+
+
+
+ Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
+
+
+ Contrastive pretraining on image-text pairs from the web is one of the most
+popular large-scale pretraining strategies for vision backbones, especially in
+the context of large multimodal models. At the same time, image captioning on
+this type of data is commonly considered an inferior pretraining strategy. In
+this paper, we perform a fair comparison of these two pretraining strategies,
+carefully matching training data, compute, and model capacity. Using a standard
+encoder-decoder transformer, we find that captioning alone is surprisingly
+effective: on classification tasks, captioning produces vision encoders
+competitive with contrastively pretrained encoders, while surpassing them on
+vision & language tasks. We further analyze the effect of the model
+architecture and scale, as well as the pretraining data on the representation
+quality, and find that captioning exhibits the same or better scaling behavior
+along these axes. Overall our results show that plain image captioning is a
+more powerful pretraining strategy than was previously believed.
+
+
+
+ comment: Accepted at NeurIPS 2023. v2 adds SugarCrepe results and more
+ ablations, v3 has minor fixes. v4 adds a code link (
+ https://github.com/google-research/big_vision )
+
+
+
+
+
+
+ ♻ ☆ Keep the Faith: Faithful Explanations in Convolutional Neural Networks
+ for Case-Based Reasoning AAAI
+
+
+
+
+
+
+
+
+ Tom Nuno Wolf, Fabian Bongratz, Anne-Marie Rickmann, Sebastian Pölsterl, Christian Wachinger
+
+
+ Explaining predictions of black-box neural networks is crucial when applied
+to decision-critical tasks. Thus, attribution maps are commonly used to
+identify important image regions, despite prior work showing that humans prefer
+explanations based on similar examples. To this end, ProtoPNet learns a set of
+class-representative feature vectors (prototypes) for case-based reasoning.
+During inference, similarities of latent features to prototypes are linearly
+classified to form predictions and attribution maps are provided to explain the
+similarity. In this work, we evaluate whether architectures for case-based
+reasoning fulfill established axioms required for faithful explanations using
+the example of ProtoPNet. We show that such architectures allow the extraction
+of faithful explanations. However, we prove that the attribution maps used to
+explain the similarities violate the axioms. We propose a new procedure to
+extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually,
+these explanations are Shapley values, calculated on the similarity scores of
+each prototype. They allow to faithfully answer which prototypes are present in
+an unseen image and quantify each pixel's contribution to that presence,
+thereby complying with all axioms. The theoretical violations of ProtoPNet
+manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs,
+RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50,
+ResNeXt50). Our experiments show a qualitative difference between the
+explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the
+explanations with the Area Over the Perturbation Curve, on which ProtoPFaith
+outperforms ProtoPNet on all experiments by a factor $>10^3$.
+
+
+
+ comment: To be published in proceedings of AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+
+ Franz Thaler, Matthias A. F. Gsell, Gernot Plank, Martin Urschler
+
+
+ Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely
+established to assess the viability of myocardial tissue of patients after
+acute myocardial infarction (MI). We propose the Cascading Refinement CNN
+(CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that
+exploits the hierarchical structure of such labeled cardiac data. Throughout
+the three stages of the cascade, the label definition changes and CaRe-CNN
+learns to gradually refine its intermediate predictions accordingly.
+Furthermore, to obtain more consistent qualitative predictions, we propose a
+series of post-processing steps that take anatomical constraints into account.
+Our CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked
+second out of 18 participating teams. CaRe-CNN showed great improvements most
+notably when segmenting the difficult but clinically most relevant myocardial
+infarct tissue (MIT) as well as microvascular obstructions (MVO). When
+computing the average scores over all labels, our method obtained the best
+score in eight out of ten metrics. Thus, accurate cardiac segmentation after
+acute MI via our CaRe-CNN allows generating patient-specific models of the
+heart serving as an important step towards personalized medicine.
+
+
+
+ comment: Accepted at VISIGRAPP 2024, 12 pages
+
+
+
+
+
+
+ ♻ ☆ Local region-learning modules for point cloud classification
+
+
+ Data organization via forming local regions is an integral part of deep
+learning networks that process 3D point clouds in a hierarchical manner. At
+each level, the point cloud is sampled to extract representative points and
+these points are used to be centers of local regions. The organization of local
+regions is of considerable importance since it determines the location and size
+of the receptive field at a particular layer of feature aggregation. In this
+paper, we present two local region-learning modules: Center Shift Module to
+infer the appropriate shift for each center point, and Radius Update Module to
+alter the radius of each local region. The parameters of the modules are
+learned through optimizing the loss associated with the particular task within
+an end-to-end network. We present alternatives for these modules through
+various ways of modeling the interactions of the features and locations of 3D
+points in the point cloud. We integrated both modules independently and
+together to the PointNet++ and PointCNN object classification architectures,
+and demonstrated that the modules contributed to a significant increase in
+classification accuracy for the ScanObjectNN data set consisting of scans of
+real-world objects. Our further experiments on ShapeNet data set showed that
+the modules are also effective on 3D CAD models.
+
+
+
+
+
+
+
+ ♻ ☆ BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
+
+
+ The recently emerging text-to-motion advances have spired numerous attempts
+for convenient and interactive human motion generation. Yet, existing methods
+are largely limited to generating body motions only without considering the
+rich two-hand motions, let alone handling various conditions like body dynamics
+or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal
+dataset for two-hand motion generation. Our dataset includes accurate motion
+tracking for the human body and hands and provides pair-wised finger-level hand
+annotations and body descriptions. We further provide a strong baseline method,
+BOTH2Hands, for the novel task: generating vivid two-hand motions from both
+implicit body dynamics and explicit text prompts. We first warm up two parallel
+body-to-hand and text-to-hand diffusion models and then utilize the
+cross-attention transformer for motion blending. Extensive experiments and
+cross-validations demonstrate the effectiveness of our approach and dataset for
+generating convincing two-hand motions from the hybrid body-and-textual
+conditions. Our dataset and code will be disseminated to the community for
+future research.
+
+
+
+
+
+
+
+
+ Won Jo, Geuntaek Lim, Gwangjin Lee, Hyunwoo Kim, Byungsoo Ko, Yukyung Choi
+
+
+ In content-based video retrieval (CBVR), dealing with large-scale
+collections, efficiency is as important as accuracy; thus, several video-level
+feature-based studies have actively been conducted. Nevertheless, owing to the
+severe difficulty of embedding a lengthy and untrimmed video into a single
+feature, these studies have been insufficient for accurate retrieval compared
+to frame-level feature-based studies. In this paper, we show that appropriate
+suppression of irrelevant frames can provide insight into the current obstacles
+of the video-level approaches. Furthermore, we propose a Video-to-Video
+Suppression network (VVS) as a solution. VVS is an end-to-end framework that
+consists of an easy distractor elimination stage to identify which frames to
+remove and a suppression weight generation stage to determine the extent to
+suppress the remaining frames. This structure is intended to effectively
+describe an untrimmed video with varying content and meaningless information.
+Its efficacy is proved via extensive experiments, and we show that our approach
+is not only state-of-the-art in video-level approaches but also has a fast
+inference time despite possessing retrieval capabilities close to those of
+frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS
+
+
+
+ comment: AAAI-24
+
+
+
+
+
+
+ ♻ ☆ FIRe: Fast Inverse Rendering using Directional and Signed Distance
+ Functions WACV'24
+
+
+
+
+
+
+
+
+ Tarun Yenamandra, Ayush Tewari, Nan Yang, Florian Bernard, Christian Theobalt, Daniel Cremers
+
+
+ Neural 3D implicit representations learn priors that are useful for diverse
+applications, such as single- or multiple-view 3D reconstruction. A major
+downside of existing approaches while rendering an image is that they require
+evaluating the network multiple times per camera ray so that the high
+computational time forms a bottleneck for downstream applications. We address
+this problem by introducing a novel neural scene representation that we call
+the directional distance function (DDF). To this end, we learn a signed
+distance function (SDF) along with our DDF model to represent a class of
+shapes. Specifically, our DDF is defined on the unit sphere and predicts the
+distance to the surface along any given direction. Therefore, our DDF allows
+rendering images with just a single network evaluation per camera ray. Based on
+our DDF, we present a novel fast algorithm (FIRe) to reconstruct 3D shapes
+given a posed depth map. We evaluate our proposed method on 3D reconstruction
+from single-view depth images, where we empirically show that our algorithm
+reconstructs 3D shapes more accurately and it is more than 15 times faster (per
+iteration) than competing methods.
+
+
+ Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical
+role in the diagnosis of many medical diseases, such as Arteriovenous
+Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant
+application value, the reconstruction of 4D DSA demands numerous views to
+effectively model the intricate vessels and radiocontrast flow, thereby
+implying a significant radiation dose. To address this high radiation issue, we
+propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA
+reconstruction, which paves the way for high-quality 4D imaging. Additionally,
+2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA
+images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation
+properties from both spatial and temporal dimensions. It is optimized by
+minimizing discrepancies between the rendered images and sparse 2D DSA images.
+Without any neural network involved, TiAVox enjoys specific physical
+interpretability. The parameters of each learnable voxel represent the
+attenuation coefficients. We validated the TiAVox approach on both clinical and
+simulated datasets, achieving a 31.23 Peak Signal-to-Noise Ratio (PSNR) for
+novel view synthesis using only 30 views on the clinically sourced dataset,
+whereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly,
+with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32
+for novel view synthesis and 41.40 for 3D reconstruction. We also executed
+ablation studies to corroborate the essential components of TiAVox. The code
+will be publically available.
+
+
+
+
+
+
+
+
+ Joonhyun Jeong, Geondo Park, Jayeon Yoo, Hyungsik Jung, Heesu Kim
+
+
+ Open-vocabulary object detection (OVOD) aims to recognize novel objects whose
+categories are not included in the training set. In order to classify these
+unseen classes during training, many OVOD frameworks leverage the zero-shot
+capability of largely pretrained vision and language models, such as CLIP. To
+further improve generalization on the unseen novel classes, several approaches
+proposed to additionally train with pseudo region labeling on the external data
+sources that contain a substantial number of novel category labels beyond the
+existing training data. Albeit its simplicity, these pseudo-labeling methods
+still exhibit limited improvement with regard to the truly unseen novel classes
+that were not pseudo-labeled. In this paper, we present a novel, yet simple
+technique that helps generalization on the overall distribution of novel
+classes. Inspired by our observation that numerous novel classes reside within
+the convex hull constructed by the base (seen) classes in the CLIP embedding
+space, we propose to synthesize proxy-novel classes approximating novel classes
+via linear mixup between a pair of base classes. By training our detector with
+these synthetic proxy-novel classes, we effectively explore the embedding space
+of novel classes. The experimental results on various OVOD benchmarks such as
+LVIS and COCO demonstrate superior performance on novel classes compared to the
+other state-of-the-art methods. Code is available at
+https://github.com/clovaai/ProxyDet.
+
+
+ The aim of audio-visual segmentation (AVS) is to precisely differentiate
+audible objects within videos down to the pixel level. Traditional approaches
+often tackle this challenge by combining information from various modalities,
+where the contribution of each modality is implicitly or explicitly modeled.
+Nevertheless, the interconnections between different modalities tend to be
+overlooked in audio-visual modeling. In this paper, inspired by the human
+ability to mentally simulate the sound of an object and its visual appearance,
+we introduce a bidirectional generation framework. This framework establishes
+robust correlations between an object's visual characteristics and its
+associated sound, thereby enhancing the performance of AVS. To achieve this, we
+employ a visual-to-audio projection component that reconstructs audio features
+from object segmentation masks and minimizes reconstruction errors. Moreover,
+recognizing that many sounds are linked to object movements, we introduce an
+implicit volumetric motion estimation module to handle temporal dynamics that
+may be challenging to capture using conventional optical flow methods. To
+showcase the effectiveness of our approach, we conduct comprehensive
+experiments and analyses on the widely recognized AVSBench benchmark. As a
+result, we establish a new state-of-the-art performance level in the AVS
+benchmark, particularly excelling in the challenging MS3 subset which involves
+segmenting multiple sound sources. To facilitate reproducibility, we plan to
+release both the source code and the pre-trained model.
+
+
+
+ comment: AAAI Camera Ready. Dawei Hao and Yuxin Mao contribute equality to
+ this paper. Yiran Zhong is the corresponding author. The code will be
+ released at https://github.com/OpenNLPLab/AVS-bidirectional
+
+
+
+
+
+
+ ♻ ☆ Out-of-Distribution Detection in Long-Tailed Recognition with Calibrated
+ Outlier Class Learning AAAI2024
+
+
+ Existing out-of-distribution (OOD) methods have shown great success on
+balanced datasets but become ineffective in long-tailed recognition (LTR)
+scenarios where 1) OOD samples are often wrongly classified into head classes
+and/or 2) tail-class samples are treated as OOD samples. To address these
+issues, current studies fit a prior distribution of auxiliary/pseudo OOD data
+to the long-tailed in-distribution (ID) data. However, it is difficult to
+obtain such an accurate prior distribution given the unknowingness of real OOD
+samples and heavy class imbalance in LTR. A straightforward solution to avoid
+the requirement of this prior is to learn an outlier class to encapsulate the
+OOD samples. The main challenge is then to tackle the aforementioned confusion
+between OOD samples and head/tail-class samples when learning the outlier
+class. To this end, we introduce a novel calibrated outlier class learning
+(COCL) approach, in which 1) a debiased large margin learning method is
+introduced in the outlier class learning to distinguish OOD samples from both
+head and tail classes in the representation space and 2) an outlier-class-aware
+logit calibration method is defined to enhance the long-tailed classification
+confidence. Extensive empirical results on three popular benchmarks CIFAR10-LT,
+CIFAR100-LT, and ImageNet-LT demonstrate that COCL substantially outperforms
+state-of-the-art OOD detection methods in LTR while being able to improve the
+classification accuracy on ID data. Code is available at
+https://github.com/mala-lab/COCL.
+
+
+
+ comment: AAAI2024, with supplementary material
+
+
+
+
+
+
+ ♻ ☆ Supervision Interpolation via LossMix: Generalizing Mixup for Object
+ Detection and Beyond AAAI-24
+
+
+ The success of data mixing augmentations in image classification tasks has
+been well-received. However, these techniques cannot be readily applied to
+object detection due to challenges such as spatial misalignment,
+foreground/background distinction, and plurality of instances. To tackle these
+issues, we first introduce a novel conceptual framework called Supervision
+Interpolation (SI), which offers a fresh perspective on interpolation-based
+augmentations by relaxing and generalizing Mixup. Based on SI, we propose
+LossMix, a simple yet versatile and effective regularization that enhances the
+performance and robustness of object detectors and more. Our key insight is
+that we can effectively regularize the training on mixed data by interpolating
+their loss errors instead of ground truth labels. Empirical results on the
+PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently
+outperform state-of-the-art methods widely adopted for detection. Furthermore,
+by jointly leveraging LossMix with unsupervised domain adaptation, we
+successfully improve existing approaches and set a new state of the art for
+cross-domain object detection.
+
+
+
+ comment: AAAI-24 Camera Ready Version, with supplementary material, 15 pages
+
+
+
+
+
+
+ ♻ ☆ Identifying Label Errors in Object Detection Datasets by Loss Inspection
+
+
+
+
+
+
+
+
+ Marius Schubert, Tobias Riedlinger, Karsten Kahl, Daniel Kröll, Sebastian Schoenen, Siniša Šegvić, Matthias Rottmann
+
+
+ Labeling datasets for supervised object detection is a dull and
+time-consuming task. Errors can be easily introduced during annotation and
+overlooked during review, yielding inaccurate benchmarks and performance
+degradation of deep neural networks trained on noisy labels. In this work, we
+for the first time introduce a benchmark for label error detection methods on
+object detection datasets as well as a label error detection method and a
+number of baselines. We simulate four different types of randomly introduced
+label errors on train and test sets of well-labeled object detection datasets.
+For our label error detection method we assume a two-stage object detector to
+be given and consider the sum of both stages' classification and regression
+losses. The losses are computed with respect to the predictions and the noisy
+labels including simulated label errors, aiming at detecting the latter. We
+compare our method to three baselines: a naive one without deep learning, the
+object detector's score and the entropy of the classification softmax
+distribution. We outperform all baselines and demonstrate that among the
+considered methods, ours is the only one that detects label errors of all four
+types efficiently. Furthermore, we detect real label errors a) on commonly used
+test datasets in object detection and b) on a proprietary dataset. In both
+cases we achieve low false positives rates, i.e., we detect label errors with a
+precision for a) of up to 71.5% and for b) with 97%.
+
+
+
+
+
+
+
+ ♻ ☆ Long-Tailed Classification Based on Coarse-Grained Leading Forest and
+ Multi-Center Loss
+
+
+
+
+
+
+
+
+ Jinye Yang, Ji Xu, Di Wu, Jianhang Tang, Shaobo Li, Guoyin Wang
+
+
+ Long-tailed (LT) classification is an unavoidable and challenging problem in
+the real world. Most existing long-tailed classification methods focus only on
+solving the class-wise imbalance while ignoring the attribute-wise imbalance.
+The deviation of a classification model is caused by both class-wise and
+attribute-wise imbalance. Due to the fact that attributes are implicit in most
+datasets and the combination of attributes is complex, attribute-wise imbalance
+is more difficult to handle. For this purpose, we proposed a novel long-tailed
+classification framework, aiming to build a multi-granularity classification
+model by means of invariant feature learning. This method first unsupervisedly
+constructs Coarse-Grained forest (CLF) to better characterize the distribution
+of attributes within a class. Depending on the distribution of attributes, one
+can customize suitable sampling strategies to construct different imbalanced
+datasets. We then introduce multi-center loss (MCL) that aims to gradually
+eliminate confusing attributes during feature learning process. The proposed
+framework does not necessarily couple to a specific LT classification model
+structure and can be integrated with any existing LT method as an independent
+component. Extensive experiments show that our approach achieves
+state-of-the-art performance on both existing benchmarks ImageNet-GLT and
+MSCOCO-GLT and can improve the performance of existing LT methods. Our codes
+are available on GitHub: \url{https://github.com/jinyery/cognisance}
+
+
+
+ comment: This is another research work to apply leading tree structure along
+ with deep learning architecture, aiming to deal with attribute-wise long-tail
+ distribution within class
+
+ Conventional Federated Domain Adaptation (FDA) approaches usually demand an
+abundance of assumptions, which makes them significantly less feasible for
+real-world situations and introduces security hazards. This paper relaxes the
+assumptions from previous FDAs and studies a more practical scenario named
+Universal Federated Domain Adaptation (UFDA). It only requires the black-box
+model and the label set information of each source domain, while the label sets
+of different source domains could be inconsistent, and the target-domain label
+set is totally blind. Towards a more effective solution for our newly proposed
+UFDA scenario, we propose a corresponding methodology called Hot-Learning with
+Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain
+shifts and category gaps problems by using one-hot outputs from the black-box
+models of various source domains. Moreover, to better distinguish the shared
+and unknown classes, we further present a cluster-level strategy named
+Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer
+classes from both source and target domains. Extensive experiments on three
+benchmark datasets demonstrate that our method achieves comparable performance
+for our UFDA scenario with much fewer assumptions, compared to previous
+methodologies with comprehensive additional assumptions.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image
+ Diffusion Model for Interior Design
+
+
+ With the open-sourcing of text-to-image models (T2I) such as stable diffusion
+(SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned
+in specific domains based on the open-source SD model, such as in anime,
+character portraits, etc. However, there are few specialized models in certain
+domains, such as interior design, which is attributed to the complex textual
+descriptions and detailed visual elements inherent in design, alongside the
+necessity for adaptable resolution. Therefore, text-to-image models for
+interior design are required to have outstanding prompt-following capabilities,
+as well as iterative collaboration with design professionals to achieve the
+desired outcome. In this paper, we collect and optimize text-image data in the
+design field and continue training in both English and Chinese on the basis of
+the open-source CLIP model. We also proposed a fine-tuning strategy with
+curriculum learning and reinforcement learning from CLIP feedback to enhance
+the prompt-following capabilities of our approach so as to improve the quality
+of image generation. The experimental results on the collected dataset
+demonstrate the effectiveness of the proposed approach, which achieves
+impressive results and outperforms strong baselines.
+
+
+
+
+
+
+
+ ♻ ☆ ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models
+ with Enhanced Adapter
+
+
+
+
+
+
+
+
+ Zhengqing Yuan, Xinyi Wang, Kun Wang, Lichao Sun, Yanfang Ye
+
+
+ In recent years, advancements in large language models have been remarkable,
+with models such as ChatGPT demonstrating exceptional proficiency in diverse
+linguistic tasks. The pre-training of large models with billions of parameters,
+poses a formidable challenge, primarily due to the scarcity of datasets of a
+commensurate scale for effective training. Nevertheless, innovative strategies
+have emerged, including methods to fine-tune these pre-trained models using
+fewer parameters set, as evidenced by models like MiniGPT-4 and LLaVA. Despite
+their potential in various domains, these models remain limited in their
+understanding of artistic imagery. They have yet to fully grasp the intricate
+nuances of art images or to provide an objective articulation of the emotions
+they evoke, in a manner akin to human perception. This work introduces
+ArtGPT-4, a pioneering large vision-language model tailored to address the
+deficiencies of contemporary models in artistic comprehension. ArtGPT-4
+underwent training on image-text pairs utilizing a Tesla A100 device in a mere
+2 hours, with a dataset comprising approximately 0.52M entries. Impressively,
+the model can render images with an artistic-understanding and convey the
+emotions they inspire, mirroring human interpretation. Additionally, this work
+presents a unique dataset designed to evaluate the efficacy of vision-language
+models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art
+performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the
+established benchmarks introduced in This study, lagging behind professional
+artists' descriptions by a negligible 0.15 points on a 6-point scale. The code
+and the pre-trained model are accessible in
+https://huggingface.co/Tyrannosaurus/ArtGPT-4.
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ♻ ☆ Traffic Incident Database with Multiple Labels Including Various
+ Perspective Environmental Information IROS
+
+
+ A large dataset of annotated traffic accidents is necessary to improve the
+accuracy of traffic accident recognition using deep learning models.
+Conventional traffic accident datasets provide annotations on traffic accidents
+and other teacher labels, improving traffic accident recognition performance.
+However, the labels annotated in conventional datasets need to be more
+comprehensive to describe traffic accidents in detail. Therefore, we propose
+V-TIDB, a large-scale traffic accident recognition dataset annotated with
+various environmental information as multi-labels. Our proposed dataset aims to
+improve the performance of traffic accident recognition by annotating ten types
+of environmental information as teacher labels in addition to the presence or
+absence of traffic accidents. V-TIDB is constructed by collecting many videos
+from the Internet and annotating them with appropriate environmental
+information. In our experiments, we compare the performance of traffic accident
+recognition when only labels related to the presence or absence of traffic
+accidents are trained and when environmental information is added as a
+multi-label. In the second experiment, we compare the performance of the
+training with only contact level, which represents the severity of the traffic
+accident, and the performance with environmental information added as a
+multi-label. The results showed that 6 out of 10 environmental information
+labels improved the performance of recognizing the presence or absence of
+traffic accidents. In the experiment on the degree of recognition of traffic
+accidents, the performance of recognition of car wrecks and contacts was
+improved for all environmental information. These experiments show that V-TIDB
+can be used to learn traffic accident recognition models that take
+environmental information into account in detail and can be used for
+appropriate traffic accident analysis.
+
+
+
+ comment: Conference paper accepted to IEEE/RSJ International Conference on
+ Intelligent Robots and Systems (IROS), 2023 Reason for revision: Corrected
+ due to a missing space between sentences in the preview's abstract, which led
+ to an unintended URL interpretation
+
+
+
+
+
+
+ ♻ ☆ PointVST: Self-SupervisedPre-training for 3D Point Clouds via
+ View-Specific Point-to-Image Translation
+
+
+ The past few years have witnessed the great success and prevalence of
+self-supervised representation learning within the language and 2D vision
+communities. However, such advancements have not been fully migrated to the
+field of 3D point cloud learning. Different from existing pre-training
+paradigms designed for deep point cloud feature extractors that fall into the
+scope of generative modeling or contrastive learning, this paper proposes a
+translative pre-training framework, namely PointVST, driven by a novel
+self-supervised pretext task of cross-modal translation from 3D point clouds to
+their corresponding diverse forms of 2D rendered images. More specifically, we
+begin with deducing view-conditioned point-wise embeddings through the
+insertion of the viewpoint indicator, and then adaptively aggregate a
+view-specific global codeword, which can be further fed into subsequent 2D
+convolutional translation heads for image generation. Extensive experimental
+evaluations on various downstream task scenarios demonstrate that our PointVST
+shows consistent and prominent performance superiority over current
+state-of-the-art approaches as well as satisfactory domain transfer capability.
+Our code will be publicly available at https://github.com/keeganhk/PointVST.
+
+
+
+ comment: Accepted in IEEE TVCG
+
+
+
+
+
+
+ ♻ ☆ DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic
+ Segmentation Using Diffusion Models
+
+
+
+
+
+
+
+
+ Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen
+
+
+ Collecting and annotating images with pixel-wise labels is time-consuming and
+laborious. In contrast, synthetic data can be freely available using a
+generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that
+it is possible to automatically obtain accurate semantic masks of synthetic
+images generated by the Off-the-shelf Stable Diffusion model, which uses only
+text-image pairs during training. Our approach, called DiffuMask, exploits the
+potential of the cross-attention map between text and image, which is natural
+and seamless to extend the text-driven image synthesis to semantic mask
+generation. DiffuMask uses text-guided cross-attention information to localize
+class/word-specific regions, which are combined with practical techniques to
+create a novel high-resolution and class-discriminative pixel-wise mask. The
+methods help to reduce data collection and annotation costs obviously.
+Experiments demonstrate that the existing segmentation methods trained on
+synthetic data of DiffuMask can achieve a competitive performance over the
+counterpart of real data (VOC 2012, Cityscapes). For some classes (e.g., bird),
+DiffuMask presents promising performance, close to the stateof-the-art result
+of real data (within 3% mIoU gap). Moreover, in the open-vocabulary
+segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on
+Unseen class of VOC 2012. The project website can be found at
+https://weijiawu.github.io/DiffusionMask/.
+
+
+
+
+
+
+
+ ♻ ☆ Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding ICASSP2024
+
+
+ Drawing on theoretical insights, we advocate an error-based thresholding
+(EBT) mechanism for learned ISTA (LISTA), which utilizes a function of the
+layer-wise reconstruction error to suggest a specific threshold for each
+observation in the shrinkage function of each layer. We show that the proposed
+EBT mechanism well disentangles the learnable parameters in the shrinkage
+functions from the reconstruction errors, endowing the obtained models with
+improved adaptivity to possible data variations. With rigorous analyses, we
+further show that the proposed EBT also leads to a faster convergence on the
+basis of LISTA or its variants, in addition to its higher adaptivity. Extensive
+experimental results confirm our theoretical analyses and verify the
+effectiveness of our methods.
+
+
+ Multi-frame methods improve monocular depth estimation over single-frame
+approaches by aggregating spatial-temporal information via feature matching.
+However, the spatial-temporal feature leads to accuracy degradation in dynamic
+scenes. To enhance the performance, recent methods tend to propose complex
+architectures for feature matching and dynamic scenes. In this paper, we show
+that a simple learning framework, together with designed feature augmentation,
+leads to superior performance. (1) A novel dynamic objects detecting method
+with geometry explainability is proposed. The detected dynamic objects are
+excluded during training, which guarantees the static environment assumption
+and relieves the accuracy degradation problem of the multi-frame depth
+estimation. (2) Multi-scale feature fusion is proposed for feature matching in
+the multi-frame depth network, which improves feature matching, especially
+between frames with large camera motion. (3) The robust knowledge distillation
+with a robust teacher network and reliability guarantee is proposed, which
+improves the multi-frame depth estimation without computation complexity
+increase during the test. The experiments show that our proposed methods
+achieve great performance improvement on the multi-frame depth estimation.
+
+
+ Detecting objects accurately from a large or open vocabulary necessitates the
+vision-language alignment on region representations. However, learning such a
+region-text alignment by obtaining high-quality box annotations with text
+labels or descriptions is expensive and infeasible. In contrast, collecting
+image-text pairs is simpler but lacks precise object location information to
+associate regions with texts. In this paper, we propose a novel approach called
+Contrastive Language-Image Mosaic (CLIM), which leverages large-scale
+image-text pairs effectively for aligning region and text representations. CLIM
+combines multiple images into a mosaicked image and treats each image as a
+`pseudo region'. The feature of each pseudo region is extracted and trained to
+be similar to the corresponding text embedding while dissimilar from others by
+a contrastive loss, enabling the model to learn the region-text alignment
+without costly box annotations. As a generally applicable approach, CLIM
+consistently improves different open-vocabulary object detection methods that
+use caption supervision. Furthermore, CLIM can effectively enhance the region
+representation of vision-language models, thus providing stronger backbones for
+open-vocabulary object detectors. Our experimental results demonstrate that
+CLIM improves different baseline open-vocabulary object detectors by a large
+margin on both OV-COCO and OV-LVIS benchmarks. The code is available at
+https://github.com/wusize/CLIM.
+
+
+
+
+
+
+
+ ♻ ☆ Personalization as a Shortcut for Few-Shot Backdoor Attack against
+ Text-to-Image Diffusion Models AAAI 2024
+
+
+
+
+
+
+
+
+ Yihao Huang, Felix Juefei-Xu, Qing Guo, Jie Zhang, Yutong Wu, Ming Hu, Tianlin Li, Geguang Pu, Yang Liu
+
+
+ Although recent personalization methods have democratized high-resolution
+image synthesis by enabling swift concept acquisition with minimal examples and
+lightweight computation, they also present an exploitable avenue for high
+accessible backdoor attacks. This paper investigates a critical and unexplored
+aspect of text-to-image (T2I) diffusion models - their potential vulnerability
+to backdoor attacks via personalization. Our study focuses on a zero-day
+backdoor vulnerability prevalent in two families of personalization methods,
+epitomized by Textual Inversion and DreamBooth.Compared to traditional backdoor
+attacks, our proposed method can facilitate more precise, efficient, and easily
+accessible attacks with a lower barrier to entry. We provide a comprehensive
+review of personalization in T2I diffusion models, highlighting the operation
+and exploitation potential of this backdoor vulnerability. To be specific, by
+studying the prompt processing of Textual Inversion and DreamBooth, we have
+devised dedicated backdoor attacks according to the different ways of dealing
+with unseen tokens and analyzed the influence of triggers and concept images on
+the attack effect. Through comprehensive empirical study, we endorse the
+utilization of the nouveau-token backdoor attack due to its impressive
+effectiveness, stealthiness, and integrity, markedly outperforming the
+legacy-token backdoor attack.
+
+
+
+ comment: 10 pages, accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Mean Teacher DETR with Masked Feature Alignment: A Robust Domain
+ Adaptive Detection Transformer Framework AAAI2024
+
+
+ Unsupervised domain adaptation object detection (UDAOD) research on Detection
+Transformer(DETR) mainly focuses on feature alignment and existing methods can
+be divided into two kinds, each of which has its unresolved issues. One-stage
+feature alignment methods can easily lead to performance fluctuation and
+training stagnation. Two-stage feature alignment method based on mean teacher
+comprises a pretraining stage followed by a self-training stage, each facing
+problems in obtaining reliable pretrained model and achieving consistent
+performance gains. Methods mentioned above have not yet explore how to utilize
+the third related domain such as target-like domain to assist adaptation. To
+address these issues, we propose a two-stage framework named MTM, i.e. Mean
+Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we
+utilize labeled target-like images produced by image style transfer to avoid
+performance fluctuation. In the self-training stage, we leverage unlabeled
+target images by pseudo labels based on mean teacher and propose a module
+called Object Queries Knowledge Transfer (OQKT) to ensure consistent
+performance gains of the student model. Most importantly, we propose masked
+feature alignment methods including Masked Domain Query-based Feature Alignment
+(MDQFA) and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain
+shift in a more robust way, which not only prevent training stagnation and lead
+to a robust pretrained model in the pretraining stage, but also enhance the
+model's target performance in the self-training stage. Experiments on three
+challenging scenarios and a theoretical analysis verify the effectiveness of
+MTM.
+
+
+
+ comment: AAAI2024
+
+
+
+
+
+
+ ♻ ☆ A Recent Survey of Vision Transformers for Medical Image Segmentation
+
+
+
+
+
+
+
+
+ Asifullah Khan, Zunaira Rauf, Abdul Rehman Khan, Saima Rathore, Saddam Hussain Khan, Najmus Saher Shah, Umair Farooq, Hifsa Asif, Aqsa Asif, Umme Zahoora, Rafi Ullah Khalil, Suleman Qamar, Umme Hani Asif, Faiza Babar Khan, Abdul Majid, Jeonghwan Gwak
+
+
+ Medical image segmentation plays a crucial role in various healthcare
+applications, enabling accurate diagnosis, treatment planning, and disease
+monitoring. Traditionally, convolutional neural networks (CNNs) dominated this
+domain, excelling at local feature extraction. However, their limitations in
+capturing long-range dependencies across image regions pose challenges for
+segmenting complex, interconnected structures often encountered in medical
+data. In recent years, Vision Transformers (ViTs) have emerged as a promising
+technique for addressing the challenges in medical image segmentation. Their
+multi-scale attention mechanism enables effective modeling of long-range
+dependencies between distant structures, crucial for segmenting organs or
+lesions spanning the image. Additionally, ViTs' ability to discern subtle
+pattern heterogeneity allows for the precise delineation of intricate
+boundaries and edges, a critical aspect of accurate medical image segmentation.
+However, they do lack image-related inductive bias and translational
+invariance, potentially impacting their performance. Recently, researchers have
+come up with various ViT-based approaches that incorporate CNNs in their
+architectures, known as Hybrid Vision Transformers (HVTs) to capture local
+correlation in addition to the global information in the images. This survey
+paper provides a detailed review of the recent advancements in ViTs and HVTs
+for medical image segmentation. Along with the categorization of ViT and
+HVT-based medical image segmentation approaches, we also present a detailed
+overview of their real-time applications in several medical image modalities.
+This survey may serve as a valuable resource for researchers, healthcare
+practitioners, and students in understanding the state-of-the-art approaches
+for ViT-based medical image segmentation.
+
+
+
+
+
+
+
+ ♻ ☆ How to Bridge the Gap between Modalities: A Comprehensive Survey on
+ Multimodal Large Language Model
+
+
+ This review paper explores Multimodal Large Language Models (MLLMs), which
+integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data
+such as text and vision. MLLMs demonstrate capabilities like generating image
+narratives and answering image-based questions, bridging the gap towards
+real-world human-computer interactions and hinting at a potential pathway to
+artificial general intelligence. However, MLLMs still face challenges in
+processing the semantic gap in multimodality, which may lead to erroneous
+generation, posing potential risks to society. Choosing the appropriate
+modality alignment method is crucial, as improper methods might require more
+parameters with limited performance improvement. This paper aims to explore
+modality alignment methods for LLMs and their existing capabilities.
+Implementing modality alignment allows LLMs to address environmental issues and
+enhance accessibility. The study surveys existing modal alignment methods in
+MLLMs into four groups: (1) Multimodal Converters that change data into
+something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs
+perceive different types of data; (3) Tools Assistance for changing data into
+one common format, usually text; and (4) Data-Driven methods that teach LLMs to
+understand specific types of data in a dataset. This field is still in a phase
+of exploration and experimentation, and we will organize and update various
+existing research methods for multimodal information alignment.
+
+
+
+
+
+
+
+ ♻ ☆ IPMix: Label-Preserving Data Augmentation Method for Training Robust
+ Classifiers NeurIPS 2023
+
+
+
+
+
+
+
+
+ Zhenglin Huang, Xianan Bao, Na Zhang, Qingqi Zhang, Xiaomei Tu, Biao Wu, Xi Yang
+
+
+ Data augmentation has been proven effective for training high-accuracy
+convolutional neural network classifiers by preventing overfitting. However,
+building deep neural networks in real-world scenarios requires not only high
+accuracy on clean data but also robustness when data distributions shift. While
+prior methods have proposed that there is a trade-off between accuracy and
+robustness, we propose IPMix, a simple data augmentation approach to improve
+robustness without hurting clean accuracy. IPMix integrates three levels of
+data augmentation (image-level, patch-level, and pixel-level) into a coherent
+and label-preserving technique to increase the diversity of training data with
+limited computational overhead. To further improve the robustness, IPMix
+introduces structural complexity at different levels to generate more diverse
+images and adopts the random mixing method for multi-scale information fusion.
+Experiments demonstrate that IPMix outperforms state-of-the-art corruption
+robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also
+significantly improves the other safety measures, including robustness to
+adversarial perturbations, calibration, prediction consistency, and anomaly
+detection, achieving state-of-the-art or comparable results on several
+benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
+
+
+
+ comment: NeurIPS 2023
+
+
+
+
+
+
+ ♻ ☆ Learning Dense Correspondence for NeRF-Based Face Reenactment AAAI
+
+
+ Face reenactment is challenging due to the need to establish dense
+correspondence between various face representations for motion transfer. Recent
+studies have utilized Neural Radiance Field (NeRF) as fundamental
+representation, which further enhanced the performance of multi-view face
+reenactment in photo-realism and 3D consistency. However, establishing dense
+correspondence between different face NeRFs is non-trivial, because implicit
+representations lack ground-truth correspondence annotations like mesh-based 3D
+parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning
+3DMM space with NeRF-based face representations can realize motion control, it
+is sub-optimal for their limited face-only modeling and low identity fidelity.
+Therefore, we are inspired to ask: Can we learn the dense correspondence
+between different NeRF-based face representations without a 3D parametric model
+prior? To address this challenge, we propose a novel framework, which adopts
+tri-planes as fundamental NeRF representation and decomposes face tri-planes
+into three components: canonical tri-planes, identity deformations, and motion.
+In terms of motion control, our key contribution is proposing a Plane
+Dictionary (PlaneDict) module, which efficiently maps the motion conditions to
+a linear weighted addition of learnable orthogonal plane bases. To the best of
+our knowledge, our framework is the first method that achieves one-shot
+multi-view face reenactment without a 3D parametric model prior. Extensive
+experiments demonstrate that we produce better results in fine-grained motion
+control and identity preservation than previous methods.
+
+
+
+ comment: Accepted by Proceedings of the AAAI Conference on Artificial
+ Intelligence, 2024
+
+
+
+
+
+
+ ♻ ☆ Transferring Modality-Aware Pedestrian Attentive Learning for
+ Visible-Infrared Person Re-identification
+
+
+ Visible-infrared person re-identification (VI-ReID) aims to search the same
+pedestrian of interest across visible and infrared modalities. Existing models
+mainly focus on compensating for modality-specific information to reduce
+modality variation. However, these methods often lead to a higher computational
+overhead and may introduce interfering information when generating the
+corresponding images or features. To address this issue, it is critical to
+leverage pedestrian-attentive features and learn modality-complete and
+-consistent representation. In this paper, a novel Transferring Modality-Aware
+Pedestrian Attentive Learning (TMPA) model is proposed, focusing on the
+pedestrian regions to efficiently compensate for missing modality-specific
+features. Specifically, we propose a region-based data augmentation module
+PedMix to enhance pedestrian region coherence by mixing the corresponding
+regions from different modalities. A lightweight hybrid compensation module,
+i.e., the Modality Feature Transfer (MFT), is devised to integrate cross
+attention and convolution networks to fully explore the discriminative
+modality-complete features with minimal computational overhead. Extensive
+experiments conducted on the benchmark SYSU-MM01 and RegDB datasets
+demonstrated the effectiveness of our proposed TMPA model.
+
+
+
+
+
+
+
+ ♻ ☆ Shot2Story20K: A New Benchmark for Comprehensive Understanding of
+ Multi-shot Videos
+
+
+ A short clip of video may contain progression of multiple events and an
+interesting story line. A human need to capture both the event in every shot
+and associate them together to understand the story behind it. In this work, we
+present a new multi-shot video understanding benchmark Shot2Story20K with
+detailed shot-level captions and comprehensive video summaries. To facilitate
+better semantic understanding of videos, we provide captions for both visual
+signals and human narrations. We design several distinct tasks including
+single-shot video and narration captioning, multi-shot video summarization, and
+video retrieval with shot descriptions. Preliminary experiments show some
+challenges to generate a long and comprehensive video summary. Nevertheless,
+the generated imperfect summaries can already significantly boost the
+performance of existing video understanding tasks such as video
+question-answering, promoting an under-explored setting of video understanding
+with detailed summaries.
+
+
+
+ comment: See https://mingfei.info/shot2story for updates and more information
+
+
+
+
+
+
+ ♻ ☆ One at a Time: Progressive Multi-step Volumetric Probability Learning
+ for Reliable 3D Scene Perception AAAI2024
+
+
+ Numerous studies have investigated the pivotal role of reliable 3D volume
+representation in scene perception tasks, such as multi-view stereo (MVS) and
+semantic scene completion (SSC). They typically construct 3D probability
+volumes directly with geometric correspondence, attempting to fully address the
+scene perception tasks in a single forward pass. However, such a single-step
+solution makes it hard to learn accurate and convincing volumetric probability,
+especially in challenging regions like unexpected occlusions and complicated
+light reflections. Therefore, this paper proposes to decompose the complicated
+3D volume representation learning into a sequence of generative steps to
+facilitate fine and reliable scene perception. Considering the recent advances
+achieved by strong generative diffusion models, we introduce a multi-step
+learning framework, dubbed as VPD, dedicated to progressively refining the
+Volumetric Probability in a Diffusion process. Extensive experiments are
+conducted on scene perception tasks including multi-view stereo (MVS) and
+semantic scene completion (SSC), to validate the efficacy of our method in
+learning reliable volumetric representations. Notably, for the SSC task, our
+work stands out as the first to surpass LiDAR-based methods on the
+SemanticKITTI dataset.
+
+
+
+ comment: AAAI2024
+
+
+
+
+
+
+ ♻ ☆ On Robustness to Missing Video for Audiovisual Speech Recognition
+
+
+ It has been shown that learning audiovisual features can lead to improved
+speech recognition performance over audio-only features, especially for noisy
+speech. However, in many common applications, the visual features are partially
+or entirely missing, e.g.~the speaker might move off screen. Multi-modal models
+need to be robust: missing video frames should not degrade the performance of
+an audiovisual model to be worse than that of a single-modality audio-only
+model. While there have been many attempts at building robust models, there is
+little consensus on how robustness should be evaluated. To address this, we
+introduce a framework that allows claims about robustness to be evaluated in a
+precise and testable way. We also conduct a systematic empirical study of the
+robustness of common audiovisual speech recognition architectures on a range of
+acoustic noise conditions and test suites. Finally, we show that an
+architecture-agnostic solution based on cascades can consistently achieve
+robustness to missing video, even in settings where existing techniques for
+robustness like dropout fall short.
+
+
+ Automatic action quality assessment (AQA) has attracted increasing attention
+due to its wide applications. However, most existing AQA methods employ
+deterministic models to predict the final score for each action, while
+overlooking the subjectivity and diversity among expert judges during the
+scoring process. In this paper, we propose a novel probabilistic model, named
+Uncertainty-Driven AQA (UD-AQA), to utilize and capture the diversity among
+multiple judge scores. Specifically, we design a Conditional Variational
+Auto-Encoder (CVAE)-based module to encode the uncertainty in expert
+assessment, where multiple judge scores can be produced by sampling latent
+features from the learned latent space multiple times. To further utilize the
+uncertainty, we generate the estimation of uncertainty for each prediction,
+which is employed to re-weight AQA regression loss, effectively reducing the
+influence of uncertain samples during training. Moreover, we further design an
+uncertainty-guided training strategy to dynamically adjust the learning order
+of the samples from low uncertainty to high uncertainty. The experiments show
+that our proposed method achieves competitive results on three benchmarks
+including the Olympic events MTL-AQA and FineDiving, and the surgical skill
+JIGSAWS datasets.
+
+
+
+
+
+
+
+ ♻ ☆ Inventing art styles with no artistic training data
+
+
+ We propose two procedures to create painting styles using models trained only
+on natural images, providing objective proof that the model is not plagiarizing
+human art styles. In the first procedure we use the inductive bias from the
+artistic medium to achieve creative expression. Abstraction is achieved by
+using a reconstruction loss. The second procedure uses an additional natural
+image as inspiration to create a new style. These two procedures make it
+possible to invent new painting styles with no artistic training data. We
+believe that our approach can help pave the way for the ethical employment of
+generative AI in art, without infringing upon the originality of human
+creators.
+
+
+
+ comment: updated title
+
+
+
+
+
+
+ ♻ ☆ Towards Consistent Stochastic Human Motion Prediction via Motion
+ Diffusion
+
+
+ Stochastic Human Motion Prediction (HMP) aims to predict multiple possible
+upcoming pose sequences based on past human motion trajectories. Although
+previous approaches have shown impressive performance, they face several
+issues, including complex training processes and a tendency to generate
+predictions that are often inconsistent with the provided history, and
+sometimes even becoming entirely unreasonable. To overcome these issues, we
+propose DiffMotion, an end-to-end diffusion-based stochastic HMP framework.
+DiffMotion's motion predictor is composed of two modules, including (1) a
+Transformer-based network for initial motion reconstruction from corrupted
+motion, and (2) a Graph Convolutional Network (GCN) to refine the generated
+motion considering past observations. Our method, facilitated by this novel
+Transformer-GCN module design and a proposed variance scheduler, excels in
+predicting accurate, realistic, and consistent motions, while maintaining an
+appropriate level of diversity. Our results on benchmark datasets show that
+DiffMotion significantly outperforms previous methods in terms of both accuracy
+and fidelity, while demonstrating superior robustness.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Learning for Time Series Classification and Extrinsic Regression: A
+ Current Survey
+
+
+ Time Series Classification and Extrinsic Regression are important and
+challenging machine learning tasks. Deep learning has revolutionized natural
+language processing and computer vision and holds great promise in other fields
+such as time series analysis where the relevant features must often be
+abstracted from the raw data but are not known a priori. This paper surveys the
+current state of the art in the fast-moving field of deep learning for time
+series classification and extrinsic regression. We review different network
+architectures and training methods used for these tasks and discuss the
+challenges and opportunities when applying deep learning to time series data.
+We also summarize two critical applications of time series classification and
+extrinsic regression, human activity recognition and satellite earth
+observation.
+
+
+
+
+
+
+
+ ♻ ☆ Trust, but Verify: Robust Image Segmentation using Deep Learning
+
+
+
+
+
+
+
+
+ Fahim Ahmed Zaman, Xiaodong Wu, Weiyu Xu, Milan Sonka, Raghuraman Mudumbai
+
+
+ We describe a method for verifying the output of a deep neural network for
+medical image segmentation that is robust to several classes of random as well
+as worst-case perturbations i.e. adversarial attacks. This method is based on a
+general approach recently developed by the authors called "Trust, but Verify"
+wherein an auxiliary verification network produces predictions about certain
+masked features in the input image using the segmentation as an input. A
+well-designed auxiliary network will produce high-quality predictions when the
+input segmentations are accurate, but will produce low-quality predictions when
+the segmentations are incorrect. Checking the predictions of such a network
+with the original image allows us to detect bad segmentations. However, to
+ensure the verification method is truly robust, we need a method for checking
+the quality of the predictions that does not itself rely on a black-box neural
+network. Indeed, we show that previous methods for segmentation evaluation that
+do use deep neural regression networks are vulnerable to false negatives i.e.
+can inaccurately label bad segmentations as good. We describe the design of a
+verification network that avoids such vulnerability and present results to
+demonstrate its robustness compared to previous methods.
+
+
+ Understanding object recognition patterns in mice is crucial for advancing
+behavioral neuroscience and has significant implications for human health,
+particularly in the realm of Alzheimer's research. This study is centered on
+the development, application, and evaluation of a state-of-the-art
+computational pipeline designed to analyze such behaviors, specifically
+focusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition
+(SLR) tasks. The pipeline integrates three advanced computational models:
+Any-Maze for initial data collection, DeepLabCut for detailed pose estimation,
+and Convolutional Neural Networks (CNNs) for nuanced behavioral classification.
+Employed across four distinct mouse groups, this pipeline demonstrated high
+levels of accuracy and robustness. Despite certain challenges like video
+quality limitations and the need for manual calculations, the results affirm
+the pipeline's efficacy and potential for scalability. The study serves as a
+proof of concept for a multidimensional computational approach to behavioral
+neuroscience, emphasizing the pipeline's versatility and readiness for future,
+more complex analyses.
+
+
+
+ comment: 10 Pages. All code used in this research can be found at
+ https://github.com/bafanaS/DLC-Object-Recognition-Analysis.git
+
+
+
+
+
+
+ ♻ ☆ Bootstrapping Vision-Language Learning with Decoupled Language
+ Pre-training NeurIPS 2023
+
+
+ We present a novel methodology aimed at optimizing the application of frozen
+large language models (LLMs) for resource-intensive vision-language (VL)
+pre-training. The current paradigm uses visual features as prompts to guide
+language models, with a focus on determining the most relevant visual features
+for corresponding text. Our approach diverges by concentrating on the language
+component, specifically identifying the optimal prompts to align with visual
+features. We introduce the Prompt-Transformer (P-Former), a model that predicts
+these ideal prompts, which is trained exclusively on linguistic data, bypassing
+the need for image-text pairings. This strategy subtly bifurcates the
+end-to-end VL training process into an additional, separate stage. Our
+experiments reveal that our framework significantly enhances the performance of
+a robust image-to-text baseline (BLIP-2), and effectively narrows the
+performance gap between models trained with either 4M or 129M image-text pairs.
+Importantly, our framework is modality-agnostic and flexible in terms of
+architectural design, as validated by its successful application in a video
+learning task using varied base modules. The code will be made available at
+https://github.com/yiren-jian/BLIText.
+
+
+
+ comment: Accepted to NeurIPS 2023 (spotlight). The code is available at
+ https://github.com/yiren-jian/BLIText
+
+
+
+
+
+
+ ♻ ☆ Debiasing Scores and Prompts of 2D Diffusion for View-consistent
+ Text-to-3D Generation NeurIPS 2023
+
+
+ Existing score-distilling text-to-3D generation techniques, despite their
+considerable promise, often encounter the view inconsistency problem. One of
+the most notable issues is the Janus problem, where the most canonical view of
+an object (\textit{e.g}., face or head) appears in other views. In this work,
+we explore existing frameworks for score-distilling text-to-3D generation and
+identify the main causes of the view inconsistency problem -- the embedded bias
+of 2D diffusion models. Based on these findings, we propose two approaches to
+debias the score-distillation frameworks for view-consistent text-to-3D
+generation. Our first approach, called score debiasing, involves cutting off
+the score estimated by 2D diffusion models and gradually increasing the
+truncation value throughout the optimization process. Our second approach,
+called prompt debiasing, identifies conflicting words between user prompts and
+view prompts using a language model, and adjusts the discrepancy between view
+prompts and the viewing direction of an object. Our experimental results show
+that our methods improve the realism of the generated 3D objects by
+significantly reducing artifacts and achieve a good trade-off between
+faithfulness to the 2D diffusion models and 3D consistency with little
+overhead. Our project page is available
+at~\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Hashing via Householder Quantization
+
+
+
+
+
+
+
+
+ Lucas R. Schwengber, Lucas Resende, Paulo Orenstein, Roberto I. Oliveira
+
+
+ Hashing is at the heart of large-scale image similarity search, and recent
+methods have been substantially improved through deep learning techniques. Such
+algorithms typically learn continuous embeddings of the data. To avoid a
+subsequent costly binarization step, a common solution is to employ loss
+functions that combine a similarity learning term (to ensure similar images are
+grouped to nearby embeddings) and a quantization penalty term (to ensure that
+the embedding entries are close to binarized entries, e.g., -1 or 1). Still,
+the interaction between these two terms can make learning harder and the
+embeddings worse. We propose an alternative quantization strategy that
+decomposes the learning problem in two stages: first, perform similarity
+learning over the embedding space with no quantization; second, find an optimal
+orthogonal transformation of the embeddings so each coordinate of the embedding
+is close to its sign, and then quantize the transformed embedding through the
+sign function. In the second step, we parametrize orthogonal transformations
+using Householder matrices to efficiently leverage stochastic gradient descent.
+Since similarity measures are usually invariant under orthogonal
+transformations, this quantization strategy comes at no cost in terms of
+performance. The resulting algorithm is unsupervised, fast, hyperparameter-free
+and can be run on top of any existing deep hashing or metric learning
+algorithm. We provide extensive experimental results showing that this approach
+leads to state-of-the-art performance on widely used image datasets, and,
+unlike other quantization strategies, brings consistent improvements in
+performance to existing deep hashing algorithms.
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 9
+
+
+
+
+
+ ☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+ Finding appropriate experts is essential in Community Question Answering
+(CQA) platforms as it enables the effective routing of questions to potential
+users who can provide relevant answers. The key is to personalized learning
+expert representations based on their historical answered questions, and
+accurately matching them with target questions. There have been some
+preliminary works exploring the usability of PLMs in expert finding, such as
+pre-training expert or question representations. However, these models usually
+learn pure text representations of experts from histories, disregarding
+personalized and fine-grained expert modeling. For alleviating this, we present
+a personalized pre-training and fine-tuning paradigm, which could effectively
+learn expert interest and expertise simultaneously. Specifically, in our
+pre-training framework, we integrate historical answered questions of one
+expert with one target question, and regard it as a candidate aware
+expert-level input unit. Then, we fuse expert IDs into the pre-training for
+guiding the model to model personalized expert representations, which can help
+capture the unique characteristics and expertise of each individual expert.
+Additionally, in our pre-training task, we design: 1) a question-level masked
+language model task to learn the relatedness between histories, enabling the
+modeling of question-level expert interest; 2) a vote-oriented task to capture
+question-level expert expertise by predicting the vote score the expert would
+receive. Through our pre-training framework and tasks, our approach could
+holistically learn expert representations including interests and expertise.
+Our method has been extensively evaluated on six real-world CQA datasets, and
+the experimental results consistently demonstrate the superiority of our
+approach over competitive baseline methods.
+
+
+
+
+
+
+
+ ☆ Designing and Evaluating General-Purpose User Representations Based on
+ Behavioral Logs from a Measurement Process Perspective: A Case Study with
+ Snapchat
+
+
+
+
+
+
+
+
+ Qixiang Fang, Zhihan Zhou, Francesco Barbieri, Yozen Liu, Leonardo Neves, Dong Nguyen, Daniel L. Oberski, Maarten W. Bos, Ron Dotsch
+
+
+ In human-computer interaction, understanding user behaviors and tailoring
+systems accordingly is pivotal. To this end, general-purpose user
+representation learning based on behavior logs is emerging as a powerful tool
+in user modeling, offering adaptability to various downstream tasks such as
+item recommendations and ad conversion prediction, without the need to
+fine-tune the upstream user model. While this methodology has shown promise in
+contexts like search engines and e-commerce platforms, its fit for instant
+messaging apps, a cornerstone of modern digital communication, remains largely
+uncharted. These apps, with their distinct interaction patterns, data
+structures, and user expectations, necessitate specialized attention. We
+explore this user modeling approach with Snapchat data as a case study.
+Furthermore, we introduce a novel design and evaluation framework rooted in the
+principles of the Measurement Process Framework from social science research
+methodology. Using this new framework, we design a Transformer-based user model
+that can produce high-quality general-purpose user representations for instant
+messaging platforms like Snapchat.
+
+
+
+
+
+
+
+ ☆ VITA: 'Carefully Chosen and Weighted Less' Is Better in Medication
+ Recommendation AAAI 2024
+
+
+
+
+
+
+
+
+ Taeri Kim, Jiho Heo, Hongil Kim, Kijung Shin, Sang-Wook Kim
+
+
+ We address the medication recommendation problem, which aims to recommend
+effective medications for a patient's current visit by utilizing information
+(e.g., diagnoses and procedures) given at the patient's current and past
+visits. While there exist a number of recommender systems designed for this
+problem, we point out that they are challenged in accurately capturing the
+relation (spec., the degree of relevance) between the current and each of the
+past visits for the patient when obtaining her current health status, which is
+the basis for recommending medications. To address this limitation, we propose
+a novel medication recommendation framework, named VITA, based on the following
+two novel ideas: (1) relevant-Visit selectIon; (2) Target-aware Attention.
+Through extensive experiments using real-world datasets, we demonstrate the
+superiority of VITA (spec., up to 5.56% higher accuracy, in terms of Jaccard,
+than the best competitor) and the effectiveness of its two core ideas. The code
+is available at https://github.com/jhheo0123/VITA.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ On the Effectiveness of Sampled Softmax Loss for Item Recommendation
+
+
+ The learning objective plays a fundamental role to build a recommender
+system. Most methods routinely adopt either pointwise or pairwise loss to train
+the model parameters, while rarely pay attention to softmax loss due to its
+computational complexity when scaling up to large datasets or intractability
+for streaming data. The sampled softmax (SSM) loss emerges as an efficient
+substitute for softmax loss. Its special case, InfoNCE loss, has been widely
+used in self-supervised learning and exhibited remarkable performance for
+contrastive learning. Nonetheless, limited recommendation work uses the SSM
+loss as the learning objective. Worse still, none of them explores its
+properties thoroughly and answers ``Does SSM loss suit for item
+recommendation?'' and ``What are the conceptual advantages of SSM loss, as
+compared with the prevalent losses?'', to the best of our knowledge.
+ In this work, we aim to offer a better understanding of SSM for item
+recommendation. Specifically, we first theoretically reveal three
+model-agnostic advantages: (1) mitigating popularity bias; (2) mining hard
+negative samples; and (3) maximizing the ranking metric. However, based on our
+empirical studies, we recognize that the default choice of cosine similarity
+function in SSM limits its ability in learning the magnitudes of representation
+vectors. As such, the combinations of SSM with the models that also fall short
+in adjusting magnitudes may result in poor representations. One step further,
+we provide mathematical proof that message passing schemes in graph convolution
+networks can adjust representation magnitude according to node degree, which
+naturally compensates for the shortcoming of SSM. Extensive experiments on four
+benchmark datasets justify our analyses, demonstrating the superiority of SSM
+for item recommendation. Our implementations are available in both TensorFlow
+and PyTorch.
+
+
+
+ comment: Accepted by TOIS
+
+
+
+
+
+
+ ♻ ☆ CaseGNN: Graph Neural Networks for Legal Case Retrieval with
+ Text-Attributed Graphs
+
+
+ Legal case retrieval is an information retrieval task in the legal domain,
+which aims to retrieve relevant cases with a given query case. Recent research
+of legal case retrieval mainly relies on traditional bag-of-words models and
+language models. Although these methods have achieved significant improvement
+in retrieval accuracy, there are still two challenges: (1) Legal structural
+information neglect. Previous neural legal case retrieval models mostly encode
+the unstructured raw text of case into a case representation, which causes the
+lack of important legal structural information in a case and leads to poor case
+representation; (2) Lengthy legal text limitation. When using the powerful
+BERT-based models, there is a limit of input text lengths, which inevitably
+requires to shorten the input via truncation or division with a loss of legal
+context information. In this paper, a graph neural networks-based legal case
+retrieval model, CaseGNN, is developed to tackle these challenges. To
+effectively utilise the legal structural information during encoding, a case is
+firstly converted into a Text-Attributed Case Graph (TACG), followed by a
+designed Edge Graph Attention Layer and a readout function to obtain the case
+graph representation. The CaseGNN model is optimised with a carefully designed
+contrastive loss with easy and hard negative sampling. Since the text
+attributes in the case graph come from individual sentences, the restriction of
+using language models is further avoided without losing the legal context.
+Extensive experiments have been conducted on two benchmarks from COLIEE 2022
+and COLIEE 2023, which demonstrate that CaseGNN outperforms other
+state-of-the-art legal case retrieval methods. The code has been released on
+https://github.com/yanran-tang/CaseGNN.
+
+
+
+
+
+
+
+ ♻ ☆ Ad-load Balancing via Off-policy Learning in a Content Marketplace RecSys
+ '23
+
+
+ Ad-load balancing is a critical challenge in online advertising systems,
+particularly in the context of social media platforms, where the goal is to
+maximize user engagement and revenue while maintaining a satisfactory user
+experience. This requires the optimization of conflicting objectives, such as
+user satisfaction and ads revenue. Traditional approaches to ad-load balancing
+rely on static allocation policies, which fail to adapt to changing user
+preferences and contextual factors. In this paper, we present an approach that
+leverages off-policy learning and evaluation from logged bandit feedback. We
+start by presenting a motivating analysis of the ad-load balancing problem,
+highlighting the conflicting objectives between user satisfaction and ads
+revenue. We emphasize the nuances that arise due to user heterogeneity and the
+dependence on the user's position within a session. Based on this analysis, we
+define the problem as determining the optimal ad-load for a particular feed
+fetch. To tackle this problem, we propose an off-policy learning framework that
+leverages unbiased estimators such as Inverse Propensity Scoring (IPS) and
+Doubly Robust (DR) to learn and estimate the policy values using offline
+collected stochastic data. We present insights from online A/B experiments
+deployed at scale across over 80 million users generating over 200 million
+sessions, where we find statistically significant improvements in both user
+satisfaction metrics and ads revenue for the platform.
+
+
+
+ comment: Early version presented at the CONSEQUENCES '23 workshop at RecSys
+ '23, final version appearing at WSDM '24
+
+ Bundle recommendation seeks to recommend a bundle of related items to users
+to improve both user experience and the profits of platform. Existing bundle
+recommendation models have progressed from capturing only user-bundle
+interactions to the modeling of multiple relations among users, bundles and
+items. CrossCBR, in particular, incorporates cross-view contrastive learning
+into a two-view preference learning framework, significantly improving SOTA
+performance. It does, however, have two limitations: 1) the two-view
+formulation does not fully exploit all the heterogeneous relations among users,
+bundles and items; and 2) the "early contrast and late fusion" framework is
+less effective in capturing user preference and difficult to generalize to
+multiple views. In this paper, we present MultiCBR, a novel Multi-view
+Contrastive learning framework for Bundle Recommendation. First, we devise a
+multi-view representation learning framework capable of capturing all the
+user-bundle, user-item and bundle-item relations, especially better utilizing
+the bundle-item affiliations to enhance sparse bundles' representations.
+Second, we innovatively adopt an "early fusion and late contrast" design that
+first fuses the multi-view representations before performing self-supervised
+contrastive learning. In comparison to existing approaches, our framework
+reverses the order of fusion and contrast, introducing the following
+advantages: 1)our framework is capable of modeling both cross-view and ego-view
+preferences, allowing us to achieve enhanced user preference modeling; and 2)
+instead of requiring quadratic number of cross-view contrastive losses, we only
+require two self-supervised contrastive losses, resulting in minimal extra
+costs. Experimental results on three public datasets indicate that our method
+outperforms SOTA methods.
+
+
+
+ comment: fix a typo in Table 2, i.e., the R@20 and N@20 of LightGCL are
+ updated
+
+
+
+
+
+
+ ♻ ☆ Deep Hashing via Householder Quantization
+
+
+
+
+
+
+
+
+ Lucas R. Schwengber, Lucas Resende, Paulo Orenstein, Roberto I. Oliveira
+
+
+ Hashing is at the heart of large-scale image similarity search, and recent
+methods have been substantially improved through deep learning techniques. Such
+algorithms typically learn continuous embeddings of the data. To avoid a
+subsequent costly binarization step, a common solution is to employ loss
+functions that combine a similarity learning term (to ensure similar images are
+grouped to nearby embeddings) and a quantization penalty term (to ensure that
+the embedding entries are close to binarized entries, e.g., -1 or 1). Still,
+the interaction between these two terms can make learning harder and the
+embeddings worse. We propose an alternative quantization strategy that
+decomposes the learning problem in two stages: first, perform similarity
+learning over the embedding space with no quantization; second, find an optimal
+orthogonal transformation of the embeddings so each coordinate of the embedding
+is close to its sign, and then quantize the transformed embedding through the
+sign function. In the second step, we parametrize orthogonal transformations
+using Householder matrices to efficiently leverage stochastic gradient descent.
+Since similarity measures are usually invariant under orthogonal
+transformations, this quantization strategy comes at no cost in terms of
+performance. The resulting algorithm is unsupervised, fast, hyperparameter-free
+and can be run on top of any existing deep hashing or metric learning
+algorithm. We provide extensive experimental results showing that this approach
+leads to state-of-the-art performance on widely used image datasets, and,
+unlike other quantization strategies, brings consistent improvements in
+performance to existing deep hashing algorithms.
+
+
+ Amodal perception, the ability to comprehend complete object structures from
+partial visibility, is a fundamental skill, even for infants. Its significance
+extends to applications like autonomous driving, where a clear understanding of
+heavily occluded objects is essential. However, modern detection and tracking
+algorithms often overlook this critical capability, perhaps due to the
+prevalence of modal annotations in most datasets. To address the scarcity of
+amodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse
+categories in thousands of video sequences. Our dataset includes amodal and
+modal bounding boxes for visible and occluded objects, including objects that
+are partially out-of-frame. To enhance amodal tracking with object permanence,
+we leverage a lightweight plug-in module, the amodal expander, to transform
+standard, modal trackers into amodal ones through fine-tuning on a few hundred
+video sequences with data augmentation. We achieve a 3.3\% and 1.6\%
+improvement on the detection and tracking of occluded objects on TAO-Amodal.
+When evaluated on people, our method produces dramatic improvements of 2x
+compared to state-of-the-art modal baselines.
+
+
+ We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel
+title reranking technique to achieve efficient title reranking 20x-40x faster
+than vanilla passage reranker. However, one of the challenges with the training
+of Efficient Title Reranker is the instability. Analyzing the issue, we found
+some very difficult ground truths might act as noisy labels causing accuracy to
+drop as well as some extreme values in model probability output causing nan. To
+address these issues, we introduce the Sigmoid Trick, a novel technique that
+reduces the gradient update of both cases resulting in better retrieval
+efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we
+achieved four state-of-the-art positions on the kilt knowledge benchmark.
+
+
+
+
+
+
+
+ ☆ Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
+ Diffusion Models
+
+
+ The quality of the prompts provided to text-to-image diffusion models
+determines how faithful the generated content is to the user's intent, often
+requiring `prompt engineering'. To harness visual concepts from target images
+without prompt engineering, current approaches largely rely on embedding
+inversion by optimizing and then mapping them to pseudo-tokens. However,
+working with such high-dimensional vector representations is challenging
+because they lack semantics and interpretability, and only allow simple vector
+operations when using them. Instead, this work focuses on inverting the
+diffusion model to obtain interpretable language prompts directly. The
+challenge of doing this lies in the fact that the resulting optimization
+problem is fundamentally discrete and the space of prompts is exponentially
+large; this makes using standard optimization techniques, such as stochastic
+gradient descent, difficult. To this end, we utilize a delayed projection
+scheme to optimize for prompts representative of the vocabulary space in the
+model. Further, we leverage the findings that different timesteps of the
+diffusion process cater to different levels of detail in an image. The later,
+noisy, timesteps of the forward diffusion process correspond to the semantic
+information, and therefore, prompt inversion in this range provides tokens
+representative of the image semantics. We show that our approach can identify
+semantically interpretable and meaningful prompts for a target image which can
+be used to synthesize diverse images with similar content. We further
+illustrate the application of the optimized prompts in evolutionary image
+generation and concept removal.
+
+
+
+
+
+
+
+ ☆ New classes of the greedy-applicable arm feature distributions in the
+ sparse linear bandit problem
+
+
+ We consider the sparse contextual bandit problem where arm feature affects
+reward through the inner product of sparse parameters. Recent studies have
+developed sparsity-agnostic algorithms based on the greedy arm selection
+policy. However, the analysis of these algorithms requires strong assumptions
+on the arm feature distribution to ensure that the greedily selected samples
+are sufficiently diverse; One of the most common assumptions, relaxed symmetry,
+imposes approximate origin-symmetry on the distribution, which cannot allow
+distributions that has origin-asymmetric support. In this paper, we show that
+the greedy algorithm is applicable to a wider range of the arm feature
+distributions from two aspects. Firstly, we show that a mixture distribution
+that has a greedy-applicable component is also greedy-applicable. Second, we
+propose new distribution classes, related to Gaussian mixture, discrete, and
+radial distribution, for which the sample diversity is guaranteed. The proposed
+classes can describe distributions with origin-asymmetric support and, in
+conjunction with the first claim, provide theoretical guarantees of the greedy
+policy for a very wide range of the arm feature distributions.
+
+
+
+
+
+
+
+ ☆ Chasing Fairness in Graphs: A GNN Architecture Perspective AAAI
+
+
+
+
+
+
+
+
+ Zhimeng Jiang, Xiaotian Han, Chao Fan, Zirui Liu, Na Zou, Ali Mostafavi, Xia Hu
+
+
+ There has been significant progress in improving the performance of graph
+neural networks (GNNs) through enhancements in graph data, model architecture
+design, and training strategies. For fairness in graphs, recent studies achieve
+fair representations and predictions through either graph data pre-processing
+(e.g., node feature masking, and topology rewiring) or fair training strategies
+(e.g., regularization, adversarial debiasing, and fair contrastive learning).
+How to achieve fairness in graphs from the model architecture perspective is
+less explored. More importantly, GNNs exhibit worse fairness performance
+compared to multilayer perception since their model architecture (i.e.,
+neighbor aggregation) amplifies biases. To this end, we aim to achieve fairness
+via a new GNN architecture. We propose \textsf{F}air \textsf{M}essage
+\textsf{P}assing (FMP) designed within a unified optimization framework for
+GNNs. Notably, FMP \textit{explicitly} renders sensitive attribute usage in
+\textit{forward propagation} for node classification task using cross-entropy
+loss without data pre-processing. In FMP, the aggregation is first adopted to
+utilize neighbors' information and then the bias mitigation step explicitly
+pushes demographic group node presentation centers together. In this way, FMP
+scheme can aggregate useful information from neighbors and mitigate bias to
+achieve better fairness and prediction tradeoff performance. Experiments on
+node classification tasks demonstrate that the proposed FMP outperforms several
+baselines in terms of fairness and accuracy on three real-world datasets. The
+code is available in {\url{https://github.com/zhimengj0326/FMP}}.
+
+
+
+ comment: Accepted by AAAI Conference on Artificial Intelligence (AAAI) 2024.
+ arXiv admin note: substantial text overlap with arXiv:2202.04187
+
+
+
+
+
+
+ ☆ Modeling non-linear Effects with Neural Networks in Relational Event
+ Models
+
+
+
+
+
+
+
+
+ Edoardo Filippi-Mazzola, Ernst C. Wit
+
+
+ Dynamic networks offer an insight of how relational systems evolve. However,
+modeling these networks efficiently remains a challenge, primarily due to
+computational constraints, especially as the number of observed events grows.
+This paper addresses this issue by introducing the Deep Relational Event
+Additive Model (DREAM) as a solution to the computational challenges presented
+by modeling non-linear effects in Relational Event Models (REMs). DREAM relies
+on Neural Additive Models to model non-linear effects, allowing each effect to
+be captured by an independent neural network. By strategically trading
+computational complexity for improved memory management and leveraging the
+computational capabilities of Graphic Processor Units (GPUs), DREAM efficiently
+captures complex non-linear relationships within data. This approach
+demonstrates the capability of DREAM in modeling dynamic networks and scaling
+to larger networks. Comparisons with traditional REM approaches showcase DREAM
+superior computational efficiency. The model potential is further demonstrated
+by an examination of the patent citation network, which contains nearly 8
+million nodes and 100 million events.
+
+
+
+
+
+
+
+ ☆ On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation
+
+
+ Imitation learning with visual observations is notoriously inefficient when
+addressed with end-to-end behavioural cloning methods. In this paper, we
+explore an alternative paradigm which decomposes reasoning into three phases.
+First, a retrieval phase, which informs the robot what it can do with an
+object. Second, an alignment phase, which informs the robot where to interact
+with the object. And third, a replay phase, which informs the robot how to
+interact with the object. Through a series of real-world experiments on
+everyday tasks, such as grasping, pouring, and inserting objects, we show that
+this decomposition brings unprecedented learning efficiency, and effective
+inter- and intra-class generalisation. Videos are available at
+https://www.robot-learning.uk/retrieval-alignment-replay.
+
+
+
+ comment: Published in IEEE Robotics and Automation Letters (RA-L). (Accepted
+ December 2023)
+
+
+
+
+
+
+ ☆ Value Explicit Pretraining for Goal-Based Transfer Learning
+
+
+ We propose a method that allows for learning task-agnostic representations
+based on value function estimates from a sequence of observations where the
+last frame corresponds to a goal. These representations would learn to relate
+states across different tasks, based on the temporal distance to the goal
+state, irrespective of the appearance changes and dynamics. This method could
+be used to transfer learnt policies/skills to unseen related tasks.
+
+
+
+ comment: Accepted at CoRL 2023 Workshop on PRL
+
+
+
+
+
+
+ ☆ pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable
+ Generalizable 3D Reconstruction
+
+
+
+
+
+
+
+
+ David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann
+
+
+ We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D
+radiance fields parameterized by 3D Gaussian primitives from pairs of images.
+Our model features real-time and memory-efficient rendering for scalable
+training as well as fast 3D reconstruction at inference time. To overcome local
+minima inherent to sparse and locally supported representations, we predict a
+dense probability distribution over 3D and sample Gaussian means from that
+probability distribution. We make this sampling operation differentiable via a
+reparameterization trick, allowing us to back-propagate gradients through the
+Gaussian splatting representation. We benchmark our method on wide-baseline
+novel view synthesis on the real-world RealEstate10k and ACID datasets, where
+we outperform state-of-the-art light field transformers and accelerate
+rendering by 2.5 orders of magnitude while reconstructing an interpretable and
+editable 3D radiance field.
+
+
+ With the recent surge in popularity of LLMs has come an ever-increasing need
+for LLM safety training. In this paper, we show that SOTA open-source LLMs are
+vulnerable to simple, optimization-free attacks we refer to as $\textit{priming
+attacks}$, which are easy to execute and effectively bypass alignment from
+safety training. Our proposed attack improves the Attack Success Rate on
+Harmful Behaviors, as measured by Llama Guard, by up to $3.3\times$ compared to
+baselines. Source code and data are available at
+https://github.com/uiuc-focal-lab/llm-priming-attacks .
+
+
+
+
+
+
+
+ ☆ An Alternate View on Optimal Filtering in an RKHS
+
+
+
+
+
+
+
+
+ Benjamin Colburn, Jose C. Principe, Luis G. Sanchez Giraldo
+
+
+ Kernel Adaptive Filtering (KAF) are mathematically principled methods which
+search for a function in a Reproducing Kernel Hilbert Space. While they work
+well for tasks such as time series prediction and system identification they
+are plagued by a linear relationship between number of training samples and
+model size, hampering their use on the very large data sets common in today's
+data saturated world. Previous methods try to solve this issue by
+sparsification. We describe a novel view of optimal filtering which may provide
+a route towards solutions in a RKHS which do not necessarily have this linear
+growth in model size. We do this by defining a RKHS in which the time structure
+of a stochastic process is still present. Using correntropy [11], an extension
+of the idea of a covariance function, we create a time based functional which
+describes some potentially nonlinear desired mapping function. This form of a
+solution may provide a fruitful line of research for creating more efficient
+representations of functionals in a RKHS, while theoretically providing
+computational complexity in the test set similar to Wiener solution.
+
+
+
+ comment: 5 pages, 2 figures
+
+
+
+
+
+
+ ☆ Celestial Machine Learning: Discovering the Planarity, Heliocentricity,
+ and Orbital Equation of Mars with AI Feynman
+
+
+
+
+
+
+
+
+ Zi-Yu Khoo, Gokul Rajiv, Abel Yang, Jonathan Sze Choong Low, Stéphane Bressan
+
+
+ Can a machine or algorithm discover or learn the elliptical orbit of Mars
+from astronomical sightings alone? Johannes Kepler required two paradigm shifts
+to discover his First Law regarding the elliptical orbit of Mars. Firstly, a
+shift from the geocentric to the heliocentric frame of reference. Secondly, the
+reduction of the orbit of Mars from a three- to a two-dimensional space. We
+extend AI Feynman, a physics-inspired tool for symbolic regression, to discover
+the heliocentricity and planarity of Mars' orbit and emulate his discovery of
+Kepler's first law.
+
+
+
+
+
+
+
+ ☆ Prompt-based Domain Discrimination for Multi-source Time Series Domain
+ Adaptation
+
+
+ Time series domain adaptation stands as a pivotal and intricate challenge
+with diverse applications, including but not limited to human activity
+recognition, sleep stage classification, and machine fault diagnosis. Despite
+the numerous domain adaptation techniques proposed to tackle this complex
+problem, their primary focus has been on the common representations of time
+series data. This concentration might inadvertently lead to the oversight of
+valuable domain-specific information originating from different source domains.
+To bridge this gap, we introduce POND, a novel prompt-based deep learning model
+designed explicitly for multi-source time series domain adaptation. POND is
+tailored to address significant challenges, notably: 1) The unavailability of a
+quantitative relationship between meta-data information and time series
+distributions, and 2) The dearth of exploration into extracting domain-specific
+meta-data information. In this paper, we present an instance-level prompt
+generator and a fidelity loss mechanism to facilitate the faithful learning of
+meta-data information. Additionally, we propose a domain discrimination
+technique to discern domain-specific meta-data information from multiple source
+domains. Our approach involves a simple yet effective meta-learning algorithm
+to optimize the objective efficiently. Furthermore, we augment the model's
+performance by incorporating the Mixture of Expert (MoE) technique. The
+efficacy and robustness of our proposed POND model are extensively validated
+through experiments across 50 scenarios encompassing five datasets, which
+demonstrates that our proposed POND model outperforms the state-of-the-art
+methods by up to $66\%$ on the F1-score.
+
+
+
+ comment: Undergoing work
+
+
+
+
+
+
+ ☆ Emergence of In-Context Reinforcement Learning from Noise Distillation
+
+
+ In-Context Reinforcement Learning is an emerging field with great potential
+for advancing Artificial Intelligence. Its core capability lies in generalizing
+to unseen tasks through interaction with the environment. To master these
+capabilities, an agent must be trained on specifically curated data that
+includes a policy improvement that an algorithm seeks to extract and then apply
+in context in the environment. However, for numerous tasks, training RL agents
+may be unfeasible, while obtaining human demonstrations can be relatively easy.
+Additionally, it is rare to be given the optimal policy, typically, only
+suboptimal demonstrations are available. We propose $AD^{\epsilon}$, a method
+that leverages demonstrations without policy improvement and enables multi-task
+in-context learning in the presence of a suboptimal demonstrator. This is
+achieved by artificially creating a history of incremental improvement, wherein
+noise is systematically introduced into the demonstrator's policy.
+Consequently, each successive transition illustrates a marginally better
+trajectory than the previous one. Our approach was tested on the Dark Room and
+Dark Key-to-Door environments, resulting in over a $\textbf{2}$x improvement
+compared to the best available policy in the data.
+
+
+
+ comment: Preprint, work in progress
+
+
+
+
+
+
+ ☆ Inferring the relationship between soil temperature and the normalized
+ difference vegetation index with machine learning
+
+
+
+
+
+
+
+
+ Steven Mortier, Amir Hamedpour, Bart Bussmann, Ruth Phoebe Tchana Wandji, Steven Latré, Bjarni D. Sigurdsson, Tom De Schepper, Tim Verdonck
+
+
+ Changes in climate can greatly affect the phenology of plants, which can have
+important feedback effects, such as altering the carbon cycle. These
+phenological feedback effects are often induced by a shift in the start or end
+dates of the growing season of plants. The normalized difference vegetation
+index (NDVI) serves as a straightforward indicator for assessing the presence
+of green vegetation and can also provide an estimation of the plants' growing
+season. In this study, we investigated the effect of soil temperature on the
+timing of the start of the season (SOS), timing of the peak of the season
+(POS), and the maximum annual NDVI value (PEAK) in subarctic grassland
+ecosystems between 2014 and 2019. We also explored the impact of other
+meteorological variables, including air temperature, precipitation, and
+irradiance, on the inter-annual variation in vegetation phenology. Using
+machine learning (ML) techniques and SHapley Additive exPlanations (SHAP)
+values, we analyzed the relative importance and contribution of each variable
+to the phenological predictions. Our results reveal a significant relationship
+between soil temperature and SOS and POS, indicating that higher soil
+temperatures lead to an earlier start and peak of the growing season. However,
+the Peak NDVI values showed just a slight increase with higher soil
+temperatures. The analysis of other meteorological variables demonstrated their
+impacts on the inter-annual variation of the vegetation phenology. Ultimately,
+this study contributes to our knowledge of the relationships between soil
+temperature, meteorological variables, and vegetation phenology, providing
+valuable insights for predicting vegetation phenology characteristics and
+managing subarctic grasslands in the face of climate change. Additionally, this
+work provides a solid foundation for future ML-based vegetation phenology
+studies.
+
+
+
+ comment: 31 pages, 7 figures, 5 tables
+
+
+
+
+
+
+ ☆ TaskFlex Solver for Multi-Agent Pursuit via Automatic Curriculum
+ Learning
+
+
+ This paper addresses the problem of multi-agent pursuit, where slow pursuers
+cooperate to capture fast evaders in a confined environment with obstacles.
+Existing heuristic algorithms often lack expressive coordination strategies and
+are highly sensitive to task conditions, requiring extensive hyperparameter
+tuning. In contrast, reinforcement learning (RL) has been applied to this
+problem and is capable of obtaining cooperative pursuit strategies. However,
+RL-based methods face challenges in training for complex scenarios due to the
+vast amount of training data and limited adaptability to varying task
+conditions, such as different scene sizes, varying numbers and speeds of
+obstacles, and flexible speed ratios of the evader to the pursuer. In this
+work, we combine RL and curriculum learning to introduce a flexible solver for
+multiagent pursuit problems, named TaskFlex Solver (TFS), which is capable of
+solving multi-agent pursuit problems with diverse and dynamically changing task
+conditions in both 2-dimensional and 3-dimensional scenarios. TFS utilizes a
+curriculum learning method that constructs task distributions based on training
+progress, enhancing training efficiency and final performance. Our algorithm
+consists of two main components: the Task Evaluator, which evaluates task
+success rates and selects tasks of moderate difficulty to maintain a curriculum
+archive, and the Task Sampler, which constructs training distributions by
+sampling tasks from the curriculum archive to maximize policy improvement.
+Experiments show that TFS produces much stronger performance than baselines and
+achieves close to 100% capture rates in both 2-dimensional and 3-dimensional
+multi-agent pursuit problems with diverse and dynamically changing scenes. The
+project website is at https://sites.google.com/view/tfs-2023.
+
+
+
+
+
+
+
+ ☆ MDD-UNet: Domain Adaptation for Medical Image Segmentation with
+ Theoretical Guarantees, a Proof of Concept
+
+
+ The current state-of-the art techniques for image segmentation are often
+based on U-Net architectures, a U-shaped encoder-decoder networks with skip
+connections. Despite the powerful performance, the architecture often does not
+perform well when used on data which has different characteristics than the
+data it was trained on. Many techniques for improving performance in the
+presence of domain shift have been developed, however typically only have loose
+connections to the theory of domain adaption. In this work, we propose an
+unsupervised domain adaptation framework for U-Nets with theoretical guarantees
+based on the Margin Disparity Discrepancy [1] called the MDD-UNet. We evaluate
+the proposed technique on the task of hippocampus segmentation, and find that
+the MDD-UNet is able to learn features which are domain-invariant with no
+knowledge about the labels in the target domain. The MDD-UNet improves
+performance over the standard U-Net on 11 out of 12 combinations of datasets.
+This work serves as a proof of concept by demonstrating an improvement on the
+U-Net in it's standard form without modern enhancements, which opens up a new
+avenue of studying domain adaptation for models with very large hypothesis
+spaces from both methodological and practical perspectives. Code is available
+at https://github.com/asbjrnmunk/mdd-unet.
+
+
+
+ comment: Published at NLDL 2024
+
+
+
+
+
+
+ ☆ Roll With the Punches: Expansion and Shrinkage of Soft Label Selection
+ for Semi-supervised Fine-Grained Learning AAAI 2024
+
+
+
+
+
+
+
+
+ Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
+
+
+ While semi-supervised learning (SSL) has yielded promising results, the more
+realistic SSL scenario remains to be explored, in which the unlabeled data
+exhibits extremely high recognition difficulty, e.g., fine-grained visual
+classification in the context of SSL (SS-FGVC). The increased recognition
+difficulty on fine-grained unlabeled data spells disaster for pseudo-labeling
+accuracy, resulting in poor performance of the SSL model. To tackle this
+challenge, we propose Soft Label Selection with Confidence-Aware Clustering
+based on Class Transition Tracking (SoC) by reconstructing the pseudo-label
+selection process by jointly optimizing Expansion Objective and Shrinkage
+Objective, which is based on a soft label manner. Respectively, the former
+objective encourages soft labels to absorb more candidate classes to ensure the
+attendance of ground-truth class, while the latter encourages soft labels to
+reject more noisy classes, which is theoretically proved to be equivalent to
+entropy minimization. In comparisons with various state-of-the-art methods, our
+approach demonstrates its superior performance in SS-FGVC. Checkpoints and
+source code are available at https://github.com/NJUyued/SoC4SS-FGVC.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Generalization Analysis of Machine Learning Algorithms via the
+ Worst-Case Data-Generating Probability Measure AAAI
+
+
+
+
+
+
+
+
+ Xinying Zou, Samir M. Perlaza, Iñaki Esnaola, Eitan Altman
+
+
+ In this paper, the worst-case probability measure over the data is introduced
+as a tool for characterizing the generalization capabilities of machine
+learning algorithms. More specifically, the worst-case probability measure is a
+Gibbs probability measure and the unique solution to the maximization of the
+expected loss under a relative entropy constraint with respect to a reference
+probability measure. Fundamental generalization metrics, such as the
+sensitivity of the expected loss, the sensitivity of the empirical risk, and
+the generalization gap are shown to have closed-form expressions involving the
+worst-case data-generating probability measure. Existing results for the Gibbs
+algorithm, such as characterizing the generalization gap as a sum of mutual
+information and lautum information, up to a constant factor, are recovered. A
+novel parallel is established between the worst-case data-generating
+probability measure and the Gibbs algorithm. Specifically, the Gibbs
+probability measure is identified as a fundamental commonality of the model
+space and the data space for machine learning algorithms.
+
+
+
+ comment: To appear in the Proceedings of the AAAI Conference on Artificial
+ Intelligence (7 + 2 pages)
+
+
+
+
+
+
+ ☆ It's All in the Mix: Wasserstein Machine Learning with Mixed Features NeurIPS 2022
+
+
+ Problem definition: The recent advent of data-driven and end-to-end
+decision-making across different areas of operations management has led to an
+ever closer integration of prediction models from machine learning and
+optimization models from operations research. A key challenge in this context
+is the presence of estimation errors in the prediction models, which tend to be
+amplified by the subsequent optimization model -- a phenomenon that is often
+referred to as the Optimizer's Curse or the Error-Maximization Effect of
+Optimization.
+ Methodology/results: A contemporary approach to combat such estimation errors
+is offered by distributionally robust problem formulations that consider all
+data-generating distributions close to the empirical distribution derived from
+historical samples, where `closeness' is determined by the Wasserstein
+distance. While those techniques show significant promise in problems where all
+input features are continuous, they scale exponentially when binary and/or
+categorical features are present. This paper demonstrates that such
+mixed-feature problems can indeed be solved in polynomial time. We present a
+practically efficient algorithm to solve mixed-feature problems, and we compare
+our method against alternative techniques both theoretically and empirically on
+standard benchmark instances.
+ Managerial implications: Data-driven operations management problems often
+involve prediction models with discrete features. We develop and analyze a
+methodology that faithfully accounts for the presence of discrete features, and
+we demonstrate that our approach can significantly outperform existing methods
+that are agnostic to the presence of discrete features, both theoretically and
+across standard benchmark instances.
+
+
+
+ comment: 48 pages (31 main + proofs), 7 tables, 2 colored plots, an early
+ version appeared in NeurIPS 2022 main track (arXiv 2205.13501)
+
+
+
+
+
+
+ ☆ On the Parameterization of Second-Order Optimization Effective Towards
+ the Infinite Width
+
+
+ Second-order optimization has been developed to accelerate the training of
+deep neural networks and it is being applied to increasingly larger-scale
+models. In this study, towards training on further larger scales, we identify a
+specific parameterization for second-order optimization that promotes feature
+learning in a stable manner even if the network width increases significantly.
+Inspired by a maximal update parameterization, we consider a one-step update of
+the gradient and reveal the appropriate scales of hyperparameters including
+random initialization, learning rates, and damping terms. Our approach covers
+two major second-order optimization algorithms, K-FAC and Shampoo, and we
+demonstrate that our parameterization achieves higher generalization
+performance in feature learning. In particular, it enables us to transfer the
+hyperparameters across models with different widths.
+
+
+
+ comment: 34 pages
+
+
+
+
+
+
+ ☆ Self-Supervised Detection of Perfect and Partial Input-Dependent
+ Symmetries
+
+
+ Group equivariance ensures consistent responses to group transformations of
+the input, leading to more robust models and enhanced generalization
+capabilities. However, this property can lead to overly constrained models if
+the symmetries considered in the group differ from those observed in data.
+While common methods address this by determining the appropriate level of
+symmetry at the dataset level, they are limited to supervised settings and
+ignore scenarios in which multiple levels of symmetry co-exist in the same
+dataset. For instance, pictures of cars and planes exhibit different levels of
+rotation, yet both are included in the CIFAR-10 dataset. In this paper, we
+propose a method able to detect the level of symmetry of each input without the
+need for labels. To this end, we derive a sufficient and necessary condition to
+learn the distribution of symmetries in the data. Using the learned
+distribution, we generate pseudo-labels that allow us to learn the levels of
+symmetry of each input in a self-supervised manner. We validate the
+effectiveness of our approach on synthetic datasets with different per-class
+levels of symmetries e.g. MNISTMultiple, in which digits are uniformly rotated
+within a class-dependent interval. We demonstrate that our method can be used
+for practical applications such as the generation of standardized datasets in
+which the symmetries are not present, as well as the detection of
+out-of-distribution symmetries during inference. By doing so, both the
+generalization and robustness of non-equivariant models can be improved. Our
+code is publicly available at https://github.com/aurban0/ssl-sym.
+
+
+
+
+
+
+
+ ☆ Sharing is CAIRing: Characterizing Principles and Assessing Properties
+ of Universal Privacy Evaluation for Synthetic Tabular Data
+
+
+
+
+
+
+
+
+ Tobias Hyrup, Anton Danholt Lautrup, Arthur Zimek, Peter Schneider-Kamp
+
+
+ Data sharing is a necessity for innovative progress in many domains,
+especially in healthcare. However, the ability to share data is hindered by
+regulations protecting the privacy of natural persons. Synthetic tabular data
+provide a promising solution to address data sharing difficulties but does not
+inherently guarantee privacy. Still, there is a lack of agreement on
+appropriate methods for assessing the privacy-preserving capabilities of
+synthetic data, making it difficult to compare results across studies. To the
+best of our knowledge, this is the first work to identify properties that
+constitute good universal privacy evaluation metrics for synthetic tabular
+data. The goal of such metrics is to enable comparability across studies and to
+allow non-technical stakeholders to understand how privacy is protected. We
+identify four principles for the assessment of metrics: Comparability,
+Applicability, Interpretability, and Representativeness (CAIR). To quantify and
+rank the degree to which evaluation metrics conform to the CAIR principles, we
+design a rubric using a scale of 1-4. Each of the four properties is scored on
+four parameters, yielding 16 total dimensions. We study the applicability and
+usefulness of the CAIR principles and rubric by assessing a selection of
+metrics popular in other studies. The results provide granular insights into
+the strengths and weaknesses of existing metrics that not only rank the metrics
+but highlight areas of potential improvements. We expect that the CAIR
+principles will foster agreement among researchers and organizations on which
+universal privacy evaluation metrics are appropriate for synthetic tabular
+data.
+
+
+
+
+
+
+
+ ☆ Identification of Causal Structure in the Presence of Missing Data with
+ Additive Noise Model AAAI-2024
+
+
+ Missing data are an unavoidable complication frequently encountered in many
+causal discovery tasks. When a missing process depends on the missing values
+themselves (known as self-masking missingness), the recovery of the joint
+distribution becomes unattainable, and detecting the presence of such
+self-masking missingness remains a perplexing challenge. Consequently, due to
+the inability to reconstruct the original distribution and to discern the
+underlying missingness mechanism, simply applying existing causal discovery
+methods would lead to wrong conclusions. In this work, we found that the recent
+advances additive noise model has the potential for learning causal structure
+under the existence of the self-masking missingness. With this observation, we
+aim to investigate the identification problem of learning causal structure from
+missing data under an additive noise model with different missingness
+mechanisms, where the `no self-masking missingness' assumption can be
+eliminated appropriately. Specifically, we first elegantly extend the scope of
+identifiability of causal skeleton to the case with weak self-masking
+missingness (i.e., no other variable could be the cause of self-masking
+indicators except itself). We further provide the sufficient and necessary
+identification conditions of the causal direction under additive noise model
+and show that the causal structure can be identified up to an IN-equivalent
+pattern. We finally propose a practical algorithm based on the above
+theoretical results on learning the causal skeleton and causal direction.
+Extensive experiments on synthetic and real data demonstrate the efficiency and
+effectiveness of the proposed algorithms.
+
+
+
+ comment: Accepted by AAAI-2024
+
+
+
+
+
+
+ ☆ Gaussian process learning of nonlinear dynamics
+
+
+ One of the pivotal tasks in scientific machine learning is to represent
+underlying dynamical systems from time series data. Many methods for such
+dynamics learning explicitly require the derivatives of state data, which are
+not directly available and can be approximated conventionally by finite
+differences. However, the discrete approximations of time derivatives may
+result in a poor estimation when state data are scarce and/or corrupted by
+noise, thus compromising the predictiveness of the learned dynamical models. To
+overcome this technical hurdle, we propose a new method that learns nonlinear
+dynamics through a Bayesian inference of characterizing model parameters. This
+method leverages a Gaussian process representation of states, and constructs a
+likelihood function using the correlation between state data and their
+derivatives, yet prevents explicit evaluations of time derivatives. Through a
+Bayesian scheme, a probabilistic estimate of the model parameters is given by
+the posterior distribution, and thus a quantification is facilitated for
+uncertainties from noisy state data and the learning process. Specifically, we
+will discuss the applicability of the proposed method to two typical scenarios
+for dynamical systems: parameter identification and estimation with an affine
+structure of the system, and nonlinear parametric approximation without prior
+knowledge.
+
+
+
+
+
+
+
+ ☆ CUDC: A Curiosity-Driven Unsupervised Data Collection Method with
+ Adaptive Temporal Distances for Offline Reinforcement Learning AAAI-24
+
+
+ Offline reinforcement learning (RL) aims to learn an effective policy from a
+pre-collected dataset. Most existing works are to develop sophisticated
+learning algorithms, with less emphasis on improving the data collection
+process. Moreover, it is even challenging to extend the single-task setting and
+collect a task-agnostic dataset that allows an agent to perform multiple
+downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised
+Data Collection (CUDC) method to expand feature space using adaptive temporal
+distances for task-agnostic data collection and ultimately improve learning
+efficiency and capabilities for multi-task offline RL. To achieve this, CUDC
+estimates the probability of the k-step future states being reachable from the
+current states, and adapts how many steps into the future that the dynamics
+model should predict. With this adaptive reachability mechanism in place, the
+feature representation can be diversified, and the agent can navigate itself to
+collect higher-quality data with curiosity. Empirically, CUDC surpasses
+existing unsupervised methods in efficiency and learning performance in various
+downstream offline RL tasks of the DeepMind control suite.
+
+
+
+ comment: Accepted at AAAI-24
+
+
+
+
+
+
+ ☆ Decentralised and collaborative machine learning framework for IoT
+
+
+
+
+
+
+
+
+ Martín González-Soto, Rebeca P. Díaz-Redondo, Manuel Fernández-Veiga, Bruno Rodríguez-Castro, Ana Fernández-Vilas
+
+
+ Decentralised machine learning has recently been proposed as a potential
+solution to the security issues of the canonical federated learning approach.
+In this paper, we propose a decentralised and collaborative machine learning
+framework specially oriented to resource-constrained devices, usual in IoT
+deployments. With this aim we propose the following construction blocks. First,
+an incremental learning algorithm based on prototypes that was specifically
+implemented to work in low-performance computing elements. Second, two
+random-based protocols to exchange the local models among the computing
+elements in the network. Finally, two algorithmics approaches for prediction
+and prototype creation. This proposal was compared to a typical centralized
+incremental learning approach in terms of accuracy, training time and
+robustness with very promising results.
+
+
+ Hierarchy is an important and commonly observed topological property in
+real-world graphs that indicate the relationships between supervisors and
+subordinates or the organizational behavior of human groups. As hierarchy is
+introduced as a new inductive bias into the Graph Neural Networks (GNNs) in
+various tasks, it implies latent topological relations for attackers to improve
+their inference attack performance, leading to serious privacy leakage issues.
+In addition, existing privacy-preserving frameworks suffer from reduced
+protection ability in hierarchical propagation due to the deficiency of
+adaptive upper-bound estimation of the hierarchical perturbation boundary. It
+is of great urgency to effectively leverage the hierarchical property of data
+while satisfying privacy guarantees. To solve the problem, we propose the
+Poincar\'e Differential Privacy framework, named PoinDP, to protect the
+hierarchy-aware graph embedding based on hyperbolic geometry. Specifically,
+PoinDP first learns the hierarchy weights for each entity based on the
+Poincar\'e model in hyperbolic space. Then, the Personalized Hierarchy-aware
+Sensitivity is designed to measure the sensitivity of the hierarchical
+structure and adaptively allocate the privacy protection strength. Besides, the
+Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian
+mechanism in Euclidean space to hyperbolic space to realize random
+perturbations that satisfy differential privacy under the hyperbolic space
+metric. Extensive experiment results on five real-world datasets demonstrate
+the proposed PoinDP's advantages of effective privacy protection while
+maintaining good performance on the node classification task.
+
+
+
+
+
+
+
+ ☆ OVD-Explorer:Optimism Should Not Be the Sole Pursuit of Exploration in
+ Noisy Environments AAAI 2024
+
+
+
+
+
+
+
+
+ Jinyi Liu, Zhi Wang, Yan Zheng, Jianye Hao, Chenjia Bai, Junjie Ye, Zhen Wang, Haiyin Piao, Yang Sun
+
+
+ In reinforcement learning, the optimism in the face of uncertainty (OFU) is a
+mainstream principle for directing exploration towards less explored areas,
+characterized by higher uncertainty. However, in the presence of environmental
+stochasticity (noise), purely optimistic exploration may lead to excessive
+probing of high-noise areas, consequently impeding exploration efficiency.
+Hence, in exploring noisy environments, while optimism-driven exploration
+serves as a foundation, prudent attention to alleviating unnecessary
+over-exploration in high-noise areas becomes beneficial. In this work, we
+propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a
+noise-aware optimistic exploration for continuous control. OVD-Explorer
+proposes a new measurement of the policy's exploration ability considering
+noise in optimistic perspectives, and leverages gradient ascent to drive
+exploration. Practically, OVD-Explorer can be easily integrated with continuous
+control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks
+demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic
+exploration.
+
+
+
+ comment: Accepted by AAAI 2024, with appendix
+
+
+
+
+
+
+ ☆ Exploring the Residual Stream of Transformers
+
+
+ Transformer-based models have achieved great breakthroughs in recent years.
+However, there are many significant questions that have not been answered in
+the field of explaining the reason why the models have powerful outputs. We do
+not know how to locate the models' important parameters storing the knowledge
+for predicting the next word, and whether these parameters are stored on the
+same layer/module or different ones. Moreover, we do not understand the
+mechanism to merge the knowledge into the final embedding for next word
+prediction. In this paper, we explore the residual stream of transformers to
+increase the interpretability. We find the mechanism behind residual connection
+is a direct addition function on before-softmax values, so the probabilities of
+tokens with larger before-softmax values will increase. Moreover, we prove that
+using log probability increase as contribution scores is reasonable, and based
+on this we can locate important parameters. Besides, we propose a method to
+analyze how previous layers affect upper layers by comparing the inner
+products. The experimental results and case study show that our research can
+increase the interpretability of transformer-based models. We will release our
+code on https://github.com/zepingyu0512/residualstream.
+
+
+
+
+
+
+
+ ☆ Best Arm Identification with Fixed Budget: A Large Deviation Perspective NeurIPS 2023
+
+
+ We consider the problem of identifying the best arm in stochastic Multi-Armed
+Bandits (MABs) using a fixed sampling budget. Characterizing the minimal
+instance-specific error probability for this problem constitutes one of the
+important remaining open problems in MABs. When arms are selected using a
+static sampling strategy, the error probability decays exponentially with the
+number of samples at a rate that can be explicitly derived via Large Deviation
+techniques. Analyzing the performance of algorithms with adaptive sampling
+strategies is however much more challenging. In this paper, we establish a
+connection between the Large Deviation Principle (LDP) satisfied by the
+empirical proportions of arm draws and that satisfied by the empirical arm
+rewards. This connection holds for any adaptive algorithm, and is leveraged (i)
+to improve error probability upper bounds of some existing algorithms, such as
+the celebrated \sr (Successive Rejects) algorithm \citep{audibert2010best}, and
+(ii) to devise and analyze new algorithms. In particular, we present \sred
+(Continuous Rejects), a truly adaptive algorithm that can reject arms in {\it
+any} round based on the observed empirical gaps between the rewards of various
+arms. Applying our Large Deviation results, we prove that \sred enjoys better
+performance guarantees than existing algorithms, including \sr. Extensive
+numerical experiments confirm this observation.
+
+
+
+ comment: This work has been published in NeurIPS 2023
+
+
+
+
+
+
+ ☆ Object Detection for Automated Coronary Artery Using Deep Learning
+
+
+ In the era of digital medicine, medical imaging serves as a widespread
+technique for early disease detection, with a substantial volume of images
+being generated and stored daily in electronic patient records. X-ray
+angiography imaging is a standard and one of the most common methods for
+rapidly diagnosing coronary artery diseases. The notable achievements of recent
+deep learning algorithms align with the increased use of electronic health
+records and diagnostic imaging. Deep neural networks, leveraging abundant data,
+advanced algorithms, and powerful computational capabilities, prove highly
+effective in the analysis and interpretation of images. In this context, Object
+detection methods have become a promising approach, particularly through
+convolutional neural networks (CNN), streamlining medical image analysis by
+eliminating manual feature extraction. This allows for direct feature
+extraction from images, ensuring high accuracy in results. Therefore, in our
+paper, we utilized the object detection method on X-ray angiography images to
+precisely identify the location of coronary artery stenosis. As a result, this
+model enables automatic and real-time detection of stenosis locations,
+assisting in the crucial and sensitive decision-making process for healthcare
+professionals.
+
+
+ Single-domain generalization (S-DG) aims to generalize a model to unseen
+environments with a single-source domain. However, most S-DG approaches have
+been conducted in the field of classification. When these approaches are
+applied to object detection, the semantic features of some objects can be
+damaged, which can lead to imprecise object localization and misclassification.
+To address these problems, we propose an object-aware domain generalization
+(OA-DG) method for single-domain generalization in object detection. Our method
+consists of data augmentation and training strategy, which are called OA-Mix
+and OA-Loss, respectively. OA-Mix generates multi-domain data with multi-level
+transformation and object-aware mixing strategy. OA-Loss enables models to
+learn domain-invariant representations for objects and backgrounds from the
+original and OA-Mixed images. Our proposed method outperforms state-of-the-art
+works on standard benchmarks. Our code is available at
+https://github.com/WoojuLee24/OA-DG.
+
+
+
+ comment: Accepted by AAAI-24. The first two authors contributed equally
+
+
+
+
+
+
+ ☆ Probabilistic Prediction of Longitudinal Trajectory Considering Driving
+ Heterogeneity with Interpretability
+
+
+
+
+
+
+
+
+ Shuli Wang, Kun Gao, Lanfang Zhang, Yang Liu, Lei Chen
+
+
+ Automated vehicles are envisioned to navigate safely in complex mixed-traffic
+scenarios alongside human-driven vehicles. To promise a high degree of safety,
+accurately predicting the maneuvers of surrounding vehicles and their future
+positions is a critical task and attracts much attention. However, most
+existing studies focused on reasoning about positional information based on
+objective historical trajectories without fully considering the heterogeneity
+of driving behaviors. Therefore, this study proposes a trajectory prediction
+framework that combines Mixture Density Networks (MDN) and considers the
+driving heterogeneity to provide probabilistic and personalized predictions.
+Specifically, based on a certain length of historical trajectory data, the
+situation-specific driving preferences of each driver are identified, where key
+driving behavior feature vectors are extracted to characterize heterogeneity in
+driving behavior among different drivers. With the inputs of the short-term
+historical trajectory data and key driving behavior feature vectors, a
+probabilistic LSTMMD-DBV model combined with LSTM-based encoder-decoder
+networks and MDN layers is utilized to carry out personalized predictions.
+Finally, the SHapley Additive exPlanations (SHAP) method is employed to
+interpret the trained model for predictions. The proposed framework is tested
+based on a wide-range vehicle trajectory dataset. The results indicate that the
+proposed model can generate probabilistic future trajectories with remarkably
+improved predictions compared to existing benchmark models. Moreover, the
+results confirm that the additional input of driving behavior feature vectors
+representing the heterogeneity of driving behavior could provide more
+information and thus contribute to improving the prediction accuracy.
+
+
+
+ comment: 14 pages, 8 figures
+
+
+
+
+
+
+ ☆ Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor
+ Selection AAAI-24
+
+
+ Machine learning techniques, such as deep learning and ensemble methods, are
+widely used in various domains due to their ability to handle complex
+real-world tasks. However, their black-box nature has raised multiple concerns
+about the fairness, trustworthiness, and transparency of computer-assisted
+decision-making. This has led to the emergence of local post-hoc explainability
+methods, which offer explanations for individual decisions made by black-box
+algorithms. Among these methods, Kernel SHAP is widely used due to its
+model-agnostic nature and its well-founded theoretical framework. Despite these
+strengths, Kernel SHAP suffers from high instability: different executions of
+the method with the same inputs can lead to significantly different
+explanations, which diminishes the utility of post-hoc explainability. The
+contribution of this paper is two-fold. On the one hand, we show that Kernel
+SHAP's instability is caused by its stochastic neighbor selection procedure,
+which we adapt to achieve full stability without compromising explanation
+fidelity. On the other hand, we show that by restricting the neighbors
+generation to perturbations of size 1 -- which we call the coalitions of Layer
+1 -- we obtain a novel feature-attribution method that is fully stable,
+efficient to compute, and still meaningful.
+
+
+
+ comment: To appear in AAAI-24
+
+
+
+
+
+
+ ☆ Variational Mode Decomposition-Based Nonstationary Coherent Structure
+ Analysis for Spatiotemporal Data
+
+
+ The modal analysis techniques face difficulties in handling nonstationary
+phenomena. This paper presents a variational mode decomposition-based
+nonstationary coherent structure (VMD-NCS) analysis that enables the extraction
+and analysis of coherent structures in case of nonstationary phenomena from
+high-dimensional spatiotemporal data. The VMD-NCS analysis decomposes the input
+spatiotemporal data into intrinsic coherent structures (ICSs) that represent
+nonstationary spatiotemporal patterns and exhibit coherence in both the spatial
+and temporal directions. Furthermore, unlike many conventional modal analysis
+techniques, the proposed method accounts for the temporal changes in the
+spatial distribution with time. The performance of the VMD-NCS analysis was
+validated based on the transient growth phenomena in the flow around a
+cylinder. It was confirmed that the temporal changes in the spatial
+distribution, depicting the transient growth of vortex shedding where
+fluctuations arising in the far-wake region gradually approach the near-wake
+region, were represented as a single ICS. Further, in the analysis of the
+quasi-periodic flow field around a pitching airfoil, the temporal changes in
+the spatial distribution and the amplitude of vortex shedding behind the
+airfoil, influenced by the pitching motion of the airfoil, were captured as a
+single ICS. Additionally, the impact of two parameters, adjusting the number of
+ICSs ($K$) and the penalty factor related to the temporal coherence ($\alpha$),
+was investigated. The results revealed that $K$ has a significant impact on the
+VMD-NCS analysis results. In the case of a relatively high $K$, the VMD-NCS
+analysis tends to extract more periodic spatiotemporal patterns resembling the
+results of dynamic mode decomposition, whereas in the case of a small $K$, the
+analysis tends to extract more nonstationary spatiotemporal patterns.
+
+
+
+
+
+
+
+ ☆ Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation
+ in ultra low-data regimes
+
+
+
+
+
+
+
+
+ Nabeel Seedat, Nicolas Huynh, Boris van Breugel, Mihaela van der Schaar
+
+
+ Machine Learning (ML) in low-data settings remains an underappreciated yet
+crucial problem. This challenge is pronounced in low-to-middle income countries
+where access to large datasets is often limited or even absent. Hence, data
+augmentation methods to increase the sample size of datasets needed for ML are
+key to unlocking the transformative potential of ML in data-deprived regions
+and domains. Unfortunately, the limited training set constrains traditional
+tabular synthetic data generators in their ability to generate a large and
+diverse augmented dataset needed for ML tasks. To address this technical
+challenge, we introduce CLLM, which leverages the prior knowledge of Large
+Language Models (LLMs) for data augmentation in the low-data regime. While
+diverse, not all the data generated by LLMs will help increase utility for a
+downstream task, as for any generative model. Consequently, we introduce a
+principled curation process, leveraging learning dynamics, coupled with
+confidence and uncertainty metrics, to obtain a high-quality dataset.
+Empirically, on multiple real-world datasets, we demonstrate the superior
+performance of LLMs in the low-data regime compared to conventional generators.
+We further show our curation mechanism improves the downstream performance for
+all generators, including LLMs. Additionally, we provide insights and
+understanding into the LLM generation and curation mechanism, shedding light on
+the features that enable them to output high-quality augmented datasets. CLLM
+paves the way for wider usage of ML in data scarce domains and regions, by
+allying the strengths of LLMs with a robust data-centric approach.
+
+
+
+ comment: *Seedat & Huynh contributed equally
+
+
+
+
+
+
+ ☆ I-CEE: Tailoring Explanations of Image Classifications Models to User
+ Expertise
+
+
+ Effectively explaining decisions of black-box machine learning models is
+critical to responsible deployment of AI systems that rely on them. Recognizing
+their importance, the field of explainable AI (XAI) provides several techniques
+to generate these explanations. Yet, there is relatively little emphasis on the
+user (the explainee) in this growing body of work and most XAI techniques
+generate "one-size-fits-all" explanations. To bridge this gap and achieve a
+step closer towards human-centered XAI, we present I-CEE, a framework that
+provides Image Classification Explanations tailored to User Expertise. Informed
+by existing work, I-CEE explains the decisions of image classification models
+by providing the user with an informative subset of training data (i.e.,
+example images), corresponding local explanations, and model decisions.
+However, unlike prior work, I-CEE models the informativeness of the example
+images to depend on user expertise, resulting in different examples for
+different users. We posit that by tailoring the example set to user expertise,
+I-CEE can better facilitate users' understanding and simulatability of the
+model. To evaluate our approach, we conduct detailed experiments in both
+simulation and with human participants (N = 100) on multiple datasets.
+Experiments with simulated users show that I-CEE improves users' ability to
+accurately predict the model's decisions (simulatability) compared to
+baselines, providing promising preliminary results. Experiments with human
+participants demonstrate that our method significantly improves user
+simulatability accuracy, highlighting the importance of human-centered XAI
+
+
+
+
+
+
+
+ ☆ PICNN: A Pathway towards Interpretable Convolutional Neural Networks
+
+
+ Convolutional Neural Networks (CNNs) have exhibited great performance in
+discriminative feature learning for complex visual tasks. Besides
+discrimination power, interpretability is another important yet under-explored
+property for CNNs. One difficulty in the CNN interpretability is that filters
+and image classes are entangled. In this paper, we introduce a novel pathway to
+alleviate the entanglement between filters and image classes. The proposed
+pathway groups the filters in a late conv-layer of CNN into class-specific
+clusters. Clusters and classes are in a one-to-one relationship. Specifically,
+we use the Bernoulli sampling to generate the filter-cluster assignment matrix
+from a learnable filter-class correspondence matrix. To enable end-to-end
+optimization, we develop a novel reparameterization trick for handling the
+non-differentiable Bernoulli sampling. We evaluate the effectiveness of our
+method on ten widely used network architectures (including nine CNNs and a ViT)
+and five benchmark datasets. Experimental results have demonstrated that our
+method PICNN (the combination of standard CNNs with our proposed pathway)
+exhibits greater interpretability than standard CNNs while achieving higher or
+comparable discrimination power.
+
+
+
+
+
+
+
+ ☆ Optimistic Policy Gradient in Multi-Player Markov Games with a Single
+ Controller: Convergence Beyond the Minty Property AAAI 2024
+
+
+
+
+
+
+
+
+ Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm
+
+
+ Policy gradient methods enjoy strong practical performance in numerous tasks
+in reinforcement learning. Their theoretical understanding in multiagent
+settings, however, remains limited, especially beyond two-player competitive
+and potential Markov games. In this paper, we develop a new framework to
+characterize optimistic policy gradient methods in multi-player Markov games
+with a single controller. Specifically, under the further assumption that the
+game exhibits an equilibrium collapse, in that the marginals of coarse
+correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to
+stationary $\epsilon$-NE in $O(1/\epsilon^2)$ iterations, where $O(\cdot)$
+suppresses polynomial factors in the natural parameters of the game. Such an
+equilibrium collapse is well-known to manifest itself in two-player zero-sum
+Markov games, but also occurs even in a class of multi-player Markov games with
+separable interactions, as established by recent work. As a result, we bypass
+known complexity barriers for computing stationary NE when either of our
+assumptions fails. Our approach relies on a natural generalization of the
+classical Minty property that we introduce, which we anticipate to have further
+applications beyond Markov games.
+
+
+
+ comment: To appear at AAAI 2024
+
+
+
+
+
+
+ ☆ PPO-Clip Attains Global Optimality: Towards Deeper Understandings of
+ Clipping
+
+
+ Proximal Policy Optimization algorithm employing a clipped surrogate
+objective (PPO-Clip) is a prominent exemplar of the policy optimization
+methods. However, despite its remarkable empirical success, PPO-Clip lacks
+theoretical substantiation to date. In this paper, we contribute to the field
+by establishing the first global convergence results of a PPO-Clip variant in
+both tabular and neural function approximation settings. Our findings highlight
+the $O(1/\sqrt{T})$ min-iterate convergence rate specifically in the context of
+neural function approximation. We tackle the inherent challenges in analyzing
+PPO-Clip through three central concepts: (i) We introduce a generalized version
+of the PPO-Clip objective, illuminated by its connection with the hinge loss.
+(ii) Employing entropic mirror descent, we establish asymptotic convergence for
+tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the
+tabular analysis, we streamline convergence analysis by introducing a two-step
+policy improvement approach. This decouples policy search from complex neural
+policy parameterization using a regression-based update scheme. Furthermore, we
+gain deeper insights into the efficacy of PPO-Clip by interpreting these
+generalized objectives. Our theoretical findings also mark the first
+characterization of the influence of the clipping mechanism on PPO-Clip
+convergence. Importantly, the clipping range affects only the pre-constant of
+the convergence rate.
+
+
+
+
+
+
+
+ ☆ Extension of the Dip-test Repertoire -- Efficient and Differentiable
+ p-value Calculation for Clustering
+
+
+
+
+
+
+
+
+ Lena G. M. Bauer, Collin Leiber, Christian Böhm, Claudia Plant
+
+
+ Over the last decade, the Dip-test of unimodality has gained increasing
+interest in the data mining community as it is a parameter-free statistical
+test that reliably rates the modality in one-dimensional samples. It returns a
+so called Dip-value and a corresponding probability for the sample's
+unimodality (Dip-p-value). These two values share a sigmoidal relationship.
+However, the specific transformation is dependent on the sample size. Many
+Dip-based clustering algorithms use bootstrapped look-up tables translating
+Dip- to Dip-p-values for a certain limited amount of sample sizes. We propose a
+specifically designed sigmoid function as a substitute for these
+state-of-the-art look-up tables. This accelerates computation and provides an
+approximation of the Dip- to Dip-p-value transformation for every single sample
+size. Further, it is differentiable and can therefore easily be integrated in
+learning schemes using gradient descent. We showcase this by exploiting our
+function in a novel subspace clustering algorithm called Dip'n'Sub. We
+highlight in extensive experiments the various benefits of our proposal.
+
+
+
+
+
+
+
+ ☆ EncryIP: A Practical Encryption-Based Framework for Model Intellectual
+ Property Protection
+
+
+ In the rapidly growing digital economy, protecting intellectual property (IP)
+associated with digital products has become increasingly important. Within this
+context, machine learning (ML) models, being highly valuable digital assets,
+have gained significant attention for IP protection. This paper introduces a
+practical encryption-based framework called \textit{EncryIP}, which seamlessly
+integrates a public-key encryption scheme into the model learning process. This
+approach enables the protected model to generate randomized and confused
+labels, ensuring that only individuals with accurate secret keys, signifying
+authorized users, can decrypt and reveal authentic labels. Importantly, the
+proposed framework not only facilitates the protected model to multiple
+authorized users without requiring repetitive training of the original ML model
+with IP protection methods but also maintains the model's performance without
+compromising its accuracy. Compared to existing methods like watermark-based,
+trigger-based, and passport-based approaches, \textit{EncryIP} demonstrates
+superior effectiveness in both training protected models and efficiently
+detecting the unauthorized spread of ML models.
+
+
+ We present XLand-MiniGrid, a suite of tools and grid-world environments for
+meta-reinforcement learning research inspired by the diversity and depth of
+XLand and the simplicity and minimalism of MiniGrid. XLand-Minigrid is written
+in JAX, designed to be highly scalable, and can potentially run on GPU or TPU
+accelerators, democratizing large-scale experimentation with limited resources.
+To demonstrate the generality of our library, we have implemented some
+well-known single-task environments as well as new meta-learning environments
+capable of generating $10^8$ distinct tasks. We have empirically shown that the
+proposed environments can scale up to $2^{13}$ parallel instances on the GPU,
+reaching tens of millions of steps per second.
+
+
+
+
+
+
+
+
+ Siamul Karim Khan, Patrick Tinsley, Mahsa Mitcheff, Patrick Flynn, Kevin W. Bowyer, Adam Czajka
+
+
+ Synthesis of same-identity biometric iris images, both for existing and
+non-existing identities while preserving the identity across a wide range of
+pupil sizes, is complex due to intricate iris muscle constriction mechanism,
+requiring a precise model of iris non-linear texture deformations to be
+embedded into the synthesis pipeline. This paper presents the first method of
+fully data-driven, identity-preserving, pupil size-varying s ynthesis of iris
+images. This approach is capable of synthesizing images of irises with
+different pupil sizes representing non-existing identities as well as
+non-linearly deforming the texture of iris images of existing subjects given
+the segmentation mask of the target iris image. Iris recognition experiments
+suggest that the proposed deformation model not only preserves the identity
+when changing the pupil size but offers better similarity between same-identity
+iris samples with significant differences in pupil size, compared to
+state-of-the-art linear and non-linear (bio-mechanical-based) iris deformation
+models. Two immediate applications of the proposed approach are: (a) synthesis
+of, or enhancement of the existing biometric datasets for iris recognition,
+mimicking those acquired with iris sensors, and (b) helping forensic human
+experts in examining iris image pairs with significant differences in pupil
+dilation. Source codes and weights of the models are made available with the
+paper.
+
+
+ Data-driven soft sensors provide a potentially cost-effective and more
+accurate modeling approach to measure difficult-to-measure indices in
+industrial processes compared to mechanistic approaches. Artificial
+intelligence (AI) techniques, such as deep learning, have become a popular soft
+sensors modeling approach in the area of machine learning and big data.
+However, soft sensors models based deep learning potentially lead to complex
+model structures and excessive training time. In addition, industrial processes
+often rely on distributed control systems (DCS) characterized by resource
+constraints. Herein, guided by spatial geometric, a lightweight geometric
+constructive neural network, namely LightGCNet, is proposed, which utilizes
+compact angle constraint to assign the hidden parameters from dynamic
+intervals. At the same time, a node pool strategy and spatial geometric
+relationships are used to visualize and optimize the process of assigning
+hidden parameters, enhancing interpretability. In addition, the universal
+approximation property of LightGCNet is proved by spatial geometric analysis.
+Two versions algorithmic implementations of LightGCNet are presented in this
+article. Simulation results concerning both benchmark datasets and the ore
+grinding process indicate remarkable merits of LightGCNet in terms of small
+network size, fast learning speed, and sound generalization.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2307.00185
+
+
+
+
+
+
+ ☆ Active Preference Inference using Language Models and Probabilistic
+ Reasoning
+
+
+
+
+
+
+
+
+ Top Piriyakulkij, Volodymyr Kuleshov, Kevin Ellis
+
+
+ Actively inferring user preferences, for example by asking good questions, is
+important for any human-facing decision-making system. Active inference allows
+such systems to adapt and personalize themselves to nuanced individual
+preferences. To enable this ability for instruction-tuned large language models
+(LLMs), one may prompt them to ask users questions to infer their preferences,
+transforming the language models into more robust, interactive systems.
+However, out of the box, these models are not efficient at extracting
+preferences: the questions they generate are not informative, requiring a high
+number of user interactions and impeding the usability of the downstream
+system. In this work, we introduce an inference-time algorithm that helps LLMs
+quickly infer preferences by using more informative questions. Our algorithm
+uses a probabilistic model whose conditional distributions are defined by
+prompting an LLM, and returns questions that optimize expected entropy and
+expected model change. Results in a simplified interactive web shopping setting
+with real product items show that an LLM equipped with our entropy reduction
+algorithm outperforms baselines with the same underlying LLM on task
+performance while using fewer user interactions.
+
+
+
+
+
+
+
+ ☆ Modelling and characterization of fine Particulate Matter dynamics in
+ Bujumbura using low cost sensors
+
+
+ Air pollution is a result of multiple sources including both natural and
+anthropogenic activities. The rapid urbanization of the cities such as
+Bujumbura economic capital of Burundi, is one of these factors. The very first
+characterization of the spatio-temporal variability of PM2.5 in Bujumbura and
+the forecasting of PM2.5 concentration have been conducted in this paper using
+data collected during a year, from august 2022 to august 2023, by low cost
+sensors installed in Bujumbura city. For each commune, an hourly, daily and
+seasonal analysis were carried out and the results showed that the mass
+concentrations of PM2.5 in the three municipalities differ from one commune to
+another. The average hourly and annual PM2.5 concentrations exceed the World
+Health Organization standards. The range is between 28.3 and 35.0 microgram/m3
+. In order to make prediction of PM2.5 concentration, an investigation of RNN
+with Long Short Term Memory (LSTM) has been undertaken.
+
+
+
+
+
+
+
+ ☆ When Model Meets New Normals: Test-time Adaptation for Unsupervised
+ Time-series Anomaly Detection AAAI 2024
+
+
+
+
+
+
+
+
+ Dongmin Kim, Sunghyun Park, Jaegul Choo
+
+
+ Time-series anomaly detection deals with the problem of detecting anomalous
+timesteps by learning normality from the sequence of observations. However, the
+concept of normality evolves over time, leading to a "new normal problem",
+where the distribution of normality can be changed due to the distribution
+shifts between training and test data. This paper highlights the prevalence of
+the new normal problem in unsupervised time-series anomaly detection studies.
+To tackle this issue, we propose a simple yet effective test-time adaptation
+strategy based on trend estimation and a self-supervised approach to learning
+new normalities during inference. Extensive experiments on real-world
+benchmarks demonstrate that incorporating the proposed strategy into the
+anomaly detector consistently improves the model's performance compared to the
+baselines, leading to robustness to the distribution shifts.
+
+
+
+ comment: Accepted to AAAI 2024, 17 pages, https://github.com/carrtesy/M2N2
+
+
+
+
+
+
+ ☆ Continual Learning: Forget-free Winning Subnetworks for Video
+ Representations
+
+
+
+
+
+
+
+
+ Haeyong Kang, Jaehong Yoon, Sung Ju Hwang, Chang D. Yoo
+
+
+ Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which
+highlights the presence of competitive subnetworks within dense networks for
+continual learning tasks, we introduce Winning Subnetworks (WSN). This approach
+utilizes reused weights in dense networks to enhance learning in Task
+Incremental Learning (TIL) scenarios. To mitigate overfitting in Few-Shot Class
+Incremental Learning (FSCIL), we have developed WSN variants referred to as the
+Soft subnetwork (SoftNet). Furthermore, addressing WSN's limitation of sparse
+reused weights in Video Incremental Learning (VIL), we propose the Fourier
+Subneural Operator (FSO). The FSO, operating in Fourier space, adaptively and
+compactly encodes videos, discovering reusable subnetworks with diverse
+bandwidths. We have applied FSO's Fourier representations to various continual
+learning contexts, including VIL, TIL, and FSCIL. Our extensive experiments
+across these scenarios demonstrate FSO's remarkable efficacy in continual
+learning, significantly enhancing task performance at various convolutional
+representational levels: it boosts performance in the higher layers for TIL and
+FSCIL and the lower layers for VIL.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2303.14962,
+ arXiv:2306.11305
+
+ Recent research has identified discriminatory behavior of automated
+prediction algorithms towards groups identified on specific protected
+attributes (e.g., gender, ethnicity, age group, etc.). When deployed in
+real-world scenarios, such techniques may demonstrate biased predictions
+resulting in unfair outcomes. Recent literature has witnessed algorithms for
+mitigating such biased behavior mostly by adding convex surrogates of fairness
+metrics such as demographic parity or equalized odds in the loss function,
+which are often not easy to estimate. This research proposes a novel
+in-processing based GroupMixNorm layer for mitigating bias from deep learning
+models. The GroupMixNorm layer probabilistically mixes group-level feature
+statistics of samples across different groups based on the protected attribute.
+The proposed method improves upon several fairness metrics with minimal impact
+on overall accuracy. Analysis on benchmark tabular and image datasets
+demonstrates the efficacy of the proposed method in achieving state-of-the-art
+performance. Further, the experimental analysis also suggests the robustness of
+the GroupMixNorm layer against new protected attributes during inference and
+its utility in eliminating bias from a pre-trained network.
+
+
+
+ comment: 12 pages, 6 figures, Pacific-Asia Conference on Knowledge Discovery
+ and Data Mining (PAKDD) 2023
+
+ High-dimensional datasets often contain multiple meaningful clusterings in
+different subspaces. For example, objects can be clustered either by color,
+weight, or size, revealing different interpretations of the given dataset. A
+variety of approaches are able to identify such non-redundant clusterings.
+However, most of these methods require the user to specify the expected number
+of subspaces and clusters for each subspace. Stating these values is a
+non-trivial problem and usually requires detailed knowledge of the input
+dataset. In this paper, we propose a framework that utilizes the Minimum
+Description Length Principle (MDL) to detect the number of subspaces and
+clusters per subspace automatically. We describe an efficient procedure that
+greedily searches the parameter space by splitting and merging subspaces and
+clusters within subspaces. Additionally, an encoding strategy is introduced
+that allows us to detect outliers in each subspace. Extensive experiments show
+that our approach is highly competitive to state-of-the-art methods.
+
+
+
+
+
+
+
+ ☆ Time-Series Contrastive Learning against False Negatives and Class
+ Imbalance
+
+
+
+
+
+
+
+
+ Xiyuan Jin, Jing Wang, Lei Liu, Youfang Lin
+
+
+ As an exemplary self-supervised approach for representation learning,
+time-series contrastive learning has exhibited remarkable advancements in
+contemporary research. While recent contrastive learning strategies have
+focused on how to construct appropriate positives and negatives, in this study,
+we conduct theoretical analysis and find they have overlooked the fundamental
+issues: false negatives and class imbalance inherent in the InfoNCE loss-based
+framework. Therefore, we introduce a straightforward modification grounded in
+the SimCLR framework, universally adaptable to models engaged in the instance
+discrimination task. By constructing instance graphs to facilitate interactive
+learning among instances, we emulate supervised contrastive learning via the
+multiple-instances discrimination task, mitigating the harmful impact of false
+negatives. Moreover, leveraging the graph structure and few-labeled data, we
+perform semi-supervised consistency classification and enhance the
+representative ability of minority classes. We compared our method with the
+most popular time-series contrastive learning methods on four real-world
+time-series datasets and demonstrated our significant advantages in overall
+performance.
+
+
+
+
+
+
+
+ ☆ Identification of Causal Structure with Latent Variables Based on Higher
+ Order Cumulants AAAI 2024
+
+
+ Causal discovery with latent variables is a crucial but challenging task.
+Despite the emergence of numerous methods aimed at addressing this challenge,
+they are not fully identified to the structure that two observed variables are
+influenced by one latent variable and there might be a directed edge in
+between. Interestingly, we notice that this structure can be identified through
+the utilization of higher-order cumulants. By leveraging the higher-order
+cumulants of non-Gaussian data, we provide an analytical solution for
+estimating the causal coefficients or their ratios. With the estimated (ratios
+of) causal coefficients, we propose a novel approach to identify the existence
+of a causal edge between two observed variables subject to latent variable
+influence. In case when such a causal edge exits, we introduce an asymmetry
+criterion to determine the causal direction. The experimental results
+demonstrate the effectiveness of our proposed method.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ☆ Dynamic Frequency Domain Graph Convolutional Network for Traffic
+ Forecasting
+
+
+ Complex spatial dependencies in transportation networks make traffic
+prediction extremely challenging. Much existing work is devoted to learning
+dynamic graph structures among sensors, and the strategy of mining spatial
+dependencies from traffic data, known as data-driven, tends to be an intuitive
+and effective approach. However, Time-Shift of traffic patterns and noise
+induced by random factors hinder data-driven spatial dependence modeling. In
+this paper, we propose a novel dynamic frequency domain graph convolution
+network (DFDGCN) to capture spatial dependencies. Specifically, we mitigate the
+effects of time-shift by Fourier transform, and introduce the identity
+embedding of sensors and time embedding when capturing data for graph learning
+since traffic data with noise is not entirely reliable. The graph is combined
+with static predefined and self-adaptive graphs during graph convolution to
+predict future traffic data through classical causal convolutions. Extensive
+experiments on four real-world datasets demonstrate that our model is effective
+and outperforms the baselines.
+
+
+
+
+
+
+
+ ☆ Transformer Network for Multi-Person Tracking and Re-Identification in
+ Unconstrained Environment
+
+
+
+
+
+
+
+
+ Hamza Mukhtar, Muhammad Usman Ghani Khan
+
+
+ Multi-object tracking (MOT) has profound applications in a variety of fields,
+including surveillance, sports analytics, self-driving, and cooperative
+robotics. Despite considerable advancements, existing MOT methodologies tend to
+falter when faced with non-uniform movements, occlusions, and
+appearance-reappearance scenarios of the objects. Recognizing this inadequacy,
+we put forward an integrated MOT method that not only marries object detection
+and identity linkage within a singular, end-to-end trainable framework but also
+equips the model with the ability to maintain object identity links over long
+periods of time. Our proposed model, named STMMOT, is built around four key
+modules: 1) candidate proposal generation, which generates object proposals via
+a vision-transformer encoder-decoder architecture that detects the object from
+each frame in the video; 2) scale variant pyramid, a progressive pyramid
+structure to learn the self-scale and cross-scale similarities in multi-scale
+feature maps; 3) spatio-temporal memory encoder, extracting the essential
+information from the memory associated with each object under tracking; and 4)
+spatio-temporal memory decoder, simultaneously resolving the tasks of object
+detection and identity association for MOT. Our system leverages a robust
+spatio-temporal memory module that retains extensive historical observations
+and effectively encodes them using an attention-based aggregator. The
+uniqueness of STMMOT lies in representing objects as dynamic query embeddings
+that are updated continuously, which enables the prediction of object states
+with attention mechanisms and eradicates the need for post-processing.
+
+
+ While self-supervised graph pretraining techniques have shown promising
+results in various domains, their application still experiences challenges of
+limited topology learning, human knowledge dependency, and incompetent
+multi-level interactions. To address these issues, we propose a novel solution,
+Dual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which
+introduces a unique dual-level pretraining structure that orchestrates
+node-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM
+autonomously uncovers significant graph motifs through an edge pooling module,
+aligning learned motif similarities with graph kernel-based similarities. A
+cross-matching task enables sophisticated node-motif interactions and novel
+representation learning. Extensive experiments on 15 datasets validate DGPM's
+effectiveness and generalizability, outperforming state-of-the-art methods in
+unsupervised representation learning and transfer learning settings. The
+autonomously discovered motifs demonstrate the potential of DGPM to enhance
+robustness and interpretability.
+
+
+
+ comment: 14 pages, 6 figures, accepted by AAAI'24
+
+ Mixture models serve as one fundamental tool with versatile applications.
+However, their training techniques, like the popular Expectation Maximization
+(EM) algorithm, are notoriously sensitive to parameter initialization and often
+suffer from bad local optima that could be arbitrarily worse than the optimal.
+To address the long-lasting bad-local-optima challenge, we draw inspiration
+from the recent ground-breaking foundation models and propose to leverage their
+underlying big learning principle to upgrade the EM. Specifically, we present
+the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs
+joint, marginal, and orthogonally transformed marginal matchings between data
+and model distributions. Through simulated experiments, we empirically show
+that the BigLearn-EM is capable of delivering the optimal with high
+probability; comparisons on benchmark clustering datasets further demonstrate
+its effectiveness and advantages over existing techniques. The code is
+available at
+https://github.com/YulaiCong/Big-Learning-Expectation-Maximization.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on
+ NVIDIA Hopper Architecture using the CUTLASS Library
+
+
+ We provide an optimized implementation of the forward pass of
+FlashAttention-2, a popular memory-aware scaled dot-product attention
+algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture
+and written using the open-source CUTLASS library. In doing so, we explain the
+challenges and techniques involved in fusing online-softmax with back-to-back
+GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and
+Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and
+transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations,
+and choosing optimal tile sizes for the Q, K and V attention matrices while
+balancing the register pressure and shared memory utilization. In head-to-head
+benchmarks on a single H100 PCIe GPU for some common choices of
+hyperparameters, we observe 20-50% higher FLOPs/s over a version of
+FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.
+
+
+
+ comment: 13 pages, comments welcome
+
+
+
+
+
+
+ ☆ Convergence Visualizer of Decentralized Federated Distillation with
+ Reduced Communication Costs
+
+
+ Federated learning (FL) achieves collaborative learning without the need for
+data sharing, thus preventing privacy leakage. To extend FL into a fully
+decentralized algorithm, researchers have applied distributed optimization
+algorithms to FL by considering machine learning (ML) tasks as parameter
+optimization problems. Conversely, the consensus-based multi-hop federated
+distillation (CMFD) proposed in the authors' previous work makes neural network
+(NN) models get close with others in a function space rather than in a
+parameter space. Hence, this study solves two unresolved challenges of CMFD:
+(1) communication cost reduction and (2) visualization of model convergence.
+Based on a proposed dynamic communication cost reduction method (DCCR), the
+amount of data transferred in a network is reduced; however, with a slight
+degradation in the prediction accuracy. In addition, a technique for
+visualizing the distance between the NN models in a function space is also
+proposed. The technique applies a dimensionality reduction technique by
+approximating infinite-dimensional functions as numerical vectors to visualize
+the trajectory of how the models change by the distributed learning algorithm.
+
+
+
+ comment: (c) 2023 IEEE. Personal use of this material is permitted. Permission
+ from IEEE must be obtained for all other uses, in any current or future
+ media, including reprinting/republishing this material for advertising or
+ promotional purposes, creating new collective works, for resale or
+ redistribution to servers or lists, or reuse of any copyrighted component of
+ this work in other works
+
+
+
+
+
+
+ ☆ Sign Language Conversation Interpretation Using Wearable Sensors and
+ Machine Learning
+
+
+ The count of people suffering from various levels of hearing loss reached
+1.57 billion in 2019. This huge number tends to suffer on many personal and
+professional levels and strictly needs to be included with the rest of society
+healthily. This paper presents a proof of concept of an automatic sign language
+recognition system based on data obtained using a wearable device of 3 flex
+sensors. The system is designed to interpret a selected set of American Sign
+Language (ASL) dynamic words by collecting data in sequences of the performed
+signs and using machine learning methods. The built models achieved
+high-quality performances, such as Random Forest with 99% accuracy, Support
+Vector Machine (SVM) with 99%, and two K-Nearest Neighbor (KNN) models with
+98%. This indicates many possible paths toward the development of a full-scale
+system.
+
+
+
+
+
+
+
+ ☆ Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution
+ Network Using Attention-GCN-LSTM
+
+
+
+
+
+
+
+
+ Jie Liu, Yijia Cao, Yong Li, Yixiu Guo, Wei Deng
+
+
+ Accurately predicting line loss rates is vital for effective line loss
+management in distribution networks, especially over short-term multi-horizons
+ranging from one hour to one week. In this study, we propose
+Attention-GCN-LSTM, a novel method that combines Graph Convolutional Networks
+(GCN), Long Short-Term Memory (LSTM), and a three-level attention mechanism to
+address this challenge. By capturing spatial and temporal dependencies, our
+model enables accurate forecasting of line loss rates across multiple horizons.
+Through comprehensive evaluation using real-world data from 10KV feeders, our
+Attention-GCN-LSTM model consistently outperforms existing algorithms,
+exhibiting superior performance in terms of prediction accuracy and
+multi-horizon forecasting. This model holds significant promise for enhancing
+line loss management in distribution networks.
+
+
+
+
+
+
+
+ ☆ 3D-LFM: Lifting Foundation Model
+
+
+
+
+
+
+
+
+ Mosam Dabhi, Laszlo A. Jeni, Simon Lucey
+
+
+ The lifting of 3D structure and camera from 2D landmarks is at the
+cornerstone of the entire discipline of computer vision. Traditional methods
+have been confined to specific rigid objects, such as those in
+Perspective-n-Point (PnP) problems, but deep learning has expanded our
+capability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL)
+with resilience to noise, occlusions, and perspective distortions. All these
+techniques, however, have been limited by the fundamental need to establish
+correspondences across the 3D training data -- significantly limiting their
+utility to applications where one has an abundance of "in-correspondence" 3D
+data. Our approach harnesses the inherent permutation equivariance of
+transformers to manage varying number of points per 3D data instance,
+withstands occlusions, and generalizes to unseen categories. We demonstrate
+state of the art performance across 2D-3D lifting task benchmarks. Since our
+approach can be trained across such a broad class of structures we refer to it
+simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.
+
+
+
+ comment: Project page is available at https://3dlfm.github.io
+
+
+
+
+
+
+ ☆ Hierarchical and Incremental Structural Entropy Minimization for
+ Unsupervised Social Event Detection AAAI 2024
+
+
+
+
+
+
+
+
+ Yuwei Cao, Hao Peng, Zhengtao Yu, Philip S. Yu
+
+
+ As a trending approach for social event detection, graph neural network
+(GNN)-based methods enable a fusion of natural language semantics and the
+complex social network structural information, thus showing SOTA performance.
+However, GNN-based methods can miss useful message correlations. Moreover, they
+require manual labeling for training and predetermining the number of events
+for prediction. In this work, we address social event detection via graph
+structural entropy (SE) minimization. While keeping the merits of the GNN-based
+methods, the proposed framework, HISEvent, constructs more informative message
+graphs, is unsupervised, and does not require the number of events given a
+priori. Specifically, we incrementally explore the graph neighborhoods using
+1-dimensional (1D) SE minimization to supplement the existing message graph
+with edges between semantically related messages. We then detect events from
+the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our
+proposed 1D and 2D SE minimization algorithms are customized for social event
+detection and effectively tackle the efficiency problem of the existing SE
+minimization algorithms. Extensive experiments show that HISEvent consistently
+outperforms GNN-based methods and achieves the new SOTA for social event
+detection under both closed- and open-set settings while being efficient and
+robust.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for
+ Accelerating Language Models Inference AAAI24
+
+
+ Early Exiting is one of the most popular methods to achieve efficient
+inference. Current early exiting methods adopt the (weighted) sum of the cross
+entropy loss of all internal classifiers during training, imposing all these
+classifiers to predict all instances correctly. However, during inference, as
+long as one internal classifier predicts an instance correctly, it can
+accelerate without losing accuracy. Thus, there is a notable gap between
+training and inference. We propose ConsistentEE, an early exiting method that
+is consistent in training and inference. ConsistentEE formulates the early
+exiting process as a reinforcement learning problem. A policy network is added
+to decide whether an instance should exit or continue. The training objective
+of ConsistentEE only require each instance to be predicted correctly by one
+internal classifier. Additionally, we introduce the concept Memorize Layer to
+measure the hardness of an instance. We incorporate memorized layer into reward
+function design, which allows ``easy'' instances to focus more on acceleration
+while ``hard'' instances to focus more on accuracy. Experimental results show
+that our method outperforms other baselines on various natural language
+understanding and generation tasks.
+
+
+
+ comment: Accepted in AAAI24
+
+
+
+
+
+
+ ☆ Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case
+ Study on Urban Areas
+
+
+
+
+
+
+
+
+ Alperen Enes Bayar, Ufuk Uyan, Elif Toprak, Cao Yuheng, Tang Juncheng, Ahmet Alp Kindiroglu
+
+
+ Urban environments are characterized by complex structures and diverse
+features, making accurate segmentation of point cloud data a challenging task.
+This paper presents a comprehensive study on the application of RandLA-Net, a
+state-of-the-art neural network architecture, for the 3D segmentation of
+large-scale point cloud data in urban areas. The study focuses on three major
+Chinese cities, namely Chengdu, Jiaoda, and Shenzhen, leveraging their unique
+characteristics to enhance segmentation performance.
+ To address the limited availability of labeled data for these specific urban
+areas, we employed transfer learning techniques. We transferred the learned
+weights from the Sensat Urban and Toronto 3D datasets to initialize our
+RandLA-Net model. Additionally, we performed class remapping to adapt the model
+to the target urban areas, ensuring accurate segmentation results.
+ The experimental results demonstrate the effectiveness of the proposed
+approach achieving over 80\% F1 score for each areas in 3D point cloud
+segmentation. The transfer learning strategy proves to be crucial in overcoming
+data scarcity issues, providing a robust solution for urban point cloud
+analysis. The findings contribute to the advancement of point cloud
+segmentation methods, especially in the context of rapidly evolving Chinese
+urban areas.
+
+
+
+
+
+
+
+ ☆ Sparse is Enough in Fine-tuning Pre-trained Large Language Model
+
+
+
+
+
+
+
+
+ Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
+
+
+ With the prevalence of pre-training-fine-tuning paradigm, how to efficiently
+adapt the pre-trained model to the downstream tasks has been an intriguing
+issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for
+low-cost adaptation, including Adapters, Bia-only, and the recently widely used
+Low-Rank Adaptation. Although these methods have demonstrated their
+effectiveness to some extent and have been widely applied, the underlying
+principles are still unclear. In this paper, we reveal the transition of loss
+landscape in the downstream domain from random initialization to pre-trained
+initialization, that is, from low-amplitude oscillation to high-amplitude
+oscillation. The parameter gradients exhibit a property akin to sparsity, where
+a small fraction of components dominate the total gradient norm, for instance,
+1% of the components account for 99% of the gradient. This property ensures
+that the pre-trained model can easily find a flat minimizer which guarantees
+the model's ability to generalize even with a low number of trainable
+parameters. Based on this, we propose a gradient-based sparse fine-tuning
+algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its
+effectiveness on a range of tasks including the GLUE Benchmark and
+Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
+
+
+
+
+
+
+
+
+ Di Wu, Yuling Jiao, Li Shen, Haizhao Yang, Xiliang Lu
+
+
+ Deep reinforcement learning (RL) has shown remarkable success in specific
+offline decision-making scenarios, yet its theoretical guarantees are still
+under development. Existing works on offline RL theory primarily emphasize a
+few trivial settings, such as linear MDP or general function approximation with
+strong assumptions and independent data, which lack guidance for practical use.
+The coupling of deep learning and Bellman residuals makes this problem
+challenging, in addition to the difficulty of data dependence. In this paper,
+we establish a non-asymptotic estimation error of pessimistic offline RL using
+general neural network approximation with $\mathcal{C}$-mixing data regarding
+the structure of networks, the dimension of datasets, and the concentrability
+of data coverage, under mild assumptions. Our result shows that the estimation
+error consists of two parts: the first converges to zero at a desired rate on
+the sample size with partially controllable concentrability, and the second
+becomes negligible if the residual constraint is tight. This result
+demonstrates the explicit efficiency of deep adversarial offline RL frameworks.
+We utilize the empirical process tool for $\mathcal{C}$-mixing sequences and
+the neural network approximation theory for the H\"{o}lder class to achieve
+this. We also develop methods to bound the Bellman estimation error caused by
+function approximation with empirical Bellman constraint perturbations.
+Additionally, we present a result that lessens the curse of dimensionality
+using data with low intrinsic dimensionality and function classes with low
+complexity. Our estimation provides valuable insights into the development of
+deep offline RL and guidance for algorithm model design.
+
+
+
+ comment: Full version of the paper accepted to the 38th Annual AAAI Conference
+ on Artificial Intelligence (AAAI 2024)
+
+
+
+
+
+
+ ☆ Topo-MLP : A Simplicial Network Without Message Passing
+
+
+ Due to their ability to model meaningful higher order relations among a set
+of entities, higher order network models have emerged recently as a powerful
+alternative for graph-based network models which are only capable of modeling
+binary relationships. Message passing paradigm is still dominantly used to
+learn representations even for higher order network models. While powerful,
+message passing can have disadvantages during inference, particularly when the
+higher order connectivity information is missing or corrupted. To overcome such
+limitations, we propose Topo-MLP, a purely MLP-based simplicial neural network
+algorithm to learn the representation of elements in a simplicial complex
+without explicitly relying on message passing. Our framework utilizes a novel
+Higher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates
+the simplicial structure into representation learning. Our proposed model's
+simplicity makes it faster during inference. Moreover, we show that our model
+is robust when faced with missing or corrupted connectivity structure.
+
+
+
+
+
+
+
+ ☆ MG-Skip: Random Multi-Gossip Skipping Method for Nonsmooth Distributed
+ Optimization
+
+
+ Distributed optimization methods with probabilistic local updates have
+recently gained attention for their provable ability to communication
+acceleration. Nevertheless, this capability is effective only when the loss
+function is smooth and the network is sufficiently well-connected. In this
+paper, we propose the first linear convergent method MG-Skip with probabilistic
+local updates for nonsmooth distributed optimization. Without any extra
+condition for the network connectivity, MG-Skip allows for the multiple-round
+gossip communication to be skipped in most iterations, while its iteration
+complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\epsilon}\right)$ and
+communication complexity is only
+$\mathcal{O}\left(\sqrt{\frac{\kappa}{(1-\rho)}} \log
+\frac{1}{\epsilon}\right)$, where $\kappa$ is the condition number of the loss
+function and $\rho$ reflects the connectivity of the network topology. To the
+best of our knowledge, MG-Skip achieves the best communication complexity when
+the loss function has the smooth (strongly convex)+nonsmooth (convex) composite
+form.
+
+
+
+
+
+
+
+ ☆ SimCalib: Graph Neural Network Calibration based on Similarity between
+ Nodes
+
+
+
+
+
+
+
+
+ Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, Helen Meng
+
+
+ Graph neural networks (GNNs) have exhibited impressive performance in
+modeling graph data as exemplified in various applications. Recently, the GNN
+calibration problem has attracted increasing attention, especially in
+cost-sensitive scenarios. Previous work has gained empirical insights on the
+issue, and devised effective approaches for it, but theoretical supports still
+fall short. In this work, we shed light on the relationship between GNN
+calibration and nodewise similarity via theoretical analysis. A novel
+calibration framework, named SimCalib, is accordingly proposed to consider
+similarity between nodes at global and local levels. At the global level, the
+Mahalanobis distance between the current node and class prototypes is
+integrated to implicitly consider similarity between the current node and all
+nodes in the same class. At the local level, the similarity of node
+representation movement dynamics, quantified by nodewise homophily and relative
+degree, is considered. Informed about the application of nodewise movement
+patterns in analyzing nodewise behavior on the over-smoothing problem, we
+empirically present a possible relationship between over-smoothing and GNN
+calibration problem. Experimentally, we discover a correlation between nodewise
+similarity and model calibration improvement, in alignment with our theoretical
+results. Additionally, we conduct extensive experiments investigating different
+design factors and demonstrate the effectiveness of our proposed SimCalib
+framework for GNN calibration by achieving state-of-the-art performance on 14
+out of 16 benchmarks.
+
+
+
+
+
+
+
+ ☆ Initializing Services in Interactive ML Systems for Diverse Users
+
+
+
+
+
+
+
+
+ Avinandan Bose, Mihaela Curmei, Daniel L. Jiang, Jamie Morgenstern, Sarah Dean, Lillian J. Ratliff, Maryam Fazel
+
+
+ This paper studies ML systems that interactively learn from users across
+multiple subpopulations with heterogeneous data distributions. The primary
+objective is to provide specialized services for different user groups while
+also predicting user preferences. Once the users select a service based on how
+well the service anticipated their preference, the services subsequently adapt
+and refine themselves based on the user data they accumulate, resulting in an
+iterative, alternating minimization process between users and services
+(learning dynamics). Employing such tailored approaches has two main
+challenges: (i) Unknown user preferences: Typically, data on user preferences
+are unavailable without interaction, and uniform data collection across a large
+and diverse user base can be prohibitively expensive. (ii) Suboptimal Local
+Solutions: The total loss (sum of loss functions across all users and all
+services) landscape is not convex even if the individual losses on a single
+service are convex, making it likely for the learning dynamics to get stuck in
+local minima. The final outcome of the aforementioned learning dynamics is thus
+strongly influenced by the initial set of services offered to users, and is not
+guaranteed to be close to the globally optimal outcome. In this work, we
+propose a randomized algorithm to adaptively select very few users to collect
+preference data from, while simultaneously initializing a set of services. We
+prove that under mild assumptions on the loss functions, the expected total
+loss achieved by the algorithm right after initialization is within a factor of
+the globally optimal total loss with complete user preference data, and this
+factor scales only logarithmically in the number of services. Our theory is
+complemented by experiments on real as well as semi-synthetic datasets.
+
+
+
+
+
+
+
+
+ Yang Jiao, Kai Yang, Tiancheng Wu, Chengtao Jian, Jianwei Huang
+
+
+ Trilevel learning, also called trilevel optimization (TLO), has been
+recognized as a powerful modelling tool for hierarchical decision process and
+widely applied in many machine learning applications, such as robust neural
+architecture search, hyperparameter optimization, and domain adaptation.
+Tackling TLO problems has presented a great challenge due to their nested
+decision-making structure. In addition, existing works on TLO face the
+following key challenges: 1) they all focus on the non-distributed setting,
+which may lead to privacy breach; 2) they do not offer any non-asymptotic
+convergence analysis which characterizes how fast an algorithm converges. To
+address the aforementioned challenges, this paper proposes an asynchronous
+federated trilevel optimization method to solve TLO problems. The proposed
+method utilizes $\mu$-cuts to construct a hyper-polyhedral approximation for
+the TLO problem and solve it in an asynchronous manner. We demonstrate that the
+proposed $\mu$-cuts are applicable to not only convex functions but also a wide
+range of non-convex functions that meet the $\mu$-weakly convex assumption.
+Furthermore, we theoretically analyze the non-asymptotic convergence rate for
+the proposed method by showing its iteration complexity to obtain
+$\epsilon$-stationary point is upper bounded by
+$\mathcal{O}(\frac{1}{\epsilon^2})$. Extensive experiments on real-world
+datasets have been conducted to elucidate the superiority of the proposed
+method, e.g., it has a faster convergence rate with a maximum acceleration of
+approximately 80$\%$.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ☆ Multi-agent reinforcement learning using echo-state network and its
+ application to pedestrian dynamics
+
+
+ In recent years, simulations of pedestrians using the multi-agent
+reinforcement learning (MARL) have been studied. This study considered the
+roads on a grid-world environment, and implemented pedestrians as MARL agents
+using an echo-state network and the least squares policy iteration method.
+Under this environment, the ability of these agents to learn to move forward by
+avoiding other agents was investigated. Specifically, we considered two types
+of tasks: the choice between a narrow direct route and a broad detour, and the
+bidirectional pedestrian flow in a corridor. The simulations results indicated
+that the learning was successful when the density of the agents was not that
+high.
+
+
+
+ comment: 19 pages, 10 figures
+
+
+
+
+
+
+ ☆ The Validity of a Machine Learning-Based Video Game in the Objective
+ Screening of Attention Deficit Hyperactivity Disorder in Children Aged 5 to
+ 12 Years
+
+
+ Objective: Early identification of ADHD is necessary to provide the
+opportunity for timely treatment. However, screening the symptoms of ADHD on a
+large scale is not easy. This study aimed to validate a video game (FishFinder)
+for the screening of ADHD using objective measurement of the core symptoms of
+this disorder. Method: The FishFinder measures attention and impulsivity
+through in-game performance and evaluates the child's hyperactivity using
+smartphone motion sensors. This game was tested on 26 children with ADHD and 26
+healthy children aged 5 to 12 years. A Support Vector Machine was employed to
+detect children with ADHD. results: This system showed 92.3% accuracy, 90%
+sensitivity, and 93.7% specificity using a combination of in-game and movement
+features. Conclusions: The FishFinder demonstrated a strong ability to identify
+ADHD in children. So, this game can be used as an affordable, accessible, and
+enjoyable method for the objective screening of ADHD.
+
+
+ Formal abductive explanations offer crucial guarantees of rigor and so are of
+interest in high-stakes uses of machine learning (ML). One drawback of
+abductive explanations is explanation size, justified by the cognitive limits
+of human decision-makers. Probabilistic abductive explanations (PAXps) address
+this limitation, but their theoretical and practical complexity makes their
+exact computation most often unrealistic. This paper proposes novel efficient
+algorithms for the computation of locally-minimal PXAps, which offer
+high-quality approximations of PXAps in practice. The experimental results
+demonstrate the practical efficiency of the proposed algorithms.
+
+
+
+
+
+
+
+ ☆ Classification of complex local environments in systems of particle
+ shapes through shape-symmetry encoded data augmentation
+
+
+ Detecting and analyzing the local environment is crucial for investigating
+the dynamical processes of crystal nucleation and shape colloidal particle
+self-assembly. Recent developments in machine learning provide a promising
+avenue for better order parameters in complex systems that are challenging to
+study using traditional approaches. However, the application of machine
+learning to self-assembly on systems of particle shapes is still underexplored.
+To address this gap, we propose a simple, physics-agnostic, yet powerful
+approach that involves training a multilayer perceptron (MLP) as a local
+environment classifier for systems of particle shapes, using input features
+such as particle distances and orientations. Our MLP classifier is trained in a
+supervised manner with a shape symmetry-encoded data augmentation technique
+without the need for any conventional roto-translations invariant symmetry
+functions. We evaluate the performance of our classifiers on four different
+scenarios involving self-assembly of cubic structures, 2-dimensional and
+3-dimensional patchy particle shape systems, hexagonal bipyramids with varying
+aspect ratios, and truncated shapes with different degrees of truncation. The
+proposed training process and data augmentation technique are both
+straightforward and flexible, enabling easy application of the classifier to
+other processes involving particle orientations. Our work thus presents a
+valuable tool for investigating self-assembly processes on systems of particle
+shapes, with potential applications in structure identification of any
+particle-based or molecular system where orientations can be defined.
+
+
+
+ comment: 14 pages, 9 figures
+
+
+
+
+
+
+ ☆ An Adaptive Placement and Parallelism Framework for Accelerating RLHF
+ Training
+
+
+
+
+
+
+
+
+ Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
+
+
+ Recently, ChatGPT or InstructGPT like large language models (LLM) has made a
+significant impact in the AI world. These models are incredibly versatile,
+capable of performing language tasks on par or even exceeding the capabilities
+of human experts. Many works have attempted to reproduce the complex
+InstructGPT's RLHF (Reinforcement Learning with Human Feedback) training
+pipeline. However, the mainstream distributed RLHF training methods typically
+adopt a fixed model placement strategy, referred to as the Flattening strategy.
+This strategy treats all four models involved in RLHF as a single entity and
+places them on all devices, regardless of their differences. Unfortunately,
+this strategy exacerbates the generation bottlenecks in the RLHF training and
+degrades the overall training efficiency. To address these issues, we propose
+an adaptive model placement framework that offers two flexible model placement
+strategies. These strategies allow for the agile allocation of models across
+devices in a fine-grained manner. The Interleaving strategy helps reduce memory
+redundancy and communication costs during RLHF training. On the other hand, the
+Separation strategy improves the throughput of model training by separating the
+training and generation stages of the RLHF pipeline. Notably, this framework
+seamlessly integrates with other mainstream techniques for acceleration and
+enables automatic hyperparameter search. Extensive experiments have
+demonstrated that our Interleaving and Separation strategies can achieve
+notable improvements up to 11x, compared to the current state-of-the-art (SOTA)
+approaches. These experiments encompassed a wide range of training scenarios,
+involving models of varying sizes and devices of different scales. The results
+highlight the effectiveness and superiority of our approaches in accelerating
+the training of distributed RLHF.
+
+
+
+
+
+
+
+ ☆ On the Role of Server Momentum in Federated Learning AAAI 2024
+
+
+ Federated Averaging (FedAvg) is known to experience convergence issues when
+encountering significant clients system heterogeneity and data heterogeneity.
+Server momentum has been proposed as an effective mitigation. However, existing
+server momentum works are restrictive in the momentum formulation, do not
+properly schedule hyperparameters and focus only on system homogeneous
+settings, which leaves the role of server momentum still an under-explored
+problem. In this paper, we propose a general framework for server momentum,
+that (a) covers a large class of momentum schemes that are unexplored in
+federated learning (FL), (b) enables a popular stagewise hyperparameter
+scheduler, (c) allows heterogeneous and asynchronous local computing. We
+provide rigorous convergence analysis for the proposed framework. To our best
+knowledge, this is the first work that thoroughly analyzes the performances of
+server momentum with a hyperparameter scheduler and system heterogeneity.
+Extensive experiments validate the effectiveness of our proposed framework.
+
+
+
+ comment: Accepted at AAAI 2024
+
+
+
+
+
+
+ ☆ Convolutional Channel-wise Competitive Learning for the Forward-Forward
+ Algorithm AAAI 2024
+
+
+ The Forward-Forward (FF) Algorithm has been recently proposed to alleviate
+the issues of backpropagation (BP) commonly used to train deep neural networks.
+However, its current formulation exhibits limitations such as the generation of
+negative data, slower convergence, and inadequate performance on complex tasks.
+In this paper, we take the main ideas of FF and improve them by leveraging
+channel-wise competitive learning in the context of convolutional neural
+networks for image classification tasks. A layer-wise loss function is
+introduced that promotes competitive learning and eliminates the need for
+negative data construction. To enhance both the learning of compositional
+features and feature space partitioning, a channel-wise feature separator and
+extractor block is proposed that complements the competitive learning process.
+Our method outperforms recent FF-based models on image classification tasks,
+achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST,
+Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the
+performance gap between FF learning and BP methods, indicating the potential of
+our proposed approach to learn useful representations in a layer-wise modular
+fashion, enabling more efficient and flexible learning.
+
+
+
+ comment: To be published in AAAI 2024, 11 pages, 7 figures
+
+
+
+
+
+
+ ☆ Discovering Malicious Signatures in Software from Structural
+ Interactions ICASSP 2024
+
+
+ Malware represents a significant security concern in today's digital
+landscape, as it can destroy or disable operating systems, steal sensitive user
+information, and occupy valuable disk space. However, current malware detection
+methods, such as static-based and dynamic-based approaches, struggle to
+identify newly developed (``zero-day") malware and are limited by customized
+virtual machine (VM) environments. To overcome these limitations, we propose a
+novel malware detection approach that leverages deep learning, mathematical
+techniques, and network science. Our approach focuses on static and dynamic
+analysis and utilizes the Low-Level Virtual Machine (LLVM) to profile
+applications within a complex network. The generated network topologies are
+input into the GraphSAGE architecture to efficiently distinguish between benign
+and malicious software applications, with the operation names denoted as node
+features. Importantly, the GraphSAGE models analyze the network's topological
+geometry to make predictions, enabling them to detect state-of-the-art malware
+and prevent potential damage during execution in a VM. To evaluate our
+approach, we conduct a study on a dataset comprising source code from 24,376
+applications, specifically written in C/C++, sourced directly from
+widely-recognized malware and various types of benign software. The results
+show a high detection performance with an Area Under the Receiver Operating
+Characteristic Curve (AUROC) of 99.85%. Our approach marks a substantial
+improvement in malware detection, providing a notably more accurate and
+efficient solution when compared to current state-of-the-art malware detection
+methods.
+
+
+ This manuscript enriches the framework of continuous normalizing flows (CNFs)
+within causal inference, primarily to augment the geometric properties of
+parametric submodels used in targeted maximum likelihood estimation (TMLE). By
+introducing an innovative application of CNFs, we construct a refined series of
+parametric submodels that enable a directed interpolation between the prior
+distribution $p_0$ and the empirical distribution $p_1$. This proposed
+methodology serves to optimize the semiparametric efficiency bound in causal
+inference by orchestrating CNFs to align with Wasserstein gradient flows. Our
+approach not only endeavors to minimize the mean squared error in the
+estimation but also imbues the estimators with geometric sophistication,
+thereby enhancing robustness against misspecification. This robustness is
+crucial, as it alleviates the dependence on the standard $n^{\frac{1}{4}}$ rate
+for a doubly-robust perturbation direction in TMLE. By incorporating robust
+optimization principles and differential geometry into the estimators, the
+developed geometry-aware CNFs represent a significant advancement in the
+pursuit of doubly robust causal inference.
+
+
+
+
+
+
+
+
+ Roman Pogodin, Namrata Deka, Yazhe Li, Danica J. Sutherland, Victor Veitch, Arthur Gretton
+
+
+ We introduce the Conditional Independence Regression CovariancE (CIRCE), a
+measure of conditional independence for multivariate continuous-valued
+variables. CIRCE applies as a regularizer in settings where we wish to learn
+neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being
+conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are
+assumed to be continuous-valued but relatively low dimensional, whereas $X$ and
+its features may be complex and high dimensional. Relevant settings include
+domain-invariant learning, fairness, and causal learning. The procedure
+requires just a single ridge regression from $Y$ to kernelized features of $Z$,
+which can be done in advance. It is then only necessary to enforce independence
+of $\varphi(X)$ from residuals of this regression, which is possible with
+attractive estimation properties and consistency guarantees. By contrast,
+earlier measures of conditional feature dependence require multiple regressions
+for each step of feature learning, resulting in more severe bias and variance,
+and greater computational cost. When sufficiently rich features are used, we
+establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z
+\mid Y$. In experiments, we show superior performance to previous methods on
+challenging benchmarks, including learning conditionally invariant image
+features.
+
+
+
+
+
+
+
+
+ B. A. Schreiber, J. Denholm, F. Jaeckle, M. J. Arends, K. M. Branson, C. -B. Schönlieb, E. J. Soilleux
+
+
+ We present an innovative method for rapidly segmenting hematoxylin and eosin
+(H&E)-stained tissue in whole-slide images (WSIs) that eliminates a wide range
+of undesirable artefacts such as pen marks and scanning artefacts. Our method
+involves taking a single-channel representation of a lowmagnification RGB
+overview of the WSI in which the pixel values are bimodally distributed such
+that H&E-stained tissue is easily distinguished from both background and a wide
+variety of artefacts. We demonstrate our method on 30 WSIs prepared from a wide
+range of institutions and WSI digital scanners, each containing substantial
+artefacts, and compare it to segmentations provided by Otsu thresholding and
+Histolab tissue segmentation and pen filtering tools. We found that our method
+segmented the tissue and fully removed all artefacts in 29 out of 30 WSIs,
+whereas Otsu thresholding failed to remove any artefacts, and the Histolab pen
+filtering tools only partially removed the pen marks. The beauty of our
+approach lies in its simplicity: manipulating RGB colour space and using Otsu
+thresholding allows for the segmentation of H&E-stained tissue and the rapid
+removal of artefacts without the need for machine learning or parameter tuning.
+
+
+
+ comment: 7 pages, 3 figures
+
+
+
+
+
+
+ ♻ ☆ Finding Nash equilibria by minimizing approximate exploitability with
+ learned best responses
+
+
+ There has been substantial progress on finding game-theoretic equilibria.
+Most of that work has focused on games with finite, discrete action spaces.
+However, many games involving space, time, money, and other fine-grained
+quantities have continuous action spaces (or are best modeled as such). We
+study the problem of finding an approximate Nash equilibrium of games with
+continuous action sets. The standard measure of closeness to Nash equilibrium
+is exploitability, which measures how much players can benefit from
+unilaterally changing their strategy. We propose two new methods that minimize
+an approximation of the exploitability with respect to the strategy profile.
+The first method uses a learned best-response function, which takes the current
+strategy profile as input and returns candidate best responses for each player.
+The strategy profile and best-response functions are trained simultaneously,
+with the former trying to minimize exploitability while the latter tries to
+maximize it. The second method maintains an ensemble of candidate best
+responses for each player. In each iteration, the best-performing elements of
+each ensemble are used to update the current strategy profile. The strategy
+profile and best-response ensembles are simultaneously trained to minimize and
+maximize the approximate exploitability, respectively. We evaluate our methods
+on various continuous games, showing that they outperform prior methods.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:1611.01673 by other authors
+
+
+
+
+
+
+ ♻ ☆ ICML 2023 Topological Deep Learning Challenge : Design and Results
+
+
+
+
+
+
+
+
+ Mathilde Papillon, Mustafa Hajij, Helen Jenne, Johan Mathe, Audun Myers, Theodore Papamarkou, Ghada Zamzmi, Tolga Birdal, Tamal Dey, Tim Doster, Tegan Emerson, Gurusankar Gopalakrishnan, Devendra Govil, Aldo Guzmán-Sáenz, Henry Kvinge, Neal Livesay, Soham Mukherjee, Shreyas N. Samaga, Karthikeyan Natesan Ramamurthy, Maneel Reddy Karri, Paul Rosen, Sophia Sanborn, Robin Walters, Jens Agerberg, Sadrodin Barikbin, Claudio Battiloro, Gleb Bazhenov, Guillermo Bernardez, Aiden Brent, Sergio Escalera, Simone Fiorellino, Dmitrii Gavrilev, Mohammed Hassanin, Paul Häusner, Odin Hoff Gardaa, Abdelwahed Khamis, Manuel Lecha, German Magai, Tatiana Malygina, Rubén Ballester, Kalyan Nadimpalli, Alexander Nikitin, Abraham Rabinowitz, Alessandro Salatiello, Simone Scardapane, Luca Scofano, Suraj Singh, Jens Sjölund, Pavel Snopov, Indro Spinelli, Lev Telyatnikov, Lucia Testa, Maosheng Yang, Yixiao Yue, Olga Zaghen, Ali Zia, Nina Miolane
+
+
+ This paper presents the computational challenge on topological deep learning
+that was hosted within the ICML 2023 Workshop on Topology and Geometry in
+Machine Learning. The competition asked participants to provide open-source
+implementations of topological neural networks from the literature by
+contributing to the python packages TopoNetX (data processing) and TopoModelX
+(deep learning). The challenge attracted twenty-eight qualifying submissions in
+its two-month duration. This paper describes the design of the challenge and
+summarizes its main findings.
+
+
+ Large Language Models (LLM) exhibit zero-shot mathematical reasoning capacity
+as a behavior emergent with scale, commonly manifesting as chain-of-thoughts
+(CoT) reasoning. However, multiple empirical findings suggest that this prowess
+is exclusive to LLMs with exorbitant sizes (beyond 50 billion parameters).
+Meanwhile, educational neuroscientists suggest that symbolic algebraic
+manipulation be introduced around the same time as arithmetic word problems to
+modularize language-to-formulation, symbolic manipulation of the formulation,
+and endgame arithmetic. In this paper, we start with the hypothesis that much
+smaller LMs, which are weak at multi-step reasoning, can achieve reasonable
+arithmetic reasoning if arithmetic word problems are posed as a
+formalize-then-solve task. In our architecture, which we call SYRELM, the LM
+serves the role of a translator to map natural language arithmetic questions
+into a formal language (FL) description. A symbolic solver then evaluates the
+FL expression to obtain the answer. A small frozen LM, equipped with an
+efficient low-rank adapter, is capable of generating FL expressions that
+incorporate natural language descriptions of the arithmetic problem (e.g.,
+variable names and their purposes, formal expressions combining variables,
+etc.). We adopt policy-gradient reinforcement learning to train the adapted LM,
+informed by the non-differentiable symbolic solver. This marks a sharp
+departure from the recent development in tool-augmented LLMs, in which the
+external tools (e.g., calculator, Web search, etc.) are essentially detached
+from the learning phase of the LM. SYRELM shows massive improvements (e.g.,
++30.65 absolute point improvement in accuracy on the SVAMP dataset using GPT-J
+6B model) over base LMs, while keeping our testbed easy to diagnose, interpret
+and within reach of most researchers.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Augmentation-Aware Self-Supervision for Data-Efficient GAN Training NeurIPS 2023
+
+
+ Training generative adversarial networks (GANs) with limited data is
+challenging because the discriminator is prone to overfitting. Previously
+proposed differentiable augmentation demonstrates improved data efficiency of
+training GANs. However, the augmentation implicitly introduces undesired
+invariance to augmentation for the discriminator since it ignores the change of
+semantics in the label space caused by data transformation, which may limit the
+representation learning ability of the discriminator and ultimately affect the
+generative modeling performance of the generator. To mitigate the negative
+impact of invariance while inheriting the benefits of data augmentation, we
+propose a novel augmentation-aware self-supervised discriminator that predicts
+the augmentation parameter of the augmented data. Particularly, the prediction
+targets of real data and generated data are required to be distinguished since
+they are different during training. We further encourage the generator to
+adversarially learn from the self-supervised discriminator by generating
+augmentation-predictable real and not fake data. This formulation connects the
+learning objective of the generator and the arithmetic $-$ harmonic mean
+divergence under certain assumptions. We compare our method with
+state-of-the-art (SOTA) methods using the class-conditional BigGAN and
+unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100,
+FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate
+significant improvements of our method over SOTA methods in training
+data-efficient GANs.
+
+
+
+
+
+
+
+
+ Stanislas Ducotterd, Alexis Goujon, Pakshal Bohra, Dimitris Perdios, Sebastian Neumayer, Michael Unser
+
+
+ Lipschitz-constrained neural networks have several advantages over
+unconstrained ones and can be applied to a variety of problems, making them a
+topic of attention in the deep learning community. Unfortunately, it has been
+shown both theoretically and empirically that they perform poorly when equipped
+with ReLU activation functions. By contrast, neural networks with learnable
+1-Lipschitz linear splines are known to be more expressive. In this paper, we
+show that such networks correspond to global optima of a constrained functional
+optimization problem that consists of the training of a neural network composed
+of 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with
+second-order total-variation regularization. Further, we propose an efficient
+method to train these neural networks. Our numerical experiments show that our
+trained networks compare favorably with existing 1-Lipschitz neural
+architectures.
+
+
+
+
+
+
+
+ ♻ ☆ Bayesian Methods for Media Mix Modelling with shape and funnel effects
+
+
+ In recent years, significant progress in generative AI has highlighted the
+important role of physics-inspired models that utilize advanced mathematical
+concepts based on fundamental physics principles to enhance artificial
+intelligence capabilities. Among these models, those based on diffusion
+equations have greatly improved image quality. This study aims to explore the
+potential uses of Maxwell-Boltzmann equation, which forms the basis of the
+kinetic theory of gases, and the Michaelis-Menten model in Marketing Mix
+Modelling (MMM) applications. We propose incorporating these equations into
+Hierarchical Bayesian models to analyse consumer behaviour in the context of
+advertising. These equation sets excel in accurately describing the random
+dynamics in complex systems like social interactions and consumer-advertising
+interactions.
+
+
+
+ comment: Rev. 4, December 2023
+
+
+
+
+
+
+ ♻ ☆ auto-sktime: Automated Time Series Forecasting AISTATS 2024
+
+
+
+
+
+
+
+
+ Marc-André Zöller, Marius Lindauer, Marco F. Huber
+
+
+ In today's data-driven landscape, time series forecasting is pivotal in
+decision-making across various sectors. Yet, the proliferation of more diverse
+time series data, coupled with the expanding landscape of available forecasting
+methods, poses significant challenges for forecasters. To meet the growing
+demand for efficient forecasting, we introduce auto-sktime, a novel framework
+for automated time series forecasting. The proposed framework uses the power of
+automated machine learning (AutoML) techniques to automate the creation of the
+entire forecasting pipeline. The framework employs Bayesian optimization, to
+automatically construct pipelines from statistical, machine learning (ML) and
+deep neural network (DNN) models. Furthermore, we propose three essential
+improvements to adapt AutoML to time series data: First, pipeline templates to
+account for the different supported forecasting models. Second, a novel
+warm-starting technique to start the optimization from prior optimization runs.
+Third, we adapt multi-fidelity optimizations to make them applicable to a
+search space containing statistical, ML and DNN models. Experimental results on
+64 diverse real-world time series datasets demonstrate the effectiveness and
+efficiency of the framework, outperforming traditional methods while requiring
+minimal human involvement.
+
+
+
+ comment: Submitted to AISTATS 2024
+
+
+
+
+
+
+ ♻ ☆ Who Reviews The Reviewers? A Multi-Level Jury Problem
+
+
+
+
+
+
+
+
+ Ben Abramowitz, Omer Lev, Nicholas Mattei
+
+
+ We consider the problem of determining a binary ground truth using advice
+from a group of independent reviewers (experts) who express their guess about a
+ground truth correctly with some independent probability (competence). In this
+setting, when all reviewers are competent (competence greater than one-half),
+the Condorcet Jury Theorem tells us that adding more reviewers increases the
+overall accuracy, and if all competences are known, then there exists an
+optimal weighting of the reviewers. However, in practical settings, reviewers
+may be noisy or incompetent, i.e., competence below half, and the number of
+experts may be small, so the asymptotic Condorcet Jury Theorem is not
+practically relevant. In such cases we explore appointing one or more chairs
+(judges) who determine the weight of each reviewer for aggregation, creating
+multiple levels. However, these chairs may be unable to correctly identify the
+competence of the reviewers they oversee, and therefore unable to compute the
+optimal weighting. We give conditions when a set of chairs is able to weight
+the reviewers optimally, and depending on the competence distribution of the
+agents, give results about when it is better to have more chairs or more
+reviewers. Through numerical simulations we show that in some cases it is
+better to have more chairs, but in many cases it is better to have more
+reviewers.
+
+
+
+
+
+
+
+ ♻ ☆ A Comparative Evaluation of Additive Separability Tests for
+ Physics-Informed Machine Learning
+
+
+ Many functions characterising physical systems are additively separable. This
+is the case, for instance, of mechanical Hamiltonian functions in physics,
+population growth equations in biology, and consumer preference and utility
+functions in economics. We consider the scenario in which a surrogate of a
+function is to be tested for additive separability. The detection that the
+surrogate is additively separable can be leveraged to improve further learning.
+Hence, it is beneficial to have the ability to test for such separability in
+surrogates. The mathematical approach is to test if the mixed partial
+derivative of the surrogate is zero; or empirically, lower than a threshold. We
+present and comparatively and empirically evaluate the eight methods to compute
+the mixed partial derivative of a surrogate function.
+
+
+
+
+
+
+
+ ♻ ☆ FAL-CUR: Fair Active Learning using Uncertainty and Representativeness
+ on Fair Clustering
+
+
+ Active Learning (AL) techniques have proven to be highly effective in
+reducing data labeling costs across a range of machine learning tasks.
+Nevertheless, one known challenge of these methods is their potential to
+introduce unfairness towards sensitive attributes. Although recent approaches
+have focused on enhancing fairness in AL, they tend to reduce the model's
+accuracy. To address this issue, we propose a novel strategy, named Fair Active
+Learning using fair Clustering, Uncertainty, and Representativeness (FAL-CUR),
+to improve fairness in AL. FAL-CUR tackles the fairness problem in AL by
+combining fair clustering with an acquisition function that determines which
+samples to query based on their uncertainty and representativeness scores. We
+evaluate the performance of FAL-CUR on four real-world datasets, and the
+results demonstrate that FAL-CUR achieves a 15% - 20% improvement in fairness
+compared to the best state-of-the-art method in terms of equalized odds while
+maintaining stable accuracy scores. Furthermore, an ablation study highlights
+the crucial roles of fair clustering in preserving fairness and the acquisition
+function in stabilizing the accuracy performance.
+
+
+
+
+
+
+
+ ♻ ☆ A Baseline Analysis of Reward Models' Ability To Accurately Analyze
+ Foundation Models Under Distribution Shift
+
+
+
+
+
+
+
+
+ Will LeVine, Ben Pikus, Tony Chen, Sean Hendryx
+
+
+ Foundation models, specifically Large Language Models (LLM's), have lately
+gained wide-spread attention and adoption. Reinforcement Learning with Human
+Feedback (RLHF) involves training a reward model to capture desired behaviors,
+which is then used to align LLM's. These reward models are additionally used at
+inference-time to estimate LLM responses' adherence to those desired behaviors.
+However, there is little work measuring how robust these reward models are to
+distribution shifts. In this work, we evaluate how reward model performance -
+measured via accuracy and calibration (i.e. alignment between accuracy and
+confidence) - is affected by distribution shift. We show novel calibration
+patterns and accuracy drops due to OOD prompts and responses, and that the
+reward model is more sensitive to shifts in responses than prompts.
+Additionally, we adapt an OOD detection technique commonly used in
+classification to the reward model setting to detect these distribution shifts
+in prompts and responses.
+
+
+
+
+
+
+
+ ♻ ☆ Vertical Federated Alzheimer's Detection on Multimodal Data
+
+
+ In the era of rapidly advancing medical technologies, the segmentation of
+medical data has become inevitable, necessitating the development of privacy
+preserving machine learning algorithms that can train on distributed data.
+Consolidating sensitive medical data is not always an option particularly due
+to the stringent privacy regulations imposed by the Health Insurance
+Portability and Accountability Act (HIPAA). In this paper, we introduce a HIPAA
+compliant framework that can train from distributed data. We then propose a
+multimodal vertical federated model for Alzheimer's Disease (AD) detection, a
+serious neurodegenerative condition that can cause dementia, severely impairing
+brain function and hindering simple tasks, especially without preventative
+care. This vertical federated model offers a distributed architecture that
+enables collaborative learning across diverse sources of medical data while
+respecting privacy constraints imposed by HIPAA. It is also able to leverage
+multiple modalities of data, enhancing the robustness and accuracy of AD
+detection. Our proposed model not only contributes to the advancement of
+federated learning techniques but also holds promise for overcoming the hurdles
+posed by data segmentation in medical research. By using vertical federated
+learning, this research strives to provide a framework that enables healthcare
+institutions to harness the collective intelligence embedded in their
+distributed datasets without compromising patient privacy.
+
+
+
+
+
+
+
+
+ Max van Spengler, Philipp Wirth, Pascal Mettes
+
+
+ Deep learning in hyperbolic space is quickly gaining traction in the fields
+of machine learning, multimedia, and computer vision. Deep networks commonly
+operate in Euclidean space, implicitly assuming that data lies on regular
+grids. Recent advances have shown that hyperbolic geometry provides a viable
+alternative foundation for deep learning, especially when data is hierarchical
+in nature and when working with few embedding dimensions. Currently however, no
+accessible open-source library exists to build hyperbolic network modules akin
+to well-known deep learning libraries. We present HypLL, the Hyperbolic
+Learning Library to bring the progress on hyperbolic deep learning together.
+HypLL is built on top of PyTorch, with an emphasis in its design for
+ease-of-use, in order to attract a broad audience towards this new and
+open-ended research direction. The code is available at:
+https://github.com/maxvanspengler/hyperbolic_learning_library.
+
+
+
+
+
+
+
+ ♻ ☆ Fast Neural Network Inference on FPGAs for Triggering on Long-Lived
+ Particles at Colliders
+
+
+
+
+
+
+
+
+ Andrea Coccaro, Francesco Armando Di Bello, Stefano Giagu, Lucrezia Rambelli, Nicola Stocchetti
+
+
+ Experimental particle physics demands a sophisticated trigger and acquisition
+system capable to efficiently retain the collisions of interest for further
+investigation. Heterogeneous computing with the employment of FPGA cards may
+emerge as a trending technology for the triggering strategy of the upcoming
+high-luminosity program of the Large Hadron Collider at CERN. In this context,
+we present two machine-learning algorithms for selecting events where neutral
+long-lived particles decay within the detector volume studying their accuracy
+and inference time when accelerated on commercially available Xilinx FPGA
+accelerator cards. The inference time is also confronted with a CPU- and
+GPU-based hardware setup. The proposed new algorithms are proven efficient for
+the considered benchmark physics scenario and their accuracy is found to not
+degrade when accelerated on the FPGA cards. The results indicate that all
+tested architectures fit within the latency requirements of a second-level
+trigger farm and that exploiting accelerator technologies for real-time
+processing of particle-physics collisions is a promising research field that
+deserves additional investigations, in particular with machine-learning models
+with a large number of trainable parameters.
+
+
+
+
+
+
+
+
+ Nathanael Bosch, Philipp Hennig, Filip Tronarp
+
+
+ Probabilistic solvers provide a flexible and efficient framework for
+simulation, uncertainty quantification, and inference in dynamical systems.
+However, like standard solvers, they suffer performance penalties for certain
+stiff systems, where small steps are required not for reasons of numerical
+accuracy but for the sake of stability. This issue is greatly alleviated in
+semi-linear problems by the probabilistic exponential integrators developed in
+this paper. By including the fast, linear dynamics in the prior, we arrive at a
+class of probabilistic integrators with favorable properties. Namely, they are
+proven to be L-stable, and in a certain case reduce to a classic exponential
+integrator -- with the added benefit of providing a probabilistic account of
+the numerical error. The method is also generalized to arbitrary non-linear
+systems by imposing piece-wise semi-linearity on the prior via Jacobians of the
+vector field at the previous estimates, resulting in probabilistic exponential
+Rosenbrock methods. We evaluate the proposed methods on multiple stiff
+differential equations and demonstrate their improved stability and efficiency
+over established probabilistic solvers. The present contribution thus expands
+the range of problems that can be effectively tackled within probabilistic
+numerics.
+
+
+
+
+
+
+
+ ♻ ☆ Label Words are Anchors: An Information Flow Perspective for
+ Understanding In-Context Learning EMNLP 2023
+
+
+
+
+
+
+
+
+ Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
+
+
+ In-context learning (ICL) emerges as a promising capability of large language
+models (LLMs) by providing them with demonstration examples to perform diverse
+tasks. However, the underlying mechanism of how LLMs learn from the provided
+context remains under-explored. In this paper, we investigate the working
+mechanism of ICL through an information flow lens. Our findings reveal that
+label words in the demonstration examples function as anchors: (1) semantic
+information aggregates into label word representations during the shallow
+computation layers' processing; (2) the consolidated information in label words
+serves as a reference for LLMs' final predictions. Based on these insights, we
+introduce an anchor re-weighting method to improve ICL performance, a
+demonstration compression technique to expedite inference, and an analysis
+framework for diagnosing ICL errors in GPT2-XL. The promising applications of
+our findings again validate the uncovered ICL working mechanism and pave the
+way for future studies.
+
+
+
+ comment: Accepted by EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Ghost Noise for Regularizing Deep Neural Networks
+
+
+ Batch Normalization (BN) is widely used to stabilize the optimization process
+and improve the test performance of deep neural networks. The regularization
+effect of BN depends on the batch size and explicitly using smaller batch sizes
+with Batch Normalization, a method known as Ghost Batch Normalization (GBN),
+has been found to improve generalization in many settings. We investigate the
+effectiveness of GBN by disentangling the induced ``Ghost Noise'' from
+normalization and quantitatively analyzing the distribution of noise as well as
+its impact on model performance. Inspired by our analysis, we propose a new
+regularization technique called Ghost Noise Injection (GNI) that imitates the
+noise in GBN without incurring the detrimental train-test discrepancy effects
+of small batch training. We experimentally show that GNI can provide a greater
+generalization benefit than GBN. Ghost Noise Injection can also be beneficial
+in otherwise non-noisy settings such as layer-normalized networks, providing
+additional evidence of the usefulness of Ghost Noise in Batch Normalization as
+a regularizer.
+
+
+
+
+
+
+
+
+ Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Gardar Ingvarsson, Timon Willi, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktaschel, Chris Lu, Jakob Nicolaus Foerster
+
+
+ Benchmarks play an important role in the development of machine learning
+algorithms. For example, research in reinforcement learning (RL) has been
+heavily influenced by available environments and benchmarks. However, RL
+environments are traditionally run on the CPU, limiting their scalability with
+typical academic compute. Recent advancements in JAX have enabled the wider use
+of hardware acceleration to overcome these computational hurdles, enabling
+massively parallel RL training pipelines and environments. This is particularly
+useful for multi-agent reinforcement learning (MARL) research. First of all,
+multiple agents must be considered at each environment step, adding
+computational burden, and secondly, the sample complexity is increased due to
+non-stationarity, decentralised partial observability, or other MARL
+challenges. In this paper, we present JaxMARL, the first open-source code base
+that combines ease-of-use with GPU enabled efficiency, and supports a large
+number of commonly used MARL environments as well as popular baseline
+algorithms. When considering wall clock time, our experiments show that per-run
+our JAX-based training pipeline is up to 12500x faster than existing
+approaches. This enables efficient and thorough evaluations, with the potential
+to alleviate the evaluation crisis of the field. We also introduce and
+benchmark SMAX, a vectorised, simplified version of the popular StarCraft
+Multi-Agent Challenge, which removes the need to run the StarCraft II game
+engine. This not only enables GPU acceleration, but also provides a more
+flexible MARL environment, unlocking the potential for self-play,
+meta-learning, and other future applications in MARL. We provide code at
+https://github.com/flairox/jaxmarl.
+
+
+
+
+
+
+
+ ♻ ☆ Polar Encoding: A Simple Baseline Approach for Classification with
+ Missing Values
+
+
+
+
+
+
+
+
+ Oliver Urs Lenz, Daniel Peralta, Chris Cornelis
+
+
+ We propose polar encoding, a representation of categorical and numerical
+$[0,1]$-valued attributes with missing values to be used in a classification
+context. We argue that this is a good baseline approach, because it can be used
+with any classification algorithm, preserves missingness information, is very
+simple to apply and offers good performance. In particular, unlike the existing
+missing-indicator approach, it does not require imputation, ensures that
+missing values are equidistant from non-missing values, and lets decision tree
+algorithms choose how to split missing values, thereby providing a practical
+realisation of the "missingness incorporated in attributes" (MIA) proposal.
+Furthermore, we show that categorical and $[0,1]$-valued attributes can be
+viewed as special cases of a single attribute type, corresponding to the
+classical concept of barycentric coordinates, and that this offers a natural
+interpretation of polar encoding as a fuzzified form of one-hot encoding. With
+an experiment based on twenty real-life datasets with missing values, we show
+that, in terms of the resulting classification performance, polar encoding
+performs better than the state-of-the-art strategies \e{multiple imputation by
+chained equations} (MICE) and \e{multiple imputation with denoising
+autoencoders} (MIDAS) and -- depending on the classifier -- about as well or
+better than mean/mode imputation with missing-indicators.
+
+
+
+
+
+
+
+ ♻ ☆ Relative Policy-Transition Optimization for Fast Policy Transfer AAAI 2024
+
+
+
+
+
+
+
+
+ Jiawei Xu, Cheng Zhou, Yizheng Zhang, Baoxiang Wang, Lei Han
+
+
+ We consider the problem of policy transfer between two Markov Decision
+Processes (MDPs). We introduce a lemma based on existing theoretical results in
+reinforcement learning to measure the relativity gap between two arbitrary
+MDPs, that is the difference between any two cumulative expected returns
+defined on different policies and environment dynamics. Based on this lemma, we
+propose two new algorithms referred to as Relative Policy Optimization (RPO)
+and Relative Transition Optimization (RTO), which offer fast policy transfer
+and dynamics modelling, respectively. RPO transfers the policy evaluated in one
+environment to maximize the return in another, while RTO updates the
+parameterized dynamics model to reduce the gap between the dynamics of the two
+environments. Integrating the two algorithms results in the complete Relative
+Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts
+with the two environments simultaneously, such that data collections from two
+environments, policy and transition updates are completed in one closed loop to
+form a principled learning framework for policy transfer. We demonstrate the
+effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating
+policy transfer problems via variant dynamics.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Conductivity Imaging from Internal Measurements with Mixed Least-Squares
+ Deep Neural Networks
+
+
+ In this work we develop a novel approach using deep neural networks to
+reconstruct the conductivity distribution in elliptic problems from one
+measurement of the solution over the whole domain. The approach is based on a
+mixed reformulation of the governing equation and utilizes the standard
+least-squares objective, with deep neural networks as ansatz functions to
+approximate the conductivity and flux simultaneously. We provide a thorough
+analysis of the deep neural network approximations of the conductivity for both
+continuous and empirical losses, including rigorous error estimates that are
+explicit in terms of the noise level, various penalty parameters and neural
+network architectural parameters (depth, width and parameter bound). We also
+provide multiple numerical experiments in two- and multi-dimensions to
+illustrate distinct features of the approach, e.g., excellent stability with
+respect to data noise and capability of solving high-dimensional problems.
+
+
+
+ comment: corrected a few typos
+
+
+
+
+
+
+ ♻ ☆ Is Channel Independent strategy optimal for Time Series Forecasting?
+
+
+ There has been an emergence of various models for long-term time series
+forecasting. Recent studies have demonstrated that a single linear layer, using
+Channel Dependent (CD) or Channel Independent (CI) modeling, can even
+outperform a large number of sophisticated models. However, current research
+primarily considers CD and CI as two complementary yet mutually exclusive
+approaches, unable to harness these two extremes simultaneously. And it is also
+a challenging issue that both CD and CI are static strategies that cannot be
+determined to be optimal for a specific dataset without extensive experiments.
+In this paper, we reconsider whether the current CI strategy is the best
+solution for time series forecasting. First, we propose a simple yet effective
+strategy called CSC, which stands for $\mathbf{C}$hannel
+$\mathbf{S}$elf-$\mathbf{C}$lustering strategy, for linear models. Our Channel
+Self-Clustering (CSC) enhances CI strategy's performance improvements while
+reducing parameter size, for exmpale by over 10 times on electricity dataset,
+and significantly cutting training time. Second, we further propose Channel
+Rearrangement (CR), a method for deep models inspired by the self-clustering.
+CR attains competitive performance against baselines. Finally, we also discuss
+whether it is best to forecast the future values using the historical values of
+the same channel as inputs. We hope our findings and methods could inspire new
+solutions beyond CD/CI.
+
+
+
+
+
+
+
+ ♻ ☆ SEPT: Towards Efficient Scene Representation Learning for Motion
+ Prediction
+
+
+
+
+
+
+
+
+ Zhiqian Lan, Yuxuan Jiang, Yao Mu, Chen Chen, Shengbo Eben Li
+
+
+ Motion prediction is crucial for autonomous vehicles to operate safely in
+complex traffic environments. Extracting effective spatiotemporal relationships
+among traffic elements is key to accurate forecasting. Inspired by the
+successful practice of pretrained large language models, this paper presents
+SEPT, a modeling framework that leverages self-supervised learning to develop
+powerful spatiotemporal understanding for complex traffic scenes. Specifically,
+our approach involves three masking-reconstruction modeling tasks on scene
+inputs including agents' trajectories and road network, pretraining the scene
+encoder to capture kinematics within trajectory, spatial structure of road
+network, and interactions among roads and agents. The pretrained encoder is
+then finetuned on the downstream forecasting task. Extensive experiments
+demonstrate that SEPT, without elaborate architectural design or manual feature
+engineering, achieves state-of-the-art performance on the Argoverse 1 and
+Argoverse 2 motion forecasting benchmarks, outperforming previous methods on
+all main metrics by a large margin.
+
+
+
+
+
+
+
+ ♻ ☆ Chain-of-Questions Training with Latent Answers for Robust Multistep
+ Question Answering EMNLP 2023
+
+
+
+
+
+
+
+
+ Wang Zhu, Jesse Thomason, Robin Jia
+
+
+ We train a language model (LM) to robustly answer multistep questions by
+generating and answering sub-questions. We propose Chain-of-Questions, a
+framework that trains a model to generate sub-questions and sub-answers one at
+a time by leveraging human annotated question decomposition meaning
+representation (QDMR). The key technical challenge is that QDMR only contains
+sub-questions but not answers to those sub-questions, so we treat sub-answers
+as latent variables and optimize them using a novel dynamic mixture of Hard-EM
+and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods
+by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA
+adversarial set, thus demonstrating the effectiveness and robustness of our
+framework.
+
+
+
+ comment: Accepted by the EMNLP 2023
+
+
+
+
+
+
+ ♻ ☆ Improving new physics searches with diffusion models for event
+ observables and jet constituents
+
+
+
+
+
+
+
+
+ Debajyoti Sengupta, Matthew Leigh, John Andrew Raine, Samuel Klein, Tobias Golling
+
+
+ We introduce a new technique called Drapes to enhance the sensitivity in
+searches for new physics at the LHC. By training diffusion models on side-band
+data, we show how background templates for the signal region can be generated
+either directly from noise, or by partially applying the diffusion process to
+existing data. In the partial diffusion case, data can be drawn from side-band
+regions, with the inverse diffusion performed for new target conditional
+values, or from the signal region, preserving the distribution over the
+conditional property that defines the signal region. We apply this technique to
+the hunt for resonances using the LHCO di-jet dataset, and achieve
+state-of-the-art performance for background template generation using high
+level input features. We also show how Drapes can be applied to low level
+inputs with jet constituents, reducing the model dependence on the choice of
+input observables. Using jet constituents we can further improve sensitivity to
+the signal process, but observe a loss in performance where the signal
+significance before applying any selection is below 4$\sigma$.
+
+
+
+ comment: 34 pages, 19 figures
+
+
+
+
+
+
+ ♻ ☆ Pareto Envelope Augmented with Reinforcement Learning: Multi-objective
+ reinforcement learning-based approach for Large-Scale Constrained Pressurized
+ Water Reactor optimization
+
+
+ A novel method, the Pareto Envelope Augmented with Reinforcement Learning
+(PEARL), has been developed to address the challenges posed by multi-objective
+problems, particularly in the field of engineering where the evaluation of
+candidate solutions can be time-consuming. PEARL distinguishes itself from
+traditional policy-based multi-objective Reinforcement Learning methods by
+learning a single policy, eliminating the need for multiple neural networks to
+independently solve simpler sub-problems. Several versions inspired from deep
+learning and evolutionary techniques have been crafted, catering to both
+unconstrained and constrained problem domains. Curriculum Learning is harnessed
+to effectively manage constraints in these versions. PEARL's performance is
+first evaluated on classical multi-objective benchmarks. Additionally, it is
+tested on two practical PWR core Loading Pattern optimization problems to
+showcase its real-world applicability. The first problem involves optimizing
+the Cycle length and the rod-integrated peaking factor as the primary
+objectives, while the second problem incorporates the mean average enrichment
+as an additional objective. Furthermore, PEARL addresses three types of
+constraints related to boron concentration, peak pin burnup, and peak pin
+power. The results are systematically compared against a conventional approach,
+the Non-dominated Sorting Genetic Algorithm. Notably, PEARL, specifically the
+PEARL-NdS variant, efficiently uncovers a Pareto front without necessitating
+additional efforts from the algorithm designer, as opposed to a single
+optimization with scaled objectives. It also outperforms the classical approach
+across multiple performance metrics, including the Hyper-volume.
+
+
+
+
+
+
+
+ ♻ ☆ Finite Element Operator Network for Solving Parametric PDEs
+
+
+ Partial differential equations (PDEs) underlie our understanding and
+prediction of natural phenomena across numerous fields, including physics,
+engineering, and finance. However, solving parametric PDEs is a complex task
+that necessitates efficient numerical methods. In this paper, we propose a
+novel approach for solving parametric PDEs using a Finite Element Operator
+Network (FEONet). Our proposed method leverages the power of deep learning in
+conjunction with traditional numerical methods, specifically the finite element
+method, to solve parametric PDEs in the absence of any paired input-output
+training data. We performed various experiments on several benchmark problems
+and confirmed that our approach has demonstrated excellent performance across
+various settings and environments, proving its versatility in terms of
+accuracy, generalization, and computational flexibility. Our FEONet framework
+shows potential for application in various fields where PDEs play a crucial
+role in modeling complex domains with diverse boundary conditions and singular
+behavior. Furthermore, we provide theoretical convergence analysis to support
+our approach, utilizing finite element approximation in numerical analysis.
+
+
+ Modeling of real-world biological multi-agents is a fundamental problem in
+various scientific and engineering fields. Reinforcement learning (RL) is a
+powerful framework to generate flexible and diverse behaviors in cyberspace;
+however, when modeling real-world biological multi-agents, there is a domain
+gap between behaviors in the source (i.e., real-world data) and the target
+(i.e., cyberspace for RL), and the source environment parameters are usually
+unknown. In this paper, we propose a method for adaptive action supervision in
+RL from real-world demonstrations in multi-agent scenarios. We adopt an
+approach that combines RL and supervised learning by selecting actions of
+demonstrations in RL based on the minimum distance of dynamic time warping for
+utilizing the information of the unknown source dynamics. This approach can be
+easily applied to many existing neural network architectures and provide us
+with an RL model balanced between reproducibility as imitation and
+generalization ability to obtain rewards in cyberspace. In the experiments,
+using chase-and-escape and football tasks with the different dynamics between
+the unknown source and target environments, we show that our approach achieved
+a balance between the reproducibility and the generalization ability compared
+with the baselines. In particular, we used the tracking data of professional
+football players as expert demonstrations in football and show successful
+performances despite the larger gap between behaviors in the source and target
+environments than the chase-and-escape task.
+
+
+
+
+
+
+
+ ♻ ☆ Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge
+ Computing
+
+
+
+
+
+
+
+
+ Lyudong Jin, Ming Tang, Meng Zhang, Hao Wang
+
+
+ Mobile edge computing (MEC) is a promising paradigm for real-time
+applications with intensive computational needs (e.g., autonomous driving), as
+it can reduce the processing delay. In this work, we focus on the timeliness of
+computational-intensive updates, measured by Age-ofInformation (AoI), and study
+how to jointly optimize the task updating and offloading policies for AoI with
+fractional form. Specifically, we consider edge load dynamics and formulate a
+task scheduling problem to minimize the expected time-average AoI. The
+uncertain edge load dynamics, the nature of the fractional objective, and
+hybrid continuous-discrete action space (due to the joint optimization) make
+this problem challenging and existing approaches not directly applicable. To
+this end, we propose a fractional reinforcement learning(RL) framework and
+prove its convergence. We further design a model-free fractional deep RL (DRL)
+algorithm, where each device makes scheduling decisions with the hybrid action
+space without knowing the system dynamics and decisions of other devices.
+Experimental results show that our proposed algorithms reduce the average AoI
+by up to 57.6% compared with several non-fractional benchmarks.
+
+
+
+
+
+
+
+ ♻ ☆ Mind the Gap: Federated Learning Broadens Domain Generalization in
+ Diagnostic AI Models
+
+
+
+
+
+
+
+
+ Soroosh Tayebi Arasteh, Christiane Kuhl, Marwin-Jonathan Saehn, Peter Isfort, Daniel Truhn, Sven Nebelung
+
+
+ Developing robust artificial intelligence (AI) models that generalize well to
+unseen datasets is challenging and usually requires large and variable
+datasets, preferably from multiple institutions. In federated learning (FL), a
+model is trained collaboratively at numerous sites that hold local datasets
+without exchanging them. So far, the impact of training strategy, i.e., local
+versus collaborative, on the diagnostic on-domain and off-domain performance of
+AI models interpreting chest radiographs has not been assessed. Consequently,
+using 610,000 chest radiographs from five institutions across the globe, we
+assessed diagnostic performance as a function of training strategy (i.e., local
+vs. collaborative), network architecture (i.e., convolutional vs.
+transformer-based), generalization performance (i.e., on-domain vs.
+off-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia,
+atelectasis, consolidation, pneumothorax, and no abnormality), dataset size
+(i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large
+datasets not only showed minimal performance gains with FL but, in some
+instances, even exhibited decreases. In contrast, smaller datasets revealed
+marked improvements. Thus, on-domain performance was mainly driven by training
+data size. However, off-domain performance leaned more on training diversity.
+When trained collaboratively across diverse external institutions, AI models
+consistently surpassed models trained locally for off-domain tasks, emphasizing
+FL's potential in leveraging data diversity. In conclusion, FL can bolster
+diagnostic privacy, reproducibility, and off-domain reliability of AI models
+and, potentially, optimize healthcare outcomes.
+
+
+
+ comment: Published in Nature Scientific Reports
+
+ Machine learning (ML) models trained on data from potentially untrusted
+sources are vulnerable to poisoning. A small, maliciously crafted subset of the
+training inputs can cause the model to learn a "backdoor" task (e.g.,
+misclassify inputs with a certain feature) in addition to its main task. Recent
+research proposed many hypothetical backdoor attacks whose efficacy heavily
+depends on the configuration and training hyperparameters of the target model.
+ Given the variety of potential backdoor attacks, ML engineers who are not
+security experts have no way to measure how vulnerable their current training
+pipelines are, nor do they have a practical way to compare training
+configurations so as to pick the more resistant ones. Deploying a defense
+requires evaluating and choosing from among dozens of research papers and
+re-engineering the training pipeline.
+ In this paper, we aim to provide ML engineers with pragmatic tools to audit
+the backdoor resistance of their training pipelines and to compare different
+training configurations, to help choose one that best balances accuracy and
+security.
+ First, we propose a universal, attack-agnostic resistance metric based on the
+minimum number of training inputs that must be compromised before the model
+learns any backdoor.
+ Second, we design, implement, and evaluate Mithridates a multi-stage approach
+that integrates backdoor resistance into the training-configuration search. ML
+developers already rely on hyperparameter search to find configurations that
+maximize the model's accuracy. Mithridates extends this standard tool to
+balance accuracy and resistance without disruptive changes to the training
+pipeline. We show that hyperparameters found by Mithridates increase resistance
+to multiple types of backdoor attacks by 3-5x with only a slight impact on
+accuracy. We also discuss extensions to AutoML and federated learning.
+
+
+ A fundamental challenge of bipartite graph representation learning is how to
+extract informative node embeddings. Self-Supervised Learning (SSL) is a
+promising paradigm to address this challenge. Most recent bipartite graph SSL
+methods are based on contrastive learning which learns embeddings by
+discriminating positive and negative node pairs. Contrastive learning usually
+requires a large number of negative node pairs, which could lead to
+computational burden and semantic errors. In this paper, we introduce a novel
+synergistic representation learning model (STERLING) to learn node embeddings
+without negative node pairs. STERLING preserves the unique local and global
+synergies in bipartite graphs. The local synergies are captured by maximizing
+the similarity of the inter-type and intra-type positive node pairs, and the
+global synergies are captured by maximizing the mutual information of
+co-clusters. Theoretical analysis demonstrates that STERLING could improve the
+connectivity between different node types in the embedding space. Extensive
+empirical evaluation on various benchmark datasets and tasks demonstrates the
+effectiveness of STERLING for extracting node embeddings.
+
+
+
+ comment: Accepted by AAAI'2024
+
+
+
+
+
+
+ ♻ ☆ GDP nowcasting with artificial neural networks: How much does long-term
+ memory matter?
+
+
+ In our study, we apply artificial neural networks (ANNs) to nowcast quarterly
+GDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare
+the nowcasting performance of five different ANN architectures: the multilayer
+perceptron (MLP), the one-dimensional convolutional neural network (1D CNN),
+the Elman recurrent neural network (RNN), the long short-term memory network
+(LSTM), and the gated recurrent unit (GRU). The empirical analysis presents the
+results from two distinctively different evaluation periods. The first (2012:Q1
+-- 2019:Q4) is characterized by balanced economic growth, while the second
+(2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According
+to our results, longer input sequences result in more accurate nowcasts in
+periods of balanced economic growth. However, this effect ceases above a
+relatively low threshold value of around six quarters (eighteen months). During
+periods of economic turbulence (e.g., during the COVID-19 recession), longer
+input sequences do not help the models' predictive performance; instead, they
+seem to weaken their generalization capability. Combined results from the two
+evaluation periods indicate that architectural features enabling for long-term
+memory do not result in more accurate nowcasts. On the other hand, the 1D CNN
+has proved to be a highly suitable model for GDP nowcasting. The network has
+shown good nowcasting performance among the competitors during the first
+evaluation period and achieved the overall best accuracy during the second
+evaluation period. Consequently, first in the literature, we propose the
+application of the 1D CNN for economic nowcasting.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2106.08901 by other authors
+
+
+
+
+
+
+
+ Min Hu, Zhizhong Tan, Bin Liu, Guosheng Yin
+
+
+ This study aims to address the challenges of futures price prediction in
+high-frequency trading (HFT) by proposing a continuous learning factor
+predictor based on graph neural networks. The model integrates multi-factor
+pricing theories with real-time market dynamics, effectively bypassing the
+limitations of existing methods that lack financial theory guidance and ignore
+various trend signals and their interactions. We propose three heterogeneous
+tasks, including price moving average regression, price gap regression and
+change-point detection to trace the short-, intermediate-, and long-term trend
+factors present in the data. In addition, this study also considers the
+cross-sectional correlation characteristics of future contracts, where prices
+of different futures often show strong dynamic correlations. Each variable
+(future contract) depends not only on its historical values (temporal) but also
+on the observation of other variables (cross-sectional). To capture these
+dynamic relationships more accurately, we resort to the spatio-temporal graph
+neural network (STGNN) to enhance the predictive power of the model. The model
+employs a continuous learning strategy to simultaneously consider these tasks
+(factors). Additionally, due to the heterogeneity of the tasks, we propose to
+calculate parameter importance with mutual information between original
+observations and the extracted features to mitigate the catastrophic forgetting
+(CF) problem. Empirical tests on 49 commodity futures in China's futures market
+demonstrate that the proposed model outperforms other state-of-the-art models
+in terms of prediction accuracy. Not only does this research promote the
+integration of financial theory and deep learning, but it also provides a
+scientific basis for actual trading decisions.
+
+
+
+
+
+
+
+ ♻ ☆ FP8-LM: Training FP8 Large Language Models
+
+
+ In this paper, we explore FP8 low-bit data formats for efficient training of
+large language models (LLMs). Our key insight is that most variables, such as
+gradients and optimizer states, in LLM training can employ low-precision data
+formats without compromising model accuracy and requiring no changes to
+hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision
+framework for training LLMs. This framework offers three levels of FP8
+utilization to streamline mixed-precision and distributed parallel training for
+LLMs. It gradually incorporates 8-bit gradients, optimizer states, and
+distributed learning in an incremental manner. Experiment results show that,
+during the training of GPT-175B model on H100 GPU platform, our FP8
+mixed-precision training framework not only achieved a remarkable 39% reduction
+in real memory usage but also ran 75% faster than the widely adopted BF16
+framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer
+Engine by 37%. This largely reduces the training costs for large foundation
+models. Furthermore, our FP8 mixed-precision training methodology is generic.
+It can be seamlessly applied to other tasks such as LLM instruction tuning and
+reinforcement learning with human feedback, offering savings in fine-tuning
+expenses. Our FP8 low-precision training framework is open-sourced at
+{https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.
+
+
+
+
+
+
+
+ ♻ ☆ Narrowing the Gap between Supervised and Unsupervised Sentence
+ Representation Learning with Large Language Model AAAI24
+
+
+
+
+
+
+
+
+ Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao
+
+
+ Sentence Representation Learning (SRL) is a fundamental task in Natural
+Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings
+(CSE) being the mainstream technique due to its superior performance. An
+intriguing phenomenon in CSE is the significant performance gap between
+supervised and unsupervised methods, with their only difference lying in the
+training data. Previous works attribute this performance gap to differences in
+two representation properties (alignment and uniformity). However, since
+alignment and uniformity only measure the results, they fail to answer "What
+aspects of the training data contribute to the performance gap?" and "How can
+the performance gap be narrowed?", In this paper, we conduct empirical
+experiments to answer these "What" and "How" questions. We first answer the
+"What" question by thoroughly comparing the behavior of supervised and
+unsupervised CSE during their respective training processes. From the
+comparison, we identify the similarity pattern as a key factor to the
+performance gap, and introduce a metric, called Relative Fitting Difficulty
+(RFD), to measure the complexity of the similarity pattern. Then, based on the
+insights gained from the "What" question, we tackle the "How" question by
+increasing the pattern complexity of the training data. We achieve this by
+leveraging the In-Context Learning (ICL) capability of the Large Language Model
+(LLM) to generate data that simulates complex patterns. By utilizing the
+hierarchical patterns in the LLM-generated data, we effectively narrow the gap
+between supervised and unsupervised CSE. We release our codes and appendix at
+https://github.com/BDBC-KG-NLP/NGCSE.
+
+
+
+ comment: Accepted at AAAI24
+
+
+
+
+
+
+ ♻ ☆ Multi-Agent Reinforcement Learning with Action Masking for UAV-enabled
+ Mobile Communications
+
+
+ Unmanned Aerial Vehicles (UAVs) are increasingly used as aerial base stations
+to provide ad hoc communications infrastructure. Building upon prior research
+efforts which consider either static nodes, 2D trajectories or single UAV
+systems, this paper focuses on the use of multiple UAVs for providing wireless
+communication to mobile users in the absence of terrestrial communications
+infrastructure. In particular, we jointly optimize UAV 3D trajectory and NOMA
+power allocation to maximize system throughput. Firstly, a weighted
+K-means-based clustering algorithm establishes UAV-user associations at regular
+intervals. The efficacy of training a novel Shared Deep Q-Network (SDQN) with
+action masking is then explored. Unlike training each UAV separately using DQN,
+the SDQN reduces training time by using the experiences of multiple UAVs
+instead of a single agent. We also show that SDQN can be used to train a
+multi-agent system with differing action spaces. Simulation results confirm
+that: 1) training a shared DQN outperforms a conventional DQN in terms of
+maximum system throughput (+20%) and training time (-10%); 2) it can converge
+for agents with different action spaces, yielding a 9% increase in throughput
+compared to mutual learning algorithms; and 3) combining NOMA with an SDQN
+architecture enables the network to achieve a better sum rate compared with
+existing baseline schemes.
+
+
+ Reinforcement learning (RL) often struggles to accomplish a sparse-reward
+long-horizon task in a complex environment. Goal-conditioned reinforcement
+learning (GCRL) has been employed to tackle this difficult problem via a
+curriculum of easy-to-reach sub-goals. In GCRL, exploring novel sub-goals is
+essential for the agent to ultimately find the pathway to the desired goal. How
+to explore novel sub-goals efficiently is one of the most challenging issues in
+GCRL. Several goal exploration methods have been proposed to address this issue
+but still struggle to find the desired goals efficiently. In this paper, we
+propose a novel learning objective by optimizing the entropy of both achieved
+and new goals to be explored for more efficient goal exploration in sub-goal
+selection based GCRL. To optimize this objective, we first explore and exploit
+the frequently occurring goal-transition patterns mined in the environments
+similar to the current task to compose skills via skill learning. Then, the
+pretrained skills are applied in goal exploration. Evaluation on a variety of
+spare-reward long-horizon benchmark tasks suggests that incorporating our
+method into several state-of-the-art GCRL baselines significantly boosts their
+exploration efficiency while improving or maintaining their performance. The
+source code is available at: https://github.com/GEAPS/GEAPS.
+
+
+
+ comment: Accepted for publication in Machine Learning (Springer): 35 pages, 15
+ figures
+
+
+
+
+
+
+
+ Peter Sorrenson, Felix Draxler, Armand Rousselot, Sander Hummerich, Lea Zimmermann, Ullrich Köthe
+
+
+ Normalizing Flows explicitly maximize a full-dimensional likelihood on the
+training data. However, real data is typically only supported on a
+lower-dimensional manifold leading the model to expend significant compute on
+modeling noise. Injective Flows fix this by jointly learning a manifold and the
+distribution on it. So far, they have been limited by restrictive architectures
+and/or high computational cost. We lift both constraints by a new efficient
+estimator for the maximum likelihood loss, compatible with free-form bottleneck
+architectures. We further show that naively learning both the data manifold and
+the distribution on it can lead to divergent solutions, and use this insight to
+motivate a stable maximum likelihood training objective. We perform extensive
+experiments on toy, tabular and image data, demonstrating the competitive
+performance of the resulting model.
+
+
+
+ comment: Resubmission of previous work: title and abstract have been changed
+ and new content has been added
+
+
+
+
+
+
+ ♻ ☆ Keep the Faith: Faithful Explanations in Convolutional Neural Networks
+ for Case-Based Reasoning AAAI
+
+
+
+
+
+
+
+
+ Tom Nuno Wolf, Fabian Bongratz, Anne-Marie Rickmann, Sebastian Pölsterl, Christian Wachinger
+
+
+ Explaining predictions of black-box neural networks is crucial when applied
+to decision-critical tasks. Thus, attribution maps are commonly used to
+identify important image regions, despite prior work showing that humans prefer
+explanations based on similar examples. To this end, ProtoPNet learns a set of
+class-representative feature vectors (prototypes) for case-based reasoning.
+During inference, similarities of latent features to prototypes are linearly
+classified to form predictions and attribution maps are provided to explain the
+similarity. In this work, we evaluate whether architectures for case-based
+reasoning fulfill established axioms required for faithful explanations using
+the example of ProtoPNet. We show that such architectures allow the extraction
+of faithful explanations. However, we prove that the attribution maps used to
+explain the similarities violate the axioms. We propose a new procedure to
+extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually,
+these explanations are Shapley values, calculated on the similarity scores of
+each prototype. They allow to faithfully answer which prototypes are present in
+an unseen image and quantify each pixel's contribution to that presence,
+thereby complying with all axioms. The theoretical violations of ProtoPNet
+manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs,
+RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50,
+ResNeXt50). Our experiments show a qualitative difference between the
+explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the
+explanations with the Area Over the Perturbation Curve, on which ProtoPFaith
+outperforms ProtoPNet on all experiments by a factor $>10^3$.
+
+
+
+ comment: To be published in proceedings of AAAI Conference on Artificial
+ Intelligence
+
+
+
+
+
+
+
+ Franz Thaler, Matthias A. F. Gsell, Gernot Plank, Martin Urschler
+
+
+ Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely
+established to assess the viability of myocardial tissue of patients after
+acute myocardial infarction (MI). We propose the Cascading Refinement CNN
+(CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that
+exploits the hierarchical structure of such labeled cardiac data. Throughout
+the three stages of the cascade, the label definition changes and CaRe-CNN
+learns to gradually refine its intermediate predictions accordingly.
+Furthermore, to obtain more consistent qualitative predictions, we propose a
+series of post-processing steps that take anatomical constraints into account.
+Our CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked
+second out of 18 participating teams. CaRe-CNN showed great improvements most
+notably when segmenting the difficult but clinically most relevant myocardial
+infarct tissue (MIT) as well as microvascular obstructions (MVO). When
+computing the average scores over all labels, our method obtained the best
+score in eight out of ten metrics. Thus, accurate cardiac segmentation after
+acute MI via our CaRe-CNN allows generating patient-specific models of the
+heart serving as an important step towards personalized medicine.
+
+
+
+ comment: Accepted at VISIGRAPP 2024, 12 pages
+
+
+
+
+
+
+ ♻ ☆ Recurrent Neural Language Models as Probabilistic Finite-state Automata
+
+
+ Studying language models (LMs) in terms of well-understood formalisms allows
+us to precisely characterize their abilities and limitations. Previous work has
+investigated the representational capacity of recurrent neural network (RNN)
+LMs in terms of their capacity to recognize unweighted formal languages.
+However, LMs do not describe unweighted formal languages -- rather, they define
+\emph{probability distributions} over strings. In this work, we study what
+classes of such probability distributions RNN LMs can represent, which allows
+us to make more direct statements about their capabilities. We show that simple
+RNNs are equivalent to a subclass of probabilistic finite-state automata, and
+can thus model a strict subset of probability distributions expressible by
+finite-state models. Furthermore, we study the space complexity of representing
+finite-state LMs with RNNs. We show that, to represent an arbitrary
+deterministic finite-state LM with $N$ states over an alphabet $\alphabet$, an
+RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a
+first step towards characterizing the classes of distributions RNN LMs can
+represent and thus help us understand their capabilities and limitations.
+
+
+
+ comment: 9 pages
+
+
+
+
+
+
+ ♻ ☆ On the Efficacy of Differentially Private Few-shot Image Classification
+
+
+
+
+
+
+
+
+ Marlon Tobaben, Aliaksandra Shysheya, John Bronskill, Andrew Paverd, Shruti Tople, Santiago Zanella-Beguelin, Richard E Turner, Antti Honkela
+
+
+ There has been significant recent progress in training differentially private
+(DP) models which achieve accuracy that approaches the best non-private models.
+These DP models are typically pretrained on large public datasets and then
+fine-tuned on private downstream datasets that are relatively large and similar
+in distribution to the pretraining data. However, in many applications
+including personalization and federated learning, it is crucial to perform well
+(i) in the few-shot setting, as obtaining large amounts of labeled data may be
+problematic; and (ii) on datasets from a wide variety of domains for use in
+various specialist settings. To understand under which conditions few-shot DP
+can be effective, we perform an exhaustive set of experiments that reveals how
+the accuracy and vulnerability to attack of few-shot DP image classification
+models are affected as the number of shots per class, privacy level, model
+architecture, downstream dataset, and subset of learnable parameters in the
+model vary. We show that to achieve DP accuracy on par with non-private models,
+the shots per class must be increased as the privacy level increases. We also
+show that learning parameter-efficient FiLM adapters under DP is competitive
+with learning just the final classifier layer or learning all of the network
+parameters. Finally, we evaluate DP federated learning systems and establish
+state-of-the-art performance on the challenging FLAIR benchmark.
+
+
+
+ comment: 49 pages, 24 figures; published in TMLR 12/2023
+ https://openreview.net/forum?id=hFsr59Imzm
+
+
+
+
+
+
+ ♻ ☆ Generalizing Adam to Manifolds for Efficiently Training Transformers
+
+
+ One of the primary reasons behind the success of neural networks has been the
+emergence of an array of new, highly-successful optimizers, perhaps most
+importantly the Adam optimizer. It is wiedely used for training neural
+networks, yet notoriously hard to interpret. Lacking a clear physical
+intuition, Adam is difficult to generalize to manifolds. Some attempts have
+been made to directly apply parts of the Adam algorithm to manifolds or to find
+an underlying structure, but a full generalization has remained elusive. In
+this work a new approach is presented that leverages the special structure of
+the manifolds which are relevant for optimization of neural networks, such as
+the Stiefel manifold, the symplectic Stiefel manifold, the Grassmann manifold
+and the symplectic Grassmann manifold: all of these are homogeneous spaces and
+as such admit a global tangent space representation. This global tangent space
+representation is used to perform all of the steps in the Adam optimizer. The
+resulting algorithm is then applied to train a transformer for which
+orthogonality constraints are enforced up to machine precision and we observe
+significant speed-ups in the training process. Optimization of neural networks
+where they weights do not lie on a manifold is identified as a special case of
+the presented framkework. This allows for a flexible implementation in which
+the learning rate is adapted simultaneously for all parameters, irrespective of
+whether they are an element of a general manifold or a vector space.
+
+
+
+ comment: 19 pages, 4 figures, was presented at Enumath2023
+
+
+
+
+
+
+ ♻ ☆ Federated Best Arm Identification with Heterogeneous Clients
+
+
+
+
+
+
+
+
+ Zhirui Chen, P. N. Karthik, Vincent Y. F. Tan, Yeow Meng Chee
+
+
+ We study best arm identification in a federated multi-armed bandit setting
+with a central server and multiple clients, when each client has access to a
+{\em subset} of arms and each arm yields independent Gaussian observations. The
+goal is to identify the best arm of each client subject to an upper bound on
+the error probability; here, the best arm is one that has the largest {\em
+average} value of the means averaged across all clients having access to the
+arm. Our interest is in the asymptotics as the error probability vanishes. We
+provide an asymptotic lower bound on the growth rate of the expected stopping
+time of any algorithm. Furthermore, we show that for any algorithm whose upper
+bound on the expected stopping time matches with the lower bound up to a
+multiplicative constant ({\em almost-optimal} algorithm), the ratio of any two
+consecutive communication time instants must be {\em bounded}, a result that is
+of independent interest. We thereby infer that an algorithm can communicate no
+more sparsely than at exponential time instants in order to be almost-optimal.
+For the class of almost-optimal algorithms, we present the first-of-its-kind
+asymptotic lower bound on the expected number of {\em communication rounds}
+until stoppage. We propose a novel algorithm that communicates at exponential
+time instants, and demonstrate that it is asymptotically almost-optimal.
+
+
+ While reinforcement learning has shown experimental success in a number of
+applications, it is known to be sensitive to noise and perturbations in the
+parameters of the system, leading to high variance in the total reward amongst
+different episodes in slightly different environments. To introduce robustness,
+as well as sample efficiency, risk-sensitive reinforcement learning methods are
+being thoroughly studied. In this work, we provide a definition of robust
+reinforcement learning policies and formulate a risk-sensitive reinforcement
+learning problem to approximate them, by solving an optimization problem with
+respect to a modified objective based on exponential criteria. In particular,
+we study a model-free risk-sensitive variation of the widely-used Monte Carlo
+Policy Gradient algorithm and introduce a novel risk-sensitive online
+Actor-Critic algorithm based on solving a multiplicative Bellman equation using
+stochastic approximation updates. Analytical results suggest that the use of
+exponential criteria generalizes commonly used ad-hoc regularization
+approaches, improves sample efficiency, and introduces robustness with respect
+to perturbations in the model parameters and the environment. The
+implementation, performance, and robustness properties of the proposed methods
+are evaluated in simulated experiments.
+
+
+
+
+
+
+
+ ♻ ☆ Meta-Referential Games to Learn Compositional Learning Behaviours
+
+
+
+
+
+
+
+
+ Kevin Denamganaï, Sondess Missaoui, James Alfred Walker
+
+
+ Human beings use compositionality to generalise from past experiences to
+novel experiences. We assume a separation of our experiences into fundamental
+atomic components that can be recombined in novel ways to support our ability
+to engage with novel experiences. We frame this as the ability to learn to
+generalise compositionally, and we will refer to behaviours making use of this
+ability as compositional learning behaviours (CLBs). A central problem to
+learning CLBs is the resolution of a binding problem (BP). While it is another
+feat of intelligence that human beings perform with ease, it is not the case
+for state-of-the-art artificial agents. Thus, in order to build artificial
+agents able to collaborate with human beings, we propose to develop a novel
+benchmark to investigate agents' abilities to exhibit CLBs by solving a
+domain-agnostic version of the BP. We take inspiration from the language
+emergence and grounding framework of referential games and propose a
+meta-learning extension of referential games, entitled Meta-Referential Games,
+and use this framework to build our benchmark, the Symbolic Behaviour Benchmark
+(S2B). We provide baseline results and error analysis showing that our
+benchmark is a compelling challenge that we hope will spur the research
+community towards developing more capable artificial agents.
+
+
+
+ comment: work in progress
+
+
+
+
+
+
+ ♻ ☆ Detecting fake accounts through Generative Adversarial Network in online
+ social media
+
+
+
+
+
+
+
+
+ Jinus Bordbar, Mohammadreza Mohammadrezaie, Saman Ardalan, Mohammad Ebrahim Shiri
+
+
+ Online social media is integral to human life, facilitating messaging,
+information sharing, and confidential communication while preserving privacy.
+Platforms like Twitter, Instagram, and Facebook exemplify this phenomenon.
+However, users face challenges due to network anomalies, often stemming from
+malicious activities such as identity theft for financial gain or harm. This
+paper proposes a novel method using user similarity measures and the Generative
+Adversarial Network (GAN) algorithm to identify fake user accounts in the
+Twitter dataset. Despite the problem's complexity, the method achieves an AUC
+rate of 80\% in classifying and detecting fake accounts. Notably, the study
+builds on previous research, highlighting advancements and insights into the
+evolving landscape of anomaly detection in online social networks.
+
+
+
+ comment: need more investigation on the paper
+
+
+
+
+
+
+ ♻ ☆ Fake detection in imbalance dataset by Semi-supervised learning with GAN
+
+
+ As social media continues to grow rapidly, the prevalence of harassment on
+these platforms has also increased. This has piqued the interest of researchers
+in the field of fake detection. Social media data, often forms complex graphs
+with numerous nodes, posing several challenges. These challenges and
+limitations include dealing with a significant amount of irrelevant features in
+matrices and addressing issues such as high data dispersion and an imbalanced
+class distribution within the dataset. To overcome these challenges and
+limitations, researchers have employed auto-encoders and a combination of
+semi-supervised learning with a GAN algorithm, referred to as SGAN. Our
+proposed method utilizes auto-encoders for feature extraction and incorporates
+SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN
+compensates for the limited availability of labeled data, making efficient use
+of the limited number of labeled instances. Multiple evaluation metrics were
+employed, including the Confusion Matrix and the ROC curve. The dataset was
+divided into training and testing sets, with 100 labeled samples for training
+and 1,000 samples for testing. The novelty of our research lies in applying
+SGAN to address the issue of imbalanced datasets in fake account detection. By
+optimizing the use of a smaller number of labeled instances and reducing the
+need for extensive computational power, our method offers a more efficient
+solution. Additionally, our study contributes to the field by achieving an 81%
+accuracy in detecting fake accounts using only 100 labeled samples. This
+demonstrates the potential of SGAN as a powerful tool for handling minority
+classes and addressing big data challenges in fake account detection.
+
+
+ We introduce Neuro-Symbolic Continual Learning, where a model has to solve a
+sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to
+high-level concepts and compute predictions by reasoning consistently with
+prior knowledge. Our key observation is that neuro-symbolic tasks, although
+different, often share concepts whose semantics remains stable over time.
+Traditional approaches fall short: existing continual strategies ignore
+knowledge altogether, while stock neuro-symbolic architectures suffer from
+catastrophic forgetting. We show that leveraging prior knowledge by combining
+neuro-symbolic architectures with continual strategies does help avoid
+catastrophic forgetting, but also that doing so can yield models affected by
+reasoning shortcuts. These undermine the semantics of the acquired concepts,
+even when detailed prior knowledge is provided upfront and inference is exact,
+and in turn continual performance. To overcome these issues, we introduce COOL,
+a COncept-level cOntinual Learning strategy tailored for neuro-symbolic
+continual problems that acquires high-quality concepts and remembers them over
+time. Our experiments on three novel benchmarks highlights how COOL attains
+sustained high performance on neuro-symbolic continual learning tasks in which
+other strategies fail.
+
+
+
+ comment: 40th International Conference on Machine Learning (ICML 2023)
+
+
+
+
+
+
+ ♻ ☆ Hierarchical Autoregressive Modeling for Neural Video Compression ICLR 2021
+
+
+
+
+
+
+
+
+ Ruihan Yang, Yibo Yang, Joseph Marino, Stephan Mandt
+
+
+ Recent work by Marino et al. (2020) showed improved performance in sequential
+density estimation by combining masked autoregressive flows with hierarchical
+latent variable models. We draw a connection between such autoregressive
+generative models and the task of lossy video compression. Specifically, we
+view recent neural video compression methods (Lu et al., 2019; Yang et al.,
+2020b; Agustssonet al., 2020) as instances of a generalized stochastic temporal
+autoregressive transform, and propose avenues for enhancement based on this
+insight. Comprehensive evaluations on large-scale video data show improved
+rate-distortion performance over both state-of-the-art neural and conventional
+video compression methods.
+
+
+
+ comment: Published as a conference paper at ICLR 2021
+
+ Federated learning (FL) is an emerging approach for training machine learning
+models collaboratively while preserving data privacy. The need for privacy
+protection makes it difficult for FL models to achieve global transparency and
+explainability. To address this limitation, we incorporate logic-based
+explanations into FL by proposing the Logical Reasoning-based eXplainable
+Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local
+logic rules based on their local data and send them, along with model updates,
+to the FL server. The FL server connects the local logic rules through a proper
+logical connector that is derived based on properties of client data, without
+requiring access to the raw data. In addition, the server also aggregates the
+local model updates with weight values determined by the quality of the
+clients' local data as reflected by their uploaded logic rules. The results
+show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and
+5.41% in terms of classification accuracy, rule accuracy and rule fidelity,
+respectively. The explicit rule evaluation and expression under LR-XFL enable
+human experts to validate and correct the rules on the server side, hence
+improving the global FL model's robustness to errors. It has the potential to
+enhance the transparency of FL models for areas like healthcare and finance
+where both data privacy and explainability are important.
+
+
+
+
+
+
+
+ ♻ ☆ Small Dataset, Big Gains: Enhancing Reinforcement Learning by Offline
+ Pre-Training with Model Based Augmentation
+
+
+
+
+
+
+
+
+ Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov
+
+
+ Offline reinforcement learning leverages pre-collected datasets of
+transitions to train policies. It can serve as effective initialization for
+online algorithms, enhancing sample efficiency and speeding up convergence.
+However, when such datasets are limited in size and quality, offline
+pre-training can produce sub-optimal policies and lead to degraded online
+reinforcement learning performance. In this paper we propose a model-based data
+augmentation strategy to maximize the benefits of offline reinforcement
+learning pre-training and reduce the scale of data needed to be effective. Our
+approach leverages a world model of the environment trained on the offline
+dataset to augment states during offline pre-training. We evaluate our approach
+on a variety of MuJoCo robotic tasks and our results show it can jump-start
+online fine-tuning and substantially reduce - in some cases by an order of
+magnitude - the required number of environment interactions.
+
+
+
+
+
+
+
+ ♻ ☆ Asymmetric Norms to Approximate the Minimum Action Distance
+
+
+
+
+
+
+
+
+ Lorenzo Steccanella, Anders Jonsson
+
+
+ This paper presents a state representation for reward-free Markov decision
+processes. The idea is to learn, in a self-supervised manner, an embedding
+space where distances between pairs of embedded states correspond to the
+minimum number of actions needed to transition between them. Unlike previous
+methods, our approach incorporates an asymmetric norm parametrization, enabling
+accurate approximations of minimum action distances in environments with
+inherent asymmetry. We show how this representation can be leveraged to learn
+goal-conditioned policies, providing a notion of similarity between states and
+goals and a useful heuristic distance to guide planning. To validate our
+approach, we conduct empirical experiments on both symmetric and asymmetric
+environments. Our results show that our asymmetric norm parametrization
+performs comparably to symmetric norms in symmetric environments and surpasses
+symmetric norms in asymmetric environments.
+
+
+
+
+
+
+
+ ♻ ☆ Supervision Interpolation via LossMix: Generalizing Mixup for Object
+ Detection and Beyond AAAI-24
+
+
+ The success of data mixing augmentations in image classification tasks has
+been well-received. However, these techniques cannot be readily applied to
+object detection due to challenges such as spatial misalignment,
+foreground/background distinction, and plurality of instances. To tackle these
+issues, we first introduce a novel conceptual framework called Supervision
+Interpolation (SI), which offers a fresh perspective on interpolation-based
+augmentations by relaxing and generalizing Mixup. Based on SI, we propose
+LossMix, a simple yet versatile and effective regularization that enhances the
+performance and robustness of object detectors and more. Our key insight is
+that we can effectively regularize the training on mixed data by interpolating
+their loss errors instead of ground truth labels. Empirical results on the
+PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently
+outperform state-of-the-art methods widely adopted for detection. Furthermore,
+by jointly leveraging LossMix with unsupervised domain adaptation, we
+successfully improve existing approaches and set a new state of the art for
+cross-domain object detection.
+
+
+
+ comment: AAAI-24 Camera Ready Version, with supplementary material, 15 pages
+
+
+
+
+
+
+ ♻ ☆ Ad-load Balancing via Off-policy Learning in a Content Marketplace RecSys
+ '23
+
+
+ Ad-load balancing is a critical challenge in online advertising systems,
+particularly in the context of social media platforms, where the goal is to
+maximize user engagement and revenue while maintaining a satisfactory user
+experience. This requires the optimization of conflicting objectives, such as
+user satisfaction and ads revenue. Traditional approaches to ad-load balancing
+rely on static allocation policies, which fail to adapt to changing user
+preferences and contextual factors. In this paper, we present an approach that
+leverages off-policy learning and evaluation from logged bandit feedback. We
+start by presenting a motivating analysis of the ad-load balancing problem,
+highlighting the conflicting objectives between user satisfaction and ads
+revenue. We emphasize the nuances that arise due to user heterogeneity and the
+dependence on the user's position within a session. Based on this analysis, we
+define the problem as determining the optimal ad-load for a particular feed
+fetch. To tackle this problem, we propose an off-policy learning framework that
+leverages unbiased estimators such as Inverse Propensity Scoring (IPS) and
+Doubly Robust (DR) to learn and estimate the policy values using offline
+collected stochastic data. We present insights from online A/B experiments
+deployed at scale across over 80 million users generating over 200 million
+sessions, where we find statistically significant improvements in both user
+satisfaction metrics and ads revenue for the platform.
+
+
+
+ comment: Early version presented at the CONSEQUENCES '23 workshop at RecSys
+ '23, final version appearing at WSDM '24
+
+
+
+
+
+
+ ♻ ☆ Identifying Label Errors in Object Detection Datasets by Loss Inspection
+
+
+
+
+
+
+
+
+ Marius Schubert, Tobias Riedlinger, Karsten Kahl, Daniel Kröll, Sebastian Schoenen, Siniša Šegvić, Matthias Rottmann
+
+
+ Labeling datasets for supervised object detection is a dull and
+time-consuming task. Errors can be easily introduced during annotation and
+overlooked during review, yielding inaccurate benchmarks and performance
+degradation of deep neural networks trained on noisy labels. In this work, we
+for the first time introduce a benchmark for label error detection methods on
+object detection datasets as well as a label error detection method and a
+number of baselines. We simulate four different types of randomly introduced
+label errors on train and test sets of well-labeled object detection datasets.
+For our label error detection method we assume a two-stage object detector to
+be given and consider the sum of both stages' classification and regression
+losses. The losses are computed with respect to the predictions and the noisy
+labels including simulated label errors, aiming at detecting the latter. We
+compare our method to three baselines: a naive one without deep learning, the
+object detector's score and the entropy of the classification softmax
+distribution. We outperform all baselines and demonstrate that among the
+considered methods, ours is the only one that detects label errors of all four
+types efficiently. Furthermore, we detect real label errors a) on commonly used
+test datasets in object detection and b) on a proprietary dataset. In both
+cases we achieve low false positives rates, i.e., we detect label errors with a
+precision for a) of up to 71.5% and for b) with 97%.
+
+
+ Conventional Federated Domain Adaptation (FDA) approaches usually demand an
+abundance of assumptions, which makes them significantly less feasible for
+real-world situations and introduces security hazards. This paper relaxes the
+assumptions from previous FDAs and studies a more practical scenario named
+Universal Federated Domain Adaptation (UFDA). It only requires the black-box
+model and the label set information of each source domain, while the label sets
+of different source domains could be inconsistent, and the target-domain label
+set is totally blind. Towards a more effective solution for our newly proposed
+UFDA scenario, we propose a corresponding methodology called Hot-Learning with
+Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain
+shifts and category gaps problems by using one-hot outputs from the black-box
+models of various source domains. Moreover, to better distinguish the shared
+and unknown classes, we further present a cluster-level strategy named
+Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer
+classes from both source and target domains. Extensive experiments on three
+benchmark datasets demonstrate that our method achieves comparable performance
+for our UFDA scenario with much fewer assumptions, compared to previous
+methodologies with comprehensive additional assumptions.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+ ♻ ☆ Hire When You Need to: Gradual Participant Recruitment for Auction-based
+ Federated Learning
+
+
+ The success of Federated Learning (FL) depends on the quantity and quality of
+the data owners (DOs) as well as their motivation to join FL model training.
+Reputation-based FL participant selection methods have been proposed. However,
+they still face the challenges of the cold start problem and potential
+selection bias towards highly reputable DOs. Such a bias can result in lower
+reputation DOs being prematurely excluded from future FL training rounds,
+thereby reducing the diversity of training data and the generalizability of the
+resulting models. To address these challenges, we propose the Gradual
+Participant Selection scheme for Auction-based Federated Learning (GPS-AFL).
+Unlike existing AFL incentive mechanisms which generally assume that all DOs
+required for an FL task must be selected in one go, GPS-AFL gradually selects
+the required DOs over multiple rounds of training as more information is
+revealed through repeated interactions. It is designed to strike a balance
+between cost saving and performance enhancement, while mitigating the drawbacks
+of selection bias in reputation-based FL. Extensive experiments based on
+real-world datasets demonstrate the significant advantages of GPS-AFL, which
+reduces costs by 33.65% and improved total utility by 2.91%, on average
+compared to the best-performing state-of-the-art approach.
+
+
+
+ comment: 9 Pages, 3 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ Graph Attention-based Deep Reinforcement Learning for solving the
+ Chinese Postman Problem with Load-dependent costs
+
+
+ Recently, Deep reinforcement learning (DRL) models have shown promising
+results in solving routing problems. However, most DRL solvers are commonly
+proposed to solve node routing problems, such as the Traveling Salesman Problem
+(TSP). Meanwhile, there has been limited research on applying neural methods to
+arc routing problems, such as the Chinese Postman Problem (CPP), since they
+often feature irregular and complex solution spaces compared to TSP. To fill
+these gaps, this paper proposes a novel DRL framework to address the CPP with
+load-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc
+routing problem with load constraints. The novelty of our method is two-fold.
+First, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential
+model. Subsequently, we introduce an autoregressive model based on DRL, namely
+Arc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge
+effectively. Such a framework allows the DRL model to work efficiently and
+scalably to arc routing problems. Furthermore, we propose a new bio-inspired
+meta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC.
+Extensive experiments show that Arc-DRL outperforms existing meta-heuristic
+methods such as Iterative Local Search (ILS) and Variable Neighborhood Search
+(VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for
+CPP-LC regarding both solution quality and running time; while the EA gives the
+best solution quality with much more running time. We release our C++
+implementations for metaheuristics such as EA, ILS and VNS along with the code
+for data generation and our generated data at
+https://github.com/HySonLab/Chinese_Postman_Problem
+
+
+
+
+
+
+
+ ♻ ☆ MISA: Unveiling the Vulnerabilities in Split Federated Learning ICASSP 2024
+
+
+
+
+
+
+
+
+ Wei Wan, Yuxuan Ning, Shengshan Hu, Lulu Xue, Minghui Li, Leo Yu Zhang, Hai Jin
+
+
+ \textit{Federated learning} (FL) and \textit{split learning} (SL) are
+prevailing distributed paradigms in recent years. They both enable shared
+global model training while keeping data localized on users' devices. The
+former excels in parallel execution capabilities, while the latter enjoys low
+dependence on edge computing resources and strong privacy protection.
+\textit{Split federated learning} (SFL) combines the strengths of both FL and
+SL, making it one of the most popular distributed architectures. Furthermore, a
+recent study has claimed that SFL exhibits robustness against poisoning
+attacks, with a fivefold improvement compared to FL in terms of robustness.
+ In this paper, we present a novel poisoning attack known as MISA. It poisons
+both the top and bottom models, causing a \textbf{\underline{misa}}lignment in
+the global model, ultimately leading to a drastic accuracy collapse. This
+attack unveils the vulnerabilities in SFL, challenging the conventional belief
+that SFL is robust against poisoning attacks. Extensive experiments demonstrate
+that our proposed MISA poses a significant threat to the availability of SFL,
+underscoring the imperative for academia and industry to accord this matter due
+attention.
+
+
+
+ comment: This paper has been accepted by the IEEE International Conference on
+ Acoustics, Speech, and Signal Processing (ICASSP 2024)
+
+
+
+
+
+
+ ♻ ☆ Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding ICASSP2024
+
+
+ Drawing on theoretical insights, we advocate an error-based thresholding
+(EBT) mechanism for learned ISTA (LISTA), which utilizes a function of the
+layer-wise reconstruction error to suggest a specific threshold for each
+observation in the shrinkage function of each layer. We show that the proposed
+EBT mechanism well disentangles the learnable parameters in the shrinkage
+functions from the reconstruction errors, endowing the obtained models with
+improved adaptivity to possible data variations. With rigorous analyses, we
+further show that the proposed EBT also leads to a faster convergence on the
+basis of LISTA or its variants, in addition to its higher adaptivity. Extensive
+experimental results confirm our theoretical analyses and verify the
+effectiveness of our methods.
+
+
+ We provide a systematic investigation of using physics-informed neural
+networks to compute Lyapunov functions. We encode Lyapunov conditions as a
+partial differential equation (PDE) and use this for training neural network
+Lyapunov functions. We analyze the analytical properties of the solutions to
+the Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov
+equation in training neural Lyapunov functions can lead to approximate regions
+of attraction close to the true domain of attraction. We also examine
+approximation errors and the convergence of neural approximations to the unique
+solution of Zubov's equation. We then provide sufficient conditions for the
+learned neural Lyapunov functions that can be readily verified by
+satisfiability modulo theories (SMT) solvers, enabling formal verification of
+both local stability analysis and region-of-attraction estimates in the large.
+Through a number of nonlinear examples, ranging from low to high dimensions, we
+demonstrate that the proposed framework can outperform traditional
+sums-of-squares (SOS) Lyapunov functions obtained using semidefinite
+programming (SDP).
+
+
+
+ comment: The current version has been submitted for publication
+
+
+
+
+
+
+ ♻ ☆ Counter-Empirical Attacking based on Adversarial Reinforcement Learning
+ for Time-Relevant Scoring System
+
+
+
+
+
+
+
+
+ Xiangguo Sun, Hong Cheng, Hang Dong, Bo Qiao, Si Qin, Qingwei Lin
+
+
+ Scoring systems are commonly seen for platforms in the era of big data. From
+credit scoring systems in financial services to membership scores in E-commerce
+shopping platforms, platform managers use such systems to guide users towards
+the encouraged activity pattern, and manage resources more effectively and more
+efficiently thereby. To establish such scoring systems, several "empirical
+criteria" are firstly determined, followed by dedicated top-down design for
+each factor of the score, which usually requires enormous effort to adjust and
+tune the scoring function in the new application scenario. What's worse, many
+fresh projects usually have no ground-truth or any experience to evaluate a
+reasonable scoring system, making the designing even harder. To reduce the
+effort of manual adjustment of the scoring function in every new scoring
+system, we innovatively study the scoring system from the preset empirical
+criteria without any ground truth, and propose a novel framework to improve the
+system from scratch. In this paper, we propose a "counter-empirical attacking"
+mechanism that can generate "attacking" behavior traces and try to break the
+empirical rules of the scoring system. Then an adversarial "enhancer" is
+applied to evaluate the scoring system and find the improvement strategy. By
+training the adversarial learning problem, a proper scoring function can be
+learned to be robust to the attacking activity traces that are trying to
+violate the empirical criteria. Extensive experiments have been conducted on
+two scoring systems including a shared computing resource platform and a
+financial credit system. The experimental results have validated the
+effectiveness of our proposed framework.
+
+
+ Learning from corrupted labels is very common in real-world machine-learning
+applications. Memorizing such noisy labels could affect the learning of the
+model, leading to sub-optimal performances. In this work, we propose a novel
+framework to learn robust machine-learning models from noisy labels. Through an
+empirical study, we find that different models make relatively similar
+predictions on clean examples, while the predictions on noisy examples vary
+much more across different models. Motivated by this observation, we propose
+\em denoising with cross-model agreement \em (DeCA) which aims to minimize the
+KL-divergence between the true label distributions parameterized by two machine
+learning models while maximizing the likelihood of data observation. We employ
+the proposed DeCA on both the binary label scenario and the multiple label
+scenario. For the binary label scenario, we select implicit feedback
+recommendation as the downstream task and conduct experiments with four
+state-of-the-art recommendation models on four datasets. For the multiple-label
+scenario, the downstream application is image classification on two benchmark
+datasets. Experimental results demonstrate that the proposed methods
+significantly improve the model performance compared with normal training and
+other denoising methods on both binary and multiple-label scenarios.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2105.09605
+
+
+
+
+
+
+ ♻ ☆ Perspectives on the State and Future of Deep Learning -- 2023
+
+
+
+
+
+
+
+
+ Micah Goldblum, Anima Anandkumar, Richard Baraniuk, Tom Goldstein, Kyunghyun Cho, Zachary C Lipton, Melanie Mitchell, Preetum Nakkiran, Max Welling, Andrew Gordon Wilson
+
+
+ The goal of this series is to chronicle opinions and issues in the field of
+machine learning as they stand today and as they change over time. The plan is
+to host this survey periodically until the AI singularity
+paperclip-frenzy-driven doomsday, keeping an updated list of topical questions
+and interviewing new community members for each edition. In this issue, we
+probed people's opinions on interpretable AI, the value of benchmarking in
+modern NLP, the state of progress towards understanding deep learning, and the
+future of academia.
+
+
+
+
+
+
+
+
+ Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, Jun Wang
+
+
+ Learning disentanglement aims at finding a low dimensional representation
+which consists of multiple explanatory and generative factors of the
+observational data. The framework of variational autoencoder (VAE) is commonly
+used to disentangle independent factors from observations. However, in real
+scenarios, factors with semantics are not necessarily independent. Instead,
+there might be an underlying causal structure which renders these factors
+dependent. We thus propose a new VAE based framework named CausalVAE, which
+includes a Causal Layer to transform independent exogenous factors into causal
+endogenous ones that correspond to causally related concepts in data. We
+further analyze the model identifiabitily, showing that the proposed model
+learned from observations recovers the true one up to a certain degree by
+providing supervision signals (e.g. feature labels). Experiments are conducted
+on various datasets, including synthetic and real word benchmark CelebA.
+Results show that the causal representations learned by CausalVAE are
+semantically interpretable, and their causal relationship as a Directed Acyclic
+Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that
+the proposed CausalVAE model is able to generate counterfactual data through
+"do-operation" to the causal factors.
+
+
+
+
+
+
+
+ ♻ ☆ Drift Control of High-Dimensional RBM: A Computational Method Based on
+ Neural Networks
+
+
+
+
+
+
+
+
+ Baris Ata, J. Michael Harrison, Nian Si
+
+
+ Motivated by applications in queueing theory, we consider a stochastic
+control problem whose state space is the $d$-dimensional positive orthant. The
+controlled process $Z$ evolves as a reflected Brownian motion whose covariance
+matrix is exogenously specified, as are its directions of reflection from the
+orthant's boundary surfaces. A system manager chooses a drift vector
+$\theta(t)$ at each time $t$ based on the history of $Z$, and the cost rate at
+time $t$ depends on both $Z(t)$ and $\theta(t)$. In our initial problem
+formulation, the objective is to minimize expected discounted cost over an
+infinite planning horizon, after which we treat the corresponding ergodic
+control problem. Extending earlier work by Han et al. (Proceedings of the
+National Academy of Sciences, 2018, 8505-8510), we develop and illustrate a
+simulation-based computational method that relies heavily on deep neural
+network technology. For test problems studied thus far, our method is accurate
+to within a fraction of one percent, and is computationally feasible in
+dimensions up to at least $d=30$.
+
+
+ Pseudo Labeling is a technique used to improve the performance of
+semi-supervised Graph Neural Networks (GNNs) by generating additional
+pseudo-labels based on confident predictions. However, the quality of generated
+pseudo-labels has been a longstanding concern due to the sensitivity of the
+classification objective with respect to the given labels. To avoid the
+untrustworthy classification supervision indicating ``a node belongs to a
+specific class,'' we favor the fault-tolerant contrasting supervision
+demonstrating ``two nodes do not belong to the same class.'' Thus, the problem
+of generating high-quality pseudo-labels is then transformed into a relaxed
+version, i.e., identifying reliable negative pairs. To achieve this, we propose
+a general framework for GNNs, termed Pseudo Contrastive Learning (PCL). It
+separates two nodes whose positive and negative pseudo-labels target the same
+class. To incorporate topological knowledge into learning, we devise a
+topologically weighted contrastive loss that spends more effort separating
+negative pairs with smaller topological distances. Experimentally, we apply PCL
+to various GNNs, which consistently outperform their counterparts using other
+popular general techniques on five real-world graphs.
+
+
+
+ comment: Under Review
+
+
+
+
+
+
+ ♻ ☆ Towards Consistent Stochastic Human Motion Prediction via Motion
+ Diffusion
+
+
+ Stochastic Human Motion Prediction (HMP) aims to predict multiple possible
+upcoming pose sequences based on past human motion trajectories. Although
+previous approaches have shown impressive performance, they face several
+issues, including complex training processes and a tendency to generate
+predictions that are often inconsistent with the provided history, and
+sometimes even becoming entirely unreasonable. To overcome these issues, we
+propose DiffMotion, an end-to-end diffusion-based stochastic HMP framework.
+DiffMotion's motion predictor is composed of two modules, including (1) a
+Transformer-based network for initial motion reconstruction from corrupted
+motion, and (2) a Graph Convolutional Network (GCN) to refine the generated
+motion considering past observations. Our method, facilitated by this novel
+Transformer-GCN module design and a proposed variance scheduler, excels in
+predicting accurate, realistic, and consistent motions, while maintaining an
+appropriate level of diversity. Our results on benchmark datasets show that
+DiffMotion significantly outperforms previous methods in terms of both accuracy
+and fidelity, while demonstrating superior robustness.
+
+
+
+
+
+
+
+ ♻ ☆ Double Machine Learning for Static Panel Models with Fixed Effects
+
+
+ Machine Learning (ML) algorithms are powerful data-driven tools for
+approximating highdimensional or non-linear nuisance functions which are useful
+in practice because the true functional form of the predictors is ex-ante
+unknown. In this paper, we develop estimators of policy interventions from
+panel data which allow for non-linear effects of the confounding regressors,
+and investigate the performance of these estimators using three well-known ML
+algorithms, specifically, LASSO, classification and regression trees, and
+random forests. We use Double Machine Learning (DML) (Chernozhukov et al.,
+2018) for the estimation of causal effects of homogeneous treatments with
+unobserved individual heterogeneity (fixed effects) and no unobserved
+confounding by extending Robinson (1988)'s partially linear regression model.
+We develop three alternative approaches for handling unobserved individual
+heterogeneity based on extending the within-group estimator, first-difference
+estimator, and correlated random effect estimator (Mundlak, 1978) for
+non-linear models. Using Monte Carlo simulations, we find that conventional
+least squares estimators can perform well even if the data generating process
+is nonlinear, but there are substantial performance gains in terms of bias
+reduction under a process where the true effect of the regressors is non-linear
+and discontinuous. However, for the same scenarios, we also find - despite
+extensive hyperparameter tuning - inference to be problematic for both
+tree-based learners because these lead to highly non-normal estimator
+distributions and the estimator variance being severely under-estimated. This
+contradicts the performance of trees in other circumstances and requires
+further investigation. Finally, we provide an illustrative example of DML for
+observational panel data showing the impact of the introduction of the national
+minimum wage in the UK.
+
+
+
+
+
+
+
+ ♻ ☆ Multipoint-BAX: A New Approach for Efficiently Tuning Particle
+ Accelerator Emittance via Virtual Objectives
+
+
+
+
+
+
+
+
+ Sara A. Miskovich, Willie Neiswanger, William Colocho, Claudio Emma, Jacqueline Garrahan, Timothy Maxwell, Christopher Mayes, Stefano Ermon, Auralee Edelen, Daniel Ratner
+
+
+ Although beam emittance is critical for the performance of high-brightness
+accelerators, optimization is often time limited as emittance calculations,
+commonly done via quadrupole scans, are typically slow. Such calculations are a
+type of $\textit{multipoint query}$, i.e. each query requires multiple
+secondary measurements. Traditional black-box optimizers such as Bayesian
+optimization are slow and inefficient when dealing with such objectives as they
+must acquire the full series of measurements, but return only the emittance,
+with each query. We propose a new information-theoretic algorithm,
+Multipoint-BAX, for black-box optimization on multipoint queries, which queries
+and models individual beam-size measurements using techniques from Bayesian
+Algorithm Execution (BAX). Our method avoids the slow multipoint query on the
+accelerator by acquiring points through a $\textit{virtual objective}$, i.e.
+calculating the emittance objective from a fast learned model rather than
+directly from the accelerator. We use Multipoint-BAX to minimize emittance at
+the Linac Coherent Light Source (LCLS) and the Facility for Advanced
+Accelerator Experimental Tests II (FACET-II). In simulation, our method is
+20$\times$ faster and more robust to noise compared to existing methods. In
+live tests, it matched the hand-tuned emittance at FACET-II and achieved a 24%
+lower emittance than hand-tuning at LCLS. Our method represents a conceptual
+shift for optimizing multipoint queries, and we anticipate that it can be
+readily adapted to similar problems in particle accelerators and other
+scientific instruments.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 10
+
+
+
+
+
+ ☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
+
+
+ The surge of interest towards Multi-modal Large Language Models (MLLMs),
+e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
+academia and industry. They endow Large Language Models (LLMs) with powerful
+capabilities in visual understanding, enabling them to tackle diverse
+multi-modal tasks. Very recently, Google released Gemini, its newest and most
+capable MLLM built from the ground up for multi-modality. In light of the
+superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
+in multi-modal learning? In this paper, we present a preliminary exploration of
+Gemini Pro's visual understanding proficiency, which comprehensively covers
+four domains: fundamental perception, advanced cognition, challenging vision
+tasks, and various expert capacities. We compare Gemini Pro with the
+state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
+open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
+black-box systems. The qualitative samples indicate that, while GPT-4V and
+Gemini showcase different answering styles and preferences, they can exhibit
+comparable visual reasoning capabilities, and Sphinx still trails behind them
+concerning domain generalizability. Specifically, GPT-4V tends to elaborate
+detailed explanations and intermediate steps, and Gemini prefers to output a
+direct and concise answer. The quantitative evaluation on the popular MME
+benchmark also demonstrates the potential of Gemini to be a strong challenger
+to GPT-4V. Our early investigation of Gemini also observes some common issues
+of MLLMs, indicating that there still remains a considerable distance towards
+artificial general intelligence. Our project for tracking the progress of MLLM
+is released at
+https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
+
+
+
+ comment: Total 120 pages. See our project at
+ https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
+
+ A transcoding scheme for the High Efficiency Video Coding (HEVC) is proposed
+that allows any partial frame modification to be followed by a partial
+re-compression of only the modified areas, while guaranteeing identical
+reconstruction of non-modified areas. To this end, first, syntax elements of
+all Coding Units (CU) in the frame are parsed and decoded according to their
+scan order. Then CUs that are collocated with a replaced area are re-encoded
+with new content to generate a partial set of new syntax elements. In order to
+avoid spatial propagation of the decoding mismatch due to the new content, CUs
+on the border of the replaced area are losslessly coded such that
+reconstruction of immediately neighboring CUs in the scan order are protected
+from the modification. The proposed method has been implemented on top of the
+HEVC test Model (HM) in All-Intra (AI) coding configuration and experiments
+show that, depending on the test parameters, it can offer both a bitrate saving
+(up to 4% in terms of BD-BR) and a transcoding acceleration (up to 83%)
+compared to a full transcoding scheme.
+
+
+
+
+
+
+
+ ☆ Comparative Study of Hardware and Software Power Measurements in Video
+ Compression
+
+
+
+
+
+
+
+
+ Angeliki Katsenou, Xinyi Wang, Daniel Schien, David Bull
+
+
+ The environmental impact of video streaming services has been discussed as
+part of the strategies towards sustainable information and communication
+technologies. A first step towards that is the energy profiling and assessment
+of energy consumption of existing video technologies. This paper presents a
+comprehensive study of power measurement techniques in video compression,
+comparing the use of hardware and software power meters. An experimental
+methodology to ensure reliability of measurements is introduced. Key findings
+demonstrate the high correlation of hardware and software based energy
+measurements for two video codecs across different spatial and temporal
+resolutions at a lower computational overhead.
+
+
+
+ comment: 5 pages
+
+
+
+
+
+
+ ☆ An effective image copy-move forgery detection using entropy image
+
+
+ Image forensics has become increasingly important in our daily lives. As a
+fundamental type of forgeries, Copy-Move Forgery Detection (CMFD) has received
+significant attention in the academic community. Keypoint-based algorithms,
+particularly those based on SIFT, have achieved good results in CMFD. However,
+the most of keypoint detection algorithms often fail to generate sufficient
+matches when tampered patches are present in smooth areas. To tackle this
+problem, we introduce entropy images to determine the coordinates and scales of
+keypoints, resulting significantly increasing the number of keypoints.
+Furthermore, we develop an entropy level clustering algorithm to avoid
+increased matching complexity caused by non-ideal distribution of grayscale
+values in keypoints. Experimental results demonstrate that our algorithm
+achieves a good balance between performance and time efficiency.
+
+
+
+
+
+
+
+ ☆ InstructVideo: Instructing Video Diffusion Models with Human Feedback
+
+
+
+
+
+
+
+
+ Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
+
+
+ Diffusion models have emerged as the de facto paradigm for video generation.
+However, their reliance on web-scale data of varied quality often yields
+results that are visually unappealing and misaligned with the textual prompts.
+To tackle this problem, we propose InstructVideo to instruct text-to-video
+diffusion models with human feedback by reward fine-tuning. InstructVideo has
+two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by
+generating through the full DDIM sampling chain, we recast reward fine-tuning
+as editing. By leveraging the diffusion process to corrupt a sampled video,
+InstructVideo requires only partial inference of the DDIM sampling chain,
+reducing fine-tuning cost while improving fine-tuning efficiency. 2) To
+mitigate the absence of a dedicated video reward model for human preferences,
+we repurpose established image reward models, e.g., HPSv2. To this end, we
+propose Segmental Video Reward, a mechanism to provide reward signals based on
+segmental sparse sampling, and Temporally Attenuated Reward, a method that
+mitigates temporal modeling degradation during fine-tuning. Extensive
+experiments, both qualitative and quantitative, validate the practicality and
+efficacy of using image reward models in InstructVideo, significantly enhancing
+the visual quality of generated videos without compromising generalization
+capabilities. Code and models will be made publicly available.
+
+
+ Recent advances in autonomous robotic technologies have highlighted the
+growing need for precise environmental analysis. LiDAR semantic segmentation
+has gained attention to accomplish fine-grained scene understanding by acting
+directly on raw content provided by sensors. Recent solutions showed how
+different learning techniques can be used to improve the performance of the
+model, without any architectural or dataset change. Following this trend, we
+present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK)
+derived from a standard model. First, classes are clustered into macro groups
+according to mutual prediction errors; then, the learning process is
+regularized by: (1) aligning class-conditional prototypical feature
+representation for both fine and coarse classes, (2) weighting instances with a
+per-class fairness index. Our LEAK approach is very general and can be
+seamlessly applied on top of any segmentation architecture; indeed,
+experimental results showed that it enables state-of-the-art performances on
+different architectures, datasets and tasks, while ensuring more balanced
+class-wise results and faster convergence.
+
+
+ Despite commendable achievements made by existing work, prevailing multimodal
+sarcasm detection studies rely more on textual content over visual information.
+It unavoidably induces spurious correlations between textual words and labels,
+thereby significantly hindering the models' generalization capability. To
+address this problem, we define the task of out-of-distribution (OOD)
+multimodal sarcasm detection, which aims to evaluate models' generalizability
+when the word distribution is different in training and testing settings.
+Moreover, we propose a novel debiasing multimodal sarcasm detection framework
+with contrastive learning, which aims to mitigate the harmful effect of biased
+textual factors for robust OOD generalization. In particular, we first design
+counterfactual data augmentation to construct the positive samples with
+dissimilar word biases and negative samples with similar word biases.
+Subsequently, we devise an adapted debiasing contrastive learning mechanism to
+empower the model to learn robust task-relevant features and alleviate the
+adverse effect of biased words. Extensive experiments show the superiority of
+the proposed framework.
+
+
+ Uveitis demands the precise diagnosis of anterior chamber inflammation (ACI)
+for optimal treatment. However, current diagnostic methods only rely on a
+limited single-modal disease perspective, which leads to poor performance. In
+this paper, we investigate a promising yet challenging way to fuse multimodal
+data for ACI diagnosis. Notably, existing fusion paradigms focus on empowering
+implicit modality interactions (i.e., self-attention and its variants), but
+neglect to inject explicit modality interactions, especially from clinical
+knowledge and imaging property. To this end, we propose a jointly Explicit and
+implicit Cross-Modal Interaction Network (EiCI-Net) for Anterior Chamber
+Inflammation Diagnosis that uses anterior segment optical coherence tomography
+(AS-OCT) images, slit-lamp images, and clinical data jointly. Specifically, we
+first develop CNN-Based Encoders and Tabular Processing Module (TPM) to extract
+efficient feature representations in different modalities. Then, we devise an
+Explicit Cross-Modal Interaction Module (ECIM) to generate attention maps as a
+kind of explicit clinical knowledge based on the tabular feature maps, then
+integrated them into the slit-lamp feature maps, allowing the CNN-Based Encoder
+to focus on more effective informativeness of the slit-lamp images. After that,
+the Implicit Cross-Modal Interaction Module (ICIM), a transformer-based
+network, further implicitly enhances modality interactions. Finally, we
+construct a considerable real-world dataset from our collaborative hospital and
+conduct sufficient experiments to demonstrate the superior performance of our
+proposed EiCI-Net compared with the state-of-the-art classification methods in
+various metrics.
+
+
+
+
+
+
+
+ ♻ ☆ How to Bridge the Gap between Modalities: A Comprehensive Survey on
+ Multimodal Large Language Model
+
+
+ This review paper explores Multimodal Large Language Models (MLLMs), which
+integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data
+such as text and vision. MLLMs demonstrate capabilities like generating image
+narratives and answering image-based questions, bridging the gap towards
+real-world human-computer interactions and hinting at a potential pathway to
+artificial general intelligence. However, MLLMs still face challenges in
+processing the semantic gap in multimodality, which may lead to erroneous
+generation, posing potential risks to society. Choosing the appropriate
+modality alignment method is crucial, as improper methods might require more
+parameters with limited performance improvement. This paper aims to explore
+modality alignment methods for LLMs and their existing capabilities.
+Implementing modality alignment allows LLMs to address environmental issues and
+enhance accessibility. The study surveys existing modal alignment methods in
+MLLMs into four groups: (1) Multimodal Converters that change data into
+something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs
+perceive different types of data; (3) Tools Assistance for changing data into
+one common format, usually text; and (4) Data-Driven methods that teach LLMs to
+understand specific types of data in a dataset. This field is still in a phase
+of exploration and experimentation, and we will organize and update various
+existing research methods for multimodal information alignment.
+
+
+
+
+
+
+
+ ♻ ☆ Comparing the robustness of modern no-reference image- and video-quality
+ metrics to adversarial attacks
+
+
+ Nowadays neural-network-based image- and video-quality metrics show better
+performance compared to traditional methods. However, they also became more
+vulnerable to adversarial attacks that increase metrics' scores without
+improving visual quality. The existing benchmarks of quality metrics compare
+their performance in terms of correlation with subjective quality and
+calculation time. However, the adversarial robustness of image-quality metrics
+is also an area worth researching. In this paper, we analyse modern metrics'
+robustness to different adversarial attacks. We adopted adversarial attacks
+from computer vision tasks and compared attacks' efficiency against 15
+no-reference image/video-quality metrics. Some metrics showed high resistance
+to adversarial attacks which makes their usage in benchmarks safer than
+vulnerable metrics. The benchmark accepts new metrics submissions for
+researchers who want to make their metrics more robust to attacks or to find
+such metrics for their needs. Try our benchmark using pip install
+robustness-benchmark.
+
+
+
+
+
+
+
+
+ Paulo Pirozelli, Marcos M. José, Paulo de Tarso P. Filho, Anarosa A. F. Brandão, Fabio G. Cozman
+
+
+ Logical reasoning is central to complex human activities, such as thinking,
+debating, and planning; it is also a central component of many AI systems as
+well. In this paper, we investigate the extent to which encoder-only
+transformer language models (LMs) can reason according to logical rules. We ask
+whether those LMs can deduce theorems in propositional calculus and first-order
+logic; if their relative success in these problems reflects general logical
+capabilities; and which layers contribute the most to the task. First, we show
+for several encoder-only LMs that they can be trained, to a reasonable degree,
+to determine logical validity on various datasets. Next, by cross-probing
+fine-tuned models on these datasets, we show that LMs have difficulty in
+transferring their putative logical reasoning ability, which suggests that they
+may have learned dataset-specific features, instead of a general capability.
+Finally, we conduct a layerwise probing experiment, which shows that the
+hypothesis classification task is mostly solved through higher layers.
+
+
+
+
+
+
+
+ ☆ Shaping Political Discourse using multi-source News Summarization
+
+
+ Multi-document summarization is the process of automatically generating a
+concise summary of multiple documents related to the same topic. This summary
+can help users quickly understand the key information from a large collection
+of documents. Multi-document summarization systems are more complex than
+single-document summarization systems due to the need to identify and combine
+information from multiple sources. In this paper, we have developed a machine
+learning model that generates a concise summary of a topic from multiple news
+documents. The model is designed to be unbiased by sampling its input equally
+from all the different aspects of the topic, even if the majority of the news
+sources lean one way.
+
+
+
+
+
+
+
+ ☆ Opportunities and Challenges of Applying Large Language Models in
+ Building Energy Efficiency and Decarbonization Studies: An Exploratory
+ Overview
+
+
+ In recent years, the rapid advancement and impressive capabilities of Large
+Language Models (LLMs) have been evident across various domains. This paper
+explores the application, implications, and potential of LLMs in building
+energy efficiency and decarbonization studies. The wide-ranging capabilities of
+LLMs are examined in the context of the building energy field, including
+intelligent control systems, code generation, data infrastructure, knowledge
+extraction, and education. Despite the promising potential of LLMs, challenges
+including complex and expensive computation, data privacy, security and
+copyright, complexity in fine-tuned LLMs, and self-consistency are discussed.
+The paper concludes with a call for future research focused on the enhancement
+of LLMs for domain-specific tasks, multi-modal LLMs, and collaborative research
+between AI and energy experts.
+
+
+
+
+
+
+
+ ☆ Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
+
+
+
+
+
+
+
+
+ Madeleine Grunde-McLaughlin, Michelle S. Lam, Ranjay Krishna, Daniel S. Weld, Jeffrey Heer
+
+
+ LLM chains enable complex tasks by decomposing work into a sequence of
+sub-tasks. Crowdsourcing workflows similarly decompose complex tasks into
+smaller tasks for human crowdworkers. Chains address LLM errors analogously to
+the way crowdsourcing workflows address human error. To characterize
+opportunities for LLM chaining, we survey 107 papers across the crowdsourcing
+and chaining literature to construct a design space for chain development. The
+design space connects an LLM designer's objectives to strategies they can use
+to achieve those objectives, and tactics to implement each strategy. To explore
+how techniques from crowdsourcing may apply to chaining, we adapt crowdsourcing
+workflows to implement LLM chains across three case studies: creating a
+taxonomy, shortening text, and writing a short story. From the design space and
+our case studies, we identify which techniques transfer from crowdsourcing to
+LLM chaining and raise implications for future research and development.
+
+
+
+
+
+
+
+
+ Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
+
+
+ In this report, we explore the ability of language model agents to acquire
+resources, create copies of themselves, and adapt to novel challenges they
+encounter in the wild. We refer to this cluster of capabilities as "autonomous
+replication and adaptation" or ARA. We believe that systems capable of ARA
+could have wide-reaching and hard-to-anticipate consequences, and that
+measuring and forecasting ARA may be useful for informing measures around
+security, monitoring, and alignment. Additionally, once a system is capable of
+ARA, placing bounds on a system's capabilities may become significantly more
+difficult.
+ We construct four simple example agents that combine language models with
+tools that allow them to take actions in the world. We then evaluate these
+agents on 12 tasks relevant to ARA. We find that these language model agents
+can only complete the easiest tasks from this list, although they make some
+progress on the more challenging tasks. Unfortunately, these evaluations are
+not adequate to rule out the possibility that near-future agents will be
+capable of ARA. In particular, we do not think that these evaluations provide
+good assurance that the ``next generation'' of language models (e.g. 100x
+effective compute scaleup on existing models) will not yield agents capable of
+ARA, unless intermediate evaluations are performed during pretraining.
+Relatedly, we expect that fine-tuning of the existing models could produce
+substantially more competent agents, even if the fine-tuning is not directly
+targeted at ARA.
+
+
+
+ comment: 14 pages
+
+
+
+
+
+
+ ☆ Cascade Speculative Drafting for Even Faster LLM Inference
+
+
+ Speculative decoding enhances the efficiency of large language models (LLMs)
+by leveraging a draft model to draft for a larger target model to review.
+However, drafting in speculative decoding involves slow autoregressive
+generation and generating tokens of different importance with the same time
+allocation. These two inefficiencies lead to its suboptimal performance. To
+address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a
+novel approach that employs two types of cascades. The Vertical Cascade
+eliminates autoregressive generation from neural models. The Horizontal Cascade
+constitutes efficient time allocation in drafting with its optimality supported
+by our theoretical analysis. Combining both cascades, our CS. Drafting
+algorithm has achieved up to 72 percent additional speedup over speculative
+decoding in our experiments while keeping the same output distribution.
+
+
+
+
+
+
+
+ ☆ An In-depth Look at Gemini's Language Abilities
+
+
+
+
+
+
+
+
+ Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig
+
+
+ The recently released Google Gemini class of models are the first to
+comprehensively report results that rival the OpenAI GPT series across a wide
+variety of tasks. In this paper, we do an in-depth exploration of Gemini's
+language abilities, making two contributions. First, we provide a third-party,
+objective comparison of the abilities of the OpenAI GPT and Google Gemini
+models with reproducible code and fully transparent results. Second, we take a
+closer look at the results, identifying areas where one of the two model
+classes excels. We perform this analysis over 10 datasets testing a variety of
+language abilities, including reasoning, answering knowledge-based questions,
+solving math problems, translating between languages, generating code, and
+acting as instruction-following agents. From this analysis, we find that Gemini
+Pro achieves accuracy that is close but slightly inferior to the corresponding
+GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations
+for some of this under-performance, including failures in mathematical
+reasoning with many digits, sensitivity to multiple-choice answer ordering,
+aggressive content filtering, and others. We also identify areas where Gemini
+demonstrates comparably high performance, including generation into non-English
+languages, and handling longer and more complex reasoning chains. Code and data
+for reproduction can be found at https://github.com/neulab/gemini-benchmark
+
+
+
+
+
+
+
+ ☆ Social Learning: Towards Collaborative Learning with Large Language
+ Models
+
+
+
+
+
+
+
+
+ Amirkeivan Mohtashami, Florian Hartmann, Sian Gooding, Lukas Zilka, Matt Sharifi, Blaise Aguera y Arcas
+
+
+ We introduce the framework of "social learning" in the context of large
+language models (LLMs), whereby models share knowledge with each other in a
+privacy-aware manner using natural language. We present and evaluate two
+approaches for knowledge transfer between LLMs. In the first scenario, we allow
+the model to generate abstract prompts aiming to teach the task. In our second
+approach, models transfer knowledge by generating synthetic examples. We
+evaluate these methods across diverse datasets and quantify memorization as a
+proxy for privacy loss. These techniques inspired by social learning yield
+promising results with low memorization of the original data. In particular, we
+show that performance using these methods is comparable to results with the use
+of original labels and prompts. Our work demonstrates the viability of social
+learning for LLMs, establishes baseline approaches and highlights several
+unexplored areas for future work.
+
+
+
+
+
+
+
+ ☆ Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM
+ Finetuning
+
+
+ This paper introduces an efficient strategy to transform Large Language
+Models (LLMs) into Multi-Modal Large Language Models (MLLMs). By
+conceptualizing this transformation as a domain adaptation process, i.e.,
+transitioning from text understanding to embracing multiple modalities, we
+intriguingly note that, within each attention block, tuning LayerNorm suffices
+to yield strong performance. Moreover, when benchmarked against other tuning
+approaches like full parameter finetuning or LoRA, its benefits on efficiency
+are substantial. For example, when compared to LoRA on a 13B model scale,
+performance can be enhanced by an average of over 20% across five multi-modal
+tasks, and meanwhile, results in a significant reduction of trainable
+parameters by 41.9% and a decrease in GPU memory usage by 17.6%. On top of this
+LayerNorm strategy, we showcase that selectively tuning only with
+conversational data can improve efficiency further. Beyond these empirical
+outcomes, we provide a comprehensive analysis to explore the role of LayerNorm
+in adapting LLMs to the multi-modal domain and improving the expressive power
+of the model.
+
+
+
+ comment: The first two authors contributed equally
+
+
+
+
+
+
+ ☆ News Signals: An NLP Library for Text and Time Series EMNLP
+
+
+ We present an open-source Python library for building and using datasets
+where inputs are clusters of textual data, and outputs are sequences of real
+values representing one or more time series signals. The news-signals library
+supports diverse data science and NLP problem settings related to the
+prediction of time series behaviour using textual data feeds. For example, in
+the news domain, inputs are document clusters corresponding to daily news
+articles about a particular entity, and targets are explicitly associated
+real-valued time series: the volume of news about a particular person or
+company, or the number of pageviews of specific Wikimedia pages. Despite many
+industry and research use cases for this class of problem settings, to the best
+of our knowledge, News Signals is the only open-source library designed
+specifically to facilitate data science and research settings with natural
+language inputs and time series targets. In addition to the core codebase for
+building and interacting with datasets, we also conduct a suite of experiments
+using several popular Machine Learning libraries, which are used to establish
+baselines for time series anomaly prediction using textual inputs.
+
+
+
+ comment: EMNLP NLP-OSS Workshop, December 2023
+
+
+
+
+
+
+ ☆ Verb Categorisation for Hindi Word Problem Solving
+
+
+ Word problem Solving is a challenging NLP task that deals with solving
+mathematical problems described in natural language. Recently, there has been
+renewed interest in developing word problem solvers for Indian languages. As
+part of this paper, we have built a Hindi arithmetic word problem solver which
+makes use of verbs. Additionally, we have created verb categorization data for
+Hindi. Verbs are very important for solving word problems with
+addition/subtraction operations as they help us identify the set of operations
+required to solve the word problems. We propose a rule-based solver that uses
+verb categorisation to identify operations in a word problem and generate
+answers for it. To perform verb categorisation, we explore several approaches
+and present a comparative study.
+
+
+
+
+
+
+
+ ☆ G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
+
+
+
+
+
+
+
+
+ Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
+
+
+ Large language models (LLMs) have shown remarkable proficiency in human-level
+reasoning and generation capabilities, which encourages extensive research on
+their application in mathematical problem solving. However, current work has
+been largely focused on text-based mathematical problems, with limited
+investigation in problems involving geometric information. Addressing this gap,
+we aim to enable LLMs to solve geometric problems by understanding image input.
+We first analyze the limitations of current Multimodal Large Language Models
+(MLLMs) in this area: they struggle to accurately comprehending basic geometric
+elements and their relationships. To overcome these challenges, we take
+advantage of the unique characteristics of geometric problems (such as unique
+geometric logical form, and geometric scalability) and the capacity of the
+textual LLMs to build an enriched multimodal geometry dataset based on existing
+data. The augmented dataset, Geo170K, contains more than 170K geometric
+image-caption and question-answer pairs. Utilizing our constructed Geo170K
+dataset, we develop G-LLaVA, which demonstrates exceptional performance in
+solving geometric problems, significantly outperforming GPT-4-V on the
+MathVista benchmark with only 7B parameters.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ NoMIRACL: Knowing When You Don't Know for Robust Multilingual
+ Retrieval-Augmented Generation
+
+
+
+
+
+
+
+
+ Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin
+
+
+ Retrieval-augmented generation (RAG) grounds large language model (LLM)
+output by leveraging external knowledge sources to reduce factual
+hallucinations. However, prior works lack a comprehensive evaluation of
+different language families, making it challenging to evaluate LLM robustness
+against errors in external retrieved knowledge. To overcome this, we establish
+NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across
+18 typologically diverse languages. NoMIRACL includes both a non-relevant and a
+relevant subset. Queries in the non-relevant subset contain passages manually
+judged as non-relevant or noisy, whereas queries in the relevant subset include
+at least a single judged relevant passage. We measure LLM robustness using two
+metrics: (i) hallucination rate, measuring model tendency to hallucinate an
+answer, when the answer is not present in passages in the non-relevant subset,
+and (ii) error rate, measuring model inaccuracy to recognize relevant passages
+in the relevant subset. We build a GPT-4 baseline which achieves a 33.2%
+hallucination rate on the non-relevant and a 14.9% error rate on the relevant
+subset on average. Our evaluation reveals that GPT-4 hallucinates frequently in
+high-resource languages, such as French or English. This work highlights an
+important avenue for future research to improve LLM robustness to learn how to
+better reject non-relevant information in RAG.
+
+
+
+
+
+
+
+ ☆ The Problem of Coherence in Natural Language Explanations of
+ Recommendations ECAI 2023
+
+
+
+
+
+
+
+
+ Jakub Raczyński, Mateusz Lango, Jerzy Stefanowski
+
+
+ Providing natural language explanations for recommendations is particularly
+useful from the perspective of a non-expert user. Although several methods for
+providing such explanations have recently been proposed, we argue that an
+important aspect of explanation quality has been overlooked in their
+experimental evaluation. Specifically, the coherence between generated text and
+predicted rating, which is a necessary condition for an explanation to be
+useful, is not properly captured by currently used evaluation measures. In this
+paper, we highlight the issue of explanation and prediction coherence by 1)
+presenting results from a manual verification of explanations generated by one
+of the state-of-the-art approaches 2) proposing a method of automatic coherence
+evaluation 3) introducing a new transformer-based method that aims to produce
+more coherent explanations than the state-of-the-art approaches 4) performing
+an experimental evaluation which demonstrates that this method significantly
+improves the explanation coherence without affecting the other aspects of
+recommendation performance.
+
+
+
+ comment: ECAI 2023
+
+
+
+
+
+
+ ☆ Implicit Affordance Acquisition via Causal Action-Effect Modeling in the
+ Video Domain AACL 2023
+
+
+ Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent
+findings indicate that world knowledge emerges through large-scale
+self-supervised pretraining, motivating our exploration of acquiring affordance
+knowledge from the visual domain. To this end, we augment an existing
+instructional video resource to create the new Causal Action-Effect (CAE)
+dataset and design two novel pretraining tasks -- Masked Action Modeling (MAM)
+and Masked Effect Modeling (MEM) -- promoting the acquisition of two affordance
+properties in models: behavior and entity equivalence, respectively. We
+empirically demonstrate the effectiveness of our proposed methods in learning
+affordance properties. Furthermore, we show that a model pretrained on both
+tasks outperforms a strong image-based visual-linguistic foundation model
+(FLAVA) as well as pure linguistic models on a zero-shot physical reasoning
+probing task.
+
+
+
+
+
+
+
+
+ Christoph Tillmann, Aashka Trivedi, Sara Rosenthal, Santosh Borse, Rong Zhang, Avirup Sil, Bishwaranjan Bhattacharjee
+
+
+ Offensive language such as hate, abuse, and profanity (HAP) occurs in various
+content on the web. While previous work has mostly dealt with sentence level
+annotations, there have been a few recent attempts to identify offensive spans
+as well. We build upon this work and introduce Muted, a system to identify
+multilingual HAP content by displaying offensive arguments and their targets
+using heat maps to indicate their intensity. Muted can leverage any
+transformer-based HAP-classification model and its attention mechanism
+out-of-the-box to identify toxic spans, without further fine-tuning. In
+addition, we use the spaCy library to identify the specific targets and
+arguments for the words predicted by the attention heatmaps. We present the
+model's performance on identifying offensive spans and their targets in
+existing datasets and present new annotations on German text. Finally, we
+demonstrate our proposed visualization tool on multilingual inputs.
+
+
+
+
+
+
+
+ ☆ APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT
+ Training Data Creation
+
+
+ Automatic Post-Editing (APE) is the task of automatically identifying and
+correcting errors in the Machine Translation (MT) outputs. We propose a
+repair-filter-use methodology that uses an APE system to correct errors on the
+target side of the MT training data. We select the sentence pairs from the
+original and corrected sentence pairs based on the quality scores computed
+using a Quality Estimation (QE) model. To the best of our knowledge, this is a
+novel adaptation of APE and QE to extract quality parallel corpus from the
+pseudo-parallel corpus. By training with this filtered corpus, we observe an
+improvement in the Machine Translation system's performance by 5.64 and 9.91
+BLEU points, for English-Marathi and Marathi-English, over the baseline model.
+The baseline model is the one that is trained on the whole pseudo-parallel
+corpus. Our work is not limited by the characteristics of English or Marathi
+languages; and is language pair-agnostic, given the necessary QE and APE data.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2306.03507
+
+
+
+
+
+
+ ☆ From Generalized Laughter to Personalized Chuckles: Unleashing the Power
+ of Data Fusion in Subjective Humor Detection
+
+
+ The vast area of subjectivity in Natural Language Processing (NLP) poses a
+challenge to the solutions typically used in generalized tasks. As exploration
+in the scope of generalized NLP is much more advanced, it implies the
+tremendous gap that is still to be addressed amongst all feasible tasks where
+an opinion, taste, or feelings are inherent, thus creating a need for a
+solution, where a data fusion could take place. We have chosen the task of
+funniness, as it heavily relies on the sense of humor, which is fundamentally
+subjective. Our experiments across five personalized and four generalized
+datasets involving several personalized deep neural architectures have shown
+that the task of humor detection greatly benefits from the inclusion of
+personalized data in the training process. We tested five scenarios of training
+data fusion that focused on either generalized (majority voting) or
+personalized approaches to humor detection. The best results were obtained for
+the setup, in which all available personalized datasets were joined to train
+the personalized reasoning model. It boosted the prediction performance by up
+to approximately 35% of the macro F1 score. Such a significant gain was
+observed for all five personalized test sets. At the same time, the impact of
+the model's architecture was much less than the personalization itself. It
+seems that concatenating personalized datasets, even with the cost of
+normalizing the range of annotations across all datasets, if combined with the
+personalized models, results in an enormous increase in the performance of
+humor detection.
+
+
+
+ comment: 10 pages, 13 figures, 2 tables
+
+
+
+
+
+
+ ☆ LLM-ARK: Knowledge Graph Reasoning Using Large Language Models via Deep
+ Reinforcement Learning
+
+
+ With the evolution of pre-training methods, large language models (LLMs) have
+exhibited exemplary reasoning capabilities via prompt engineering. However, the
+absence of Knowledge Graph (KG) environment awareness and the challenge of
+engineering viable optimization mechanisms for intermediary reasoning
+processes, constrict the performance of LLMs on KG reasoning tasks compared to
+smaller models. We introduce LLM-ARK, a LLM grounded KG reasoning agent
+designed to deliver precise and adaptable predictions on KG paths. LLM-ARK
+utilizes Full Textual Environment (FTE) prompts to assimilate state information
+for each step-sized intelligence. Leveraging LLMs to richly encode and
+represent various types of inputs and integrate the knowledge graph further
+with path environment data, before making the final decision. Reframing the
+Knowledge Graph (KG) multi-hop inference problem as a sequential
+decision-making issue, we optimize our model using the Proximal Policy
+Optimization (PPO) online policy gradient reinforcement learning algorithm
+which allows the model to learn from a vast array of reward signals across
+diverse tasks and environments. We evaluate state-of-the-art LLM(GPT-4) and our
+method which using open-source models of varying sizes on OpenDialKG dataset.
+Our experiment shows that LLaMA7B-ARK provides excellent results with a
+performance rate of 48.75% for the target@1 evaluation metric, far exceeding
+the current state-of-the-art model by 17.64 percentage points. Meanwhile, GPT-4
+accomplished a score of only 14.91%, further highlighting the efficacy and
+complexity of our methodology. Our code is available on GitHub for further
+access.
+
+
+
+
+
+
+
+ ☆ Disentangling continuous and discrete linguistic signals in
+ transformer-based sentence embeddings
+
+
+ Sentence and word embeddings encode structural and semantic information in a
+distributed manner. Part of the information encoded -- particularly lexical
+information -- can be seen as continuous, whereas other -- like structural
+information -- is most often discrete. We explore whether we can compress
+transformer-based sentence embeddings into a representation that separates
+different linguistic signals -- in particular, information relevant to
+subject-verb agreement and verb alternations. We show that by compressing an
+input sequence that shares a targeted phenomenon into the latent layer of a
+variational autoencoder-like system, the targeted linguistic information
+becomes more explicit. A latent layer with both discrete and continuous
+components captures better the targeted phenomena than a latent layer with only
+discrete or only continuous components. These experiments are a step towards
+separating linguistic signals from distributed text embeddings and linking them
+to more symbolic representations.
+
+
+ Recent advancements in Text-to-SQL methods employing Large Language Models
+(LLMs) have demonstrated remarkable performance. Nonetheless, these approaches
+continue to encounter difficulties when handling extensive databases, intricate
+user queries, and erroneous SQL results. To tackle these challenges, we present
+\textbf{MAC-SQL}, a LLM-based multi-agent collaborative Text- to-SQL framework
+based on LLMs. This framework comprises three agents: the \textit{Selector},
+accountable for condensing voluminous databases and preserving relevant table
+schemas for user questions; the \textit{Decomposer}, which disassembles complex
+user questions into more straightforward sub-problems and resolves them
+progressively; and the \textit{Refiner}, tasked with validating and refining
+defective SQL queries. We perform thorough experiments on two Text-to-SQL
+datasets, BIRD and Spider, attaining a state-of-the-art execution accuracy of
+59.59\% on the BIRD test set. Moreover, we have open-sourced an instruction
+fine-tuning model, \textbf{SQL-Llama}, based on Code Llama 7B, in addition to
+an agent instruction dataset derived from training data based on BIRD and
+Spider. The SQL-Llama model has demonstrated encouraging outcomes on the
+development sets of both BIRD and Spider. However, when compared to the GPT-4
+model, there remains a notable potential for enhancement. Our code and data can
+be accessed publicly at
+\href{https://github.com/wbbeyourself/MAC-SQL}{https://github.com/wbbeyourself/MAC-SQL}.
+
+
+
+
+
+
+
+
+ Kun Peng, Lei Jiang, Hao Peng, Rui Liu, Zhengtao Yu, Jiaqian Ren, Zhifeng Hao, Philip S. Yu
+
+
+ Aspect Sentiment Triplet Extraction (ASTE) is an emerging task to extract a
+given sentence's triplets, which consist of aspects, opinions, and sentiments.
+Recent studies tend to address this task with a table-filling paradigm, wherein
+word relations are encoded in a two-dimensional table, and the process involves
+clarifying all the individual cells to extract triples. However, these studies
+ignore the deep interaction between neighbor cells, which we find quite helpful
+for accurate extraction. To this end, we propose a novel model for the ASTE
+task, called Prompt-based Tri-Channel Graph Convolution Neural Network
+(PT-GCN), which converts the relation table into a graph to explore more
+comprehensive relational information. Specifically, we treat the original table
+cells as nodes and utilize a prompt attention score computation module to
+determine the edges' weights. This enables us to construct a target-aware
+grid-like graph to enhance the overall extraction process. After that, a
+triple-channel convolution module is conducted to extract precise sentiment
+knowledge. Extensive experiments on the benchmark datasets show that our model
+achieves state-of-the-art performance. The code is available at
+https://github.com/KunPunCN/PT-GCN.
+
+
+
+ comment: Accepted in SIAM International Conference on Data Mining (SDM24)
+
+ Self-supervised learning enables the training of large neural models without
+the need for large, labeled datasets. It has been generating breakthroughs in
+several fields, including computer vision, natural language processing,
+biology, and speech. In particular, the state-of-the-art in several speech
+processing applications, such as automatic speech recognition or speaker
+identification, are models where the latent representation is learned using
+self-supervised approaches. Several configurations exist in self-supervised
+learning for speech, including contrastive, predictive, and multilingual
+approaches. There is, however, a crucial limitation in most existing
+approaches: their high computational costs. These costs limit the deployment of
+models, the size of the training dataset, and the number of research groups
+that can afford research with large self-supervised models. Likewise, we should
+consider the environmental costs that high energy consumption implies. Efforts
+in this direction comprise optimization of existing models, neural architecture
+efficiency, improvements in finetuning for speech processing tasks, and data
+efficiency. But despite current efforts, more work could be done to address
+high computational costs in self-supervised representation learning.
+
+
+
+ comment: 16 pages, 3 figures
+
+
+
+
+
+
+ ☆ Linear Attention via Orthogonal Memory
+
+
+
+
+
+
+
+
+ Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, Lingpeng Kong
+
+
+ Efficient attentions have greatly improved the computational efficiency of
+Transformers. However, most existing linear attention mechanisms suffer from an
+\emph{efficiency degradation} problem, leading to inefficiencies in causal
+language modeling and hindering their application in long-range language
+models. This problem is more pronounced under language modeling with unbounded
+contexts. In this paper, we propose \textbf{L}inear \textbf{A}ttention
+\textbf{V}ia \textbf{O}rthogonal memory~(\shortname) to address these
+limitations, achieving strong performance while maintaining linear complexity.
+\shortname employs orthogonal decomposition to compress a context into a
+fixed-size orthogonal memory while effectively minimizing redundancy within the
+context. Given that orthogonal memory compresses global information, we further
+dissect the context to amplify fine-grained local information. Additionally, we
+embed the relative position encoding into \shortname to improve the
+extrapolation ability. Experimental results show that \shortname greatly
+improves the efficiency of the causal language model with the best
+extrapolation performance and outperforms other efficient baselines. Further,
+we endeavor to employ \shortname for unbounded language modeling and
+successfully scale the context length to 128K.
+
+
+
+ comment: 16 pages
+
+
+
+
+
+
+ ☆ Patterns of Closeness and Abstractness in Colexifications: The Case of
+ Indigenous Languages in the Americas
+
+
+ Colexification refers to linguistic phenomena where multiple concepts
+(meanings) are expressed by the same lexical form, such as polysemy or
+homophony. Colexifications have been found to be pervasive across languages and
+cultures. The problem of concreteness/abstractness of concepts is
+interdisciplinary, studied from a cognitive standpoint in linguistics,
+psychology, psycholinguistics, neurophysiology, etc. In this paper, we
+hypothesize that concepts that are closer in concreteness/abstractness are more
+likey to colexify, and we test the hypothesis across indigenous languages in
+Americas.
+
+
+ Relation extraction is essentially a text classification problem, which can
+be tackled by fine-tuning a pre-trained language model (LM). However, a key
+challenge arises from the fact that relation extraction cannot
+straightforwardly be reduced to sequence or token classification. Existing
+approaches therefore solve the problem in an indirect way: they fine-tune an LM
+to learn embeddings of the head and tail entities, and then predict the
+relationship from these entity embeddings. Our hypothesis in this paper is that
+relation extraction models can be improved by capturing relationships in a more
+direct way. In particular, we experiment with appending a prompt with a [MASK]
+token, whose contextualised representation is treated as a relation embedding.
+While, on its own, this strategy significantly underperforms the aforementioned
+approach, we find that the resulting relation embeddings are highly
+complementary to what is captured by embeddings of the head and tail entity. By
+jointly considering both types of representations, we end up with a simple
+model that outperforms the state-of-the-art across several relation extraction
+benchmarks.
+
+
+
+
+
+
+
+ ☆ TDeLTA: A Light-weight and Robust Table Detection Method based on
+ Learning Text Arrangement AAAI 2024
+
+
+ The diversity of tables makes table detection a great challenge, leading to
+existing models becoming more tedious and complex. Despite achieving high
+performance, they often overfit to the table style in training set, and suffer
+from significant performance degradation when encountering out-of-distribution
+tables in other domains. To tackle this problem, we start from the essence of
+the table, which is a set of text arranged in rows and columns. Based on this,
+we propose a novel, light-weighted and robust Table Detection method based on
+Learning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as
+input, and then models the arrangement of them with a sequential encoder and an
+attention module. To locate the tables precisely, we design a
+text-classification task, classifying the text blocks into 4 categories
+according to their semantic roles in the tables. Experiments are conducted on
+both the text blocks parsed from PDF and extracted by open-source OCR tools,
+respectively. Compared to several state-of-the-art methods, TDeLTA achieves
+competitive results with only 3.1M model parameters on the large-scale public
+datasets. Moreover, when faced with the cross-domain data under the 0-shot
+setting, TDeLTA outperforms baselines by a large margin of nearly 7%, which
+shows the strong robustness and transferability of the proposed model.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ☆ UniGen: A Unified Generative Framework for Retrieval and Question
+ Answering with Large Language Models
+
+
+ Generative information retrieval, encompassing two major tasks of Generative
+Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained
+significant attention in the area of information retrieval and natural language
+processing. Existing methods for GDR and GAR rely on separate retrieval and
+reader modules, which hinder simultaneous optimization. To overcome this, we
+present \textbf{UniGen}, a \textbf{Uni}fied \textbf{Gen}erative framework for
+retrieval and question answering that integrates both tasks into a single
+generative model leveraging the capabilities of large language models. UniGen
+employs a shared encoder and two distinct decoders for generative retrieval and
+question answering. To facilitate the learning of both tasks, we introduce
+connectors, generated by large language models, to bridge the gaps between
+query inputs and generation targets, as well as between document identifiers
+and answers. Furthermore, we propose an iterative enhancement strategy that
+leverages generated answers and retrieved documents to iteratively improve both
+tasks. Through extensive experiments on the MS MARCO and NQ datasets, we
+demonstrate the effectiveness of UniGen, showcasing its superior performance in
+both the retrieval and the question answering tasks.
+
+
+
+
+
+
+
+ ☆ Information Type Classification with Contrastive Task-Specialized
+ Sentence Encoders
+
+
+ User-generated information content has become an important information source
+in crisis situations. However, classification models suffer from noise and
+event-related biases which still poses a challenging task and requires
+sophisticated task-adaptation. To address these challenges, we propose the use
+of contrastive task-specialized sentence encoders for downstream
+classification. We apply the task-specialization on the CrisisLex, HumAID, and
+TrecIS information type classification tasks and show performance gains w.r.t.
+F1-score. Furthermore, we analyse the cross-corpus and cross-lingual
+capabilities for two German event relevancy classification datasets.
+
+
+
+ comment: Accepted at KONVENS 2023
+
+
+
+
+
+
+ ☆ VinaLLaMA: LLaMA-based Vietnamese Foundation Model
+
+
+ In this technical report, we present VinaLLaMA, an open-weight,
+state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built
+upon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA not only
+demonstrates fluency in Vietnamese but also exhibits a profound understanding
+of Vietnamese culture, making it a truly indigenous model. VinaLLaMA-7B-chat,
+trained on 1 million high-quality synthetic samples, achieves SOTA results on
+key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking
+a significant advancement in the Vietnamese AI landscape and offering a
+versatile resource for various applications.
+
+
+ Large language models (LLMs) demonstrate powerful capabilities, but they
+still face challenges in practical applications, such as hallucinations, slow
+knowledge updates, and lack of transparency in answers. Retrieval-Augmented
+Generation (RAG) refers to the retrieval of relevant information from external
+knowledge bases before answering questions with LLMs. RAG has been demonstrated
+to significantly enhance answer accuracy, reduce model hallucination,
+particularly for knowledge-intensive tasks. By citing sources, users can verify
+the accuracy of answers and increase trust in model outputs. It also
+facilitates knowledge updates and the introduction of domain-specific
+knowledge. RAG effectively combines the parameterized knowledge of LLMs with
+non-parameterized external knowledge bases, making it one of the most important
+methods for implementing large language models. This paper outlines the
+development paradigms of RAG in the era of LLMs, summarizing three paradigms:
+Naive RAG, Advanced RAG, and Modular RAG. It then provides a summary and
+organization of the three main components of RAG: retriever, generator, and
+augmentation methods, along with key technologies in each component.
+Furthermore, it discusses how to evaluate the effectiveness of RAG models,
+introducing two evaluation methods for RAG, emphasizing key metrics and
+abilities for evaluation, and presenting the latest automatic evaluation
+framework. Finally, potential future research directions are introduced from
+three aspects: vertical optimization, horizontal scalability, and the technical
+stack and ecosystem of RAG.
+
+
+
+
+
+
+
+ ☆ Data Contamination Issues in Brain-to-Text Decoding
+
+
+ Decoding non-invasive cognitive signals to natural language has long been the
+goal of building practical brain-computer interfaces (BCIs). Recent major
+milestones have successfully decoded cognitive signals like functional Magnetic
+Resonance Imaging (fMRI) and electroencephalogram (EEG) into text under open
+vocabulary setting. However, how to split the datasets for training,
+validating, and testing in cognitive signal decoding task still remains
+controversial. In this paper, we conduct systematic analysis on current dataset
+splitting methods and find the existence of data contamination largely
+exaggerates model performance. Specifically, first we find the leakage of test
+subjects' cognitive signals corrupts the training of a robust encoder. Second,
+we prove the leakage of text stimuli causes the auto-regressive decoder to
+memorize information in test set. The decoder generates highly accurate text
+not because it truly understands cognitive signals. To eliminate the influence
+of data contamination and fairly evaluate different models' generalization
+ability, we propose a new splitting method for different types of cognitive
+datasets (e.g. fMRI, EEG). We also test the performance of SOTA Brain-to-Text
+decoding models under the proposed dataset splitting paradigm as baselines for
+further research.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ Knowledge Graphs and Pre-trained Language Models enhanced Representation
+ Learning for Conversational Recommender Systems
+
+
+
+
+
+
+
+
+ Zhangchi Qiu, Ye Tao, Shirui Pan, Alan Wee-Chung Liew
+
+
+ Conversational recommender systems (CRS) utilize natural language
+interactions and dialogue history to infer user preferences and provide
+accurate recommendations. Due to the limited conversation context and
+background knowledge, existing CRSs rely on external sources such as knowledge
+graphs to enrich the context and model entities based on their inter-relations.
+However, these methods ignore the rich intrinsic information within entities.
+To address this, we introduce the Knowledge-Enhanced Entity Representation
+Learning (KERL) framework, which leverages both the knowledge graph and a
+pre-trained language model to improve the semantic understanding of entities
+for CRS. In our KERL framework, entity textual descriptions are encoded via a
+pre-trained language model, while a knowledge graph helps reinforce the
+representation of these entities. We also employ positional encoding to
+effectively capture the temporal information of entities in a conversation. The
+enhanced entity representation is then used to develop a recommender component
+that fuses both entity and contextual representations for more informed
+recommendations, as well as a dialogue component that generates informative
+entity-related information in the response text. A high-quality knowledge graph
+with aligned entity descriptions is constructed to facilitate our study, namely
+the Wiki Movie Knowledge Graph (WikiMKG). The experimental results show that
+KERL achieves state-of-the-art results in both recommendation and response
+generation tasks.
+
+
+
+
+
+
+
+ ☆ Generative linguistic representation for spoken language identification
+
+
+ Effective extraction and application of linguistic features are central to
+the enhancement of spoken Language IDentification (LID) performance. With the
+success of recent large models, such as GPT and Whisper, the potential to
+leverage such pre-trained models for extracting linguistic features for LID
+tasks has become a promising area of research. In this paper, we explore the
+utilization of the decoder-based network from the Whisper model to extract
+linguistic features through its generative mechanism for improving the
+classification accuracy in LID tasks. We devised two strategies - one based on
+the language embedding method and the other focusing on direct optimization of
+LID outputs while simultaneously enhancing the speech recognition tasks. We
+conducted experiments on the large-scale multilingual datasets MLS,
+VoxLingua107, and CommonVoice to test our approach. The experimental results
+demonstrated the effectiveness of the proposed method on both in-domain and
+out-of-domain datasets for LID tasks.
+
+
+ Aspect-based sentiment analysis (ABSA), a fine-grained sentiment
+classification task, has received much attention recently. Many works
+investigate sentiment information through opinion words, such as ''good'' and
+''bad''. However, implicit sentiment widely exists in the ABSA dataset, which
+refers to the sentence containing no distinct opinion words but still expresses
+sentiment to the aspect term. To deal with implicit sentiment, this paper
+proposes an ABSA method that integrates explicit sentiment augmentations. And
+we propose an ABSA-specific augmentation method to create such augmentations.
+Specifically, we post-trains T5 by rule-based data. We employ Syntax Distance
+Weighting and Unlikelihood Contrastive Regularization in the training procedure
+to guide the model to generate an explicit sentiment. Meanwhile, we utilize the
+Constrained Beam Search to ensure the augmentation sentence contains the aspect
+terms. We test ABSA-ESA on two of the most popular benchmarks of ABSA. The
+results show that ABSA-ESA outperforms the SOTA baselines on implicit and
+explicit sentiment accuracy.
+
+
+ Multi-talker overlapped speech recognition remains a significant challenge,
+requiring not only speech recognition but also speaker diarization tasks to be
+addressed. In this paper, to better address these tasks, we first introduce
+speaker labels into an autoregressive transformer-based speech recognition
+model to support multi-speaker overlapped speech recognition. Then, to improve
+speaker diarization, we propose a novel speaker mask branch to detection the
+speech segments of individual speakers. With the proposed model, we can perform
+both speech recognition and speaker diarization tasks simultaneously using a
+single model. Experimental results on the LibriSpeech-based overlapped dataset
+demonstrate the effectiveness of the proposed method in both speech recognition
+and speaker diarization tasks, particularly enhancing the accuracy of speaker
+diarization in relatively complex multi-talker scenarios.
+
+
+
+
+
+
+
+ ☆ Soft Alignment of Modality Space for End-to-end Speech Translation ICASSP2024
+
+
+ End-to-end Speech Translation (ST) aims to convert speech into target text
+within a unified model. The inherent differences between speech and text
+modalities often impede effective cross-modal and cross-lingual transfer.
+Existing methods typically employ hard alignment (H-Align) of individual speech
+and text segments, which can degrade textual representations. To address this,
+we introduce Soft Alignment (S-Align), using adversarial training to align the
+representation spaces of both modalities. S-Align creates a modality-invariant
+space while preserving individual modality quality. Experiments on three
+languages from the MuST-C dataset show S-Align outperforms H-Align across
+multiple tasks and offers translation capabilities on par with specialized
+translation models.
+
+
+
+ comment: Accepted to ICASSP2024
+
+
+
+
+
+
+ ☆ Regularized Conditional Alignment for Multi-Domain Text Classification ICASSP 2024
+
+
+ The most successful multi-domain text classification (MDTC) approaches employ
+the shared-private paradigm to facilitate the enhancement of domain-invariant
+features through domain-specific attributes. Additionally, they employ
+adversarial training to align marginal feature distributions. Nevertheless,
+these methodologies encounter two primary challenges: (1) Neglecting
+class-aware information during adversarial alignment poses a risk of
+misalignment; (2) The limited availability of labeled data across multiple
+domains fails to ensure adequate discriminative capacity for the model. To
+tackle these issues, we propose a method called Regularized Conditional
+Alignment (RCA) to align the joint distributions of domains and classes, thus
+matching features within the same category and amplifying the discriminative
+qualities of acquired features. Moreover, we employ entropy minimization and
+virtual adversarial training to constrain the uncertainty of predictions
+pertaining to unlabeled data and enhance the model's robustness. Empirical
+results on two benchmark datasets demonstrate that our RCA approach outperforms
+state-of-the-art MDTC techniques.
+
+
+
+ comment: This paper has been accepted by ICASSP 2024
+
+ We introduce a language-grounded visual prompting method to adapt the visual
+encoder of vision-language models for downstream tasks. By capitalizing on
+language integration, we devise a parameter-efficient strategy to adjust the
+input of the visual encoder, eliminating the need to modify or add to the
+model's parameters. Due to this design choice, our algorithm can operate even
+in black-box scenarios, showcasing adaptability in situations where access to
+the model's parameters is constrained. We will empirically demonstrate that,
+compared to prior art, grounding visual prompts with language enhances both the
+accuracy and speed of adaptation. Moreover, our algorithm excels in
+base-to-novel class generalization, overcoming limitations of visual prompting
+and exhibiting the capacity to generalize beyond seen classes. We thoroughly
+assess and evaluate our method across a variety of image recognition datasets,
+such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning
+situations, including few-shot learning, base-to-novel class generalization,
+and transfer learning.
+
+
+
+ comment: The 38th Annual AAAI Conference on Artificial Intelligence
+
+
+
+
+
+
+ ☆ Satellite Captioning: Large Language Models to Augment Labeling
+
+
+ With the growing capabilities of modern object detection networks and
+datasets to train them, it has gotten more straightforward and, importantly,
+less laborious to get up and running with a model that is quite adept at
+detecting any number of various objects. However, while image datasets for
+object detection have grown and continue to proliferate (the current most
+extensive public set, ImageNet, contains over 14m images with over 14m
+instances), the same cannot be said for textual caption datasets. While they
+have certainly been growing in recent years, caption datasets present a much
+more difficult challenge due to language differences, grammar, and the time it
+takes for humans to generate them. Current datasets have certainly provided
+many instances to work with, but it becomes problematic when a captioner may
+have a more limited vocabulary, one may not be adequately fluent in the
+language, or there are simple grammatical mistakes. These difficulties are
+increased when the images get more specific, such as remote sensing images.
+This paper aims to address this issue of potential information and
+communication shortcomings in caption datasets. To provide a more precise
+analysis, we specify our domain of images to be remote sensing images in the
+RSICD dataset and experiment with the captions provided here. Our findings
+indicate that ChatGPT grammar correction is a simple and effective way to
+increase the performance accuracy of caption models by making data captions
+more diverse and grammatically correct.
+
+
+
+ comment: 9 pages, 4 figures, 4 tables
+
+
+
+
+
+
+ ☆ Generalized Category Discovery with Large Language Models in the Loop
+
+
+ Generalized Category Discovery (GCD) is a crucial task that aims to recognize
+both known and novel categories from a set of unlabeled data by utilizing a few
+labeled data with only known categories. Due to the lack of supervision and
+category information, current methods usually perform poorly on novel
+categories and struggle to reveal semantic meanings of the discovered clusters,
+which limits their applications in the real world. To mitigate above issues, we
+propose Loop, an end-to-end active-learning framework that introduces Large
+Language Models (LLMs) into the training loop, which can boost model
+performance and generate category names without relying on any human efforts.
+Specifically, we first propose Local Inconsistent Sampling (LIS) to select
+samples that have a higher probability of falling to wrong clusters, based on
+neighborhood prediction consistency and entropy of cluster assignment
+probabilities. Then we propose a Scalable Query strategy to allow LLMs to
+choose true neighbors of the selected samples from multiple candidate samples.
+Based on the feedback from LLMs, we perform Refined Neighborhood Contrastive
+Learning (RNCL) to pull samples and their neighbors closer to learn
+clustering-friendly representations. Finally, we select representative samples
+from clusters corresponding to novel categories to allow LLMs to generate
+category names for them. Extensive experiments on three benchmark datasets show
+that Loop outperforms SOTA models by a large margin and generates accurate
+category names for the discovered clusters. We will release our code and data
+after publication.
+
+
+
+ comment: Preprint
+
+
+
+
+
+
+ ☆ Towards Better Serialization of Tabular Data for Few-shot Classification
+
+
+ We present a study on the integration of Large Language Models (LLMs) in
+tabular data classification, emphasizing an efficient framework. Building upon
+existing work done in TabLLM (arXiv:2210.10723), we introduce three novel
+serialization techniques, including the standout LaTeX serialization method.
+This method significantly boosts the performance of LLMs in processing
+domain-specific datasets, Our method stands out for its memory efficiency and
+ability to fully utilize complex data structures. Through extensive
+experimentation, including various serialization approaches like feature
+combination and importance, we demonstrate our work's superiority in accuracy
+and efficiency over traditional models.
+
+
+
+ comment: 4 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based
+ Language Models
+
+
+
+
+
+
+
+
+ Shengyao Zhang, Mi Zhang, Xudong Pan, Min Yang
+
+
+ To reduce the computation cost and the energy consumption in large language
+models (LLM), skimming-based acceleration dynamically drops unimportant tokens
+of the input sequence progressively along layers of the LLM while preserving
+the tokens of semantic importance. However, our work for the first time reveals
+the acceleration may be vulnerable to Denial-of-Service (DoS) attacks. In this
+paper, we propose No-Skim, a general framework to help the owners of
+skimming-based LLM to understand and measure the robustness of their
+acceleration scheme. Specifically, our framework searches minimal and
+unnoticeable perturbations at character-level and token-level to generate
+adversarial inputs that sufficiently increase the remaining token ratio, thus
+increasing the computation cost and energy consumption. We systematically
+evaluate the vulnerability of the skimming acceleration in various LLM
+architectures including BERT and RoBERTa on the GLUE benchmark. In the worst
+case, the perturbation found by No-Skim substantially increases the running
+cost of LLM by over 145% on average. Moreover, No-Skim extends the evaluation
+framework to various scenarios, making the evaluation conductible with
+different level of knowledge.
+
+
+
+
+
+
+
+ ♻ ☆ Teaching Specific Scientific Knowledge into Large Language Models
+ through Additional Training
+
+
+
+
+
+
+
+
+ Kan Hatakeyama-Sato, Yasuhiko Igarashi, Shun Katakami, Yuta Nabae, Teruaki Hayakawa
+
+
+ Through additional training, we explore embedding specialized scientific
+knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that
+effective knowledge integration requires reading texts from multiple
+perspectives, especially in instructional formats. We utilize text augmentation
+to tackle the scarcity of specialized texts, including style conversions and
+translations. Hyperparameter optimization proves crucial, with different size
+models (7b, 13b, and 70b) reasonably undergoing additional training. Validating
+our methods, we construct a dataset of 65,000 scientific papers. Although we
+have succeeded in partially embedding knowledge, the study highlights the
+complexities and limitations of incorporating specialized information into
+LLMs, suggesting areas for further improvement.
+
+
+
+ comment: added token information for some texts, and fixed typo
+
+
+
+
+
+
+ ♻ ☆ LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World
+ Knowledge in Language Model Alignment
+
+
+ Supervised fine-tuning (SFT) is a crucial step for large language models
+(LLMs), enabling them to align with human instructions and enhance their
+capabilities in downstream tasks. When the models are required to align with a
+broader range of downstream tasks, or there is a desire to notably improve the
+performance on a specific task, a substantial increase in fine-tuning data
+often emerges as the solution. However, we find that large-scale increases in
+instruction data can disrupt the world knowledge previously stored in the LLMs,
+i.e., world knowledge forgetting. In this paper, we introduce LoRAMoE to
+address the above challenge. The LoRAMoE is a plugin version of Mixture of
+Experts (MoE). The plugin form ensures the integrity of world knowledge by
+freezing the backbone model during the training phase. We then propose the use
+of localized balancing constraints to coordinate parts of experts for task
+utilization, meanwhile enabling other experts to fully leverage the world
+knowledge stored in the models. Experimental results demonstrate that LoRAMoE
+can reasonably coordinate experts based on data type during inference, and even
+dramatically increasing instruction data does not result in knowledge
+forgetting. Moreover, LoRAMoE provides additional benefits for the performance
+of downstream tasks, indicating the potential of our approach for multi-task
+learning.
+
+
+
+ comment: 17 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ YUAN 2.0: A Large Language Model with Localized Filtering-based
+ Attention
+
+
+ In this work, we develop and release Yuan 2.0, a series of large language
+models with parameters ranging from 2.1 billion to 102.6 billion. The Localized
+Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of
+local dependencies of natural language into Attention. A data filtering and
+generating system is presented to build pre-training and fine-tuning dataset in
+high quality. A distributed training method with non-uniform pipeline parallel,
+data parallel, and optimizer parallel is proposed, which greatly reduces the
+bandwidth requirements of intra-node communication, and achieves good
+performance in large-scale distributed training. Yuan 2.0 models display
+impressive ability in code generation, math problem-solving, and chatting
+compared with existing models. The latest version of YUAN 2.0, including model
+weights and source code, is accessible at Github.
+
+
+
+
+
+
+
+ ♻ ☆ AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using
+ Open-Source LLMs
+
+
+ Responding to the thousands of student questions on online QA platforms each
+semester has a considerable human cost, particularly in computing courses with
+rapidly growing enrollments. To address the challenges of scalable and
+intelligent question-answering (QA), we introduce an innovative solution that
+leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to
+ensure data privacy. Our approach combines augmentation techniques such as
+retrieval augmented generation (RAG), supervised fine-tuning (SFT), and
+learning from human preferences data using Direct Preference Optimization
+(DPO). Through extensive experimentation on a Piazza dataset from an
+introductory CS course, comprising 10,000 QA pairs and 1,500 pairs of
+preference data, we demonstrate a significant 30% improvement in the quality of
+answers, with RAG being a particularly impactful addition. Our contributions
+include the development of a novel architecture for educational QA, extensive
+evaluations of LLM performance utilizing both human assessments and LLM-based
+metrics, and insights into the challenges and future directions of educational
+data processing. This work paves the way for the development of AI-TA, an
+intelligent QA assistant customizable for courses with an online QA platform
+
+
+ Traditional evaluation metrics like ROUGE compare lexical overlap between the
+reference and generated summaries without taking argumentative structure into
+account, which is important for legal summaries. In this paper, we propose a
+novel legal summarization evaluation framework that utilizes GPT-4 to generate
+a set of question-answer pairs that cover main points and information in the
+reference summary. GPT-4 is then used to generate answers based on the
+generated summary for the questions from the reference summary. Finally, GPT-4
+grades the answers from the reference summary and the generated summary. We
+examined the correlation between GPT-4 grading with human grading. The results
+suggest that this question-answering approach with GPT-4 can be a useful tool
+for gauging the quality of the summary.
+
+
+
+
+
+
+
+ ♻ ☆ In-Context Exemplars as Clues to Retrieving from Large Associative
+ Memory ICML 2023
+
+
+ Recently, large language models (LLMs) have made remarkable progress in
+natural language processing. The most representative ability of LLMs is
+in-context learning (ICL), which enables LLMs to learn patterns from in-context
+exemplars without training. The performance of ICL greatly depends on the
+exemplars used. However, how to choose exemplars remains unclear due to the
+lack of understanding of how in-context learning works. In this paper, we
+present a novel perspective on ICL by conceptualizing it as contextual
+retrieval from a model of associative memory. We establish a theoretical
+framework of ICL based on Hopfield Networks. Based on our framework, we look
+into how in-context exemplars influence the performance of ICL and propose more
+efficient active exemplar selection. Our study sheds new light on the mechanism
+of ICL by connecting it to memory retrieval, with potential implications for
+advancing the understanding of LLMs.
+
+
+
+ comment: Presented at Neural Conversational AI @ ICML 2023 and Associative
+ Memory & Hopfield Networks @ NeurIPS 2023
+
+
+
+
+
+
+
+ Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad Abdul-Mageed
+
+
+ Traditional NER systems are typically trained to recognize coarse-grained
+entities, and less attention is given to classifying entities into a hierarchy
+of fine-grained lower-level subtypes. This article aims to advance Arabic NER
+with fine-grained entities. We chose to extend Wojood (an open-source Nested
+Arabic Named Entity Corpus) with subtypes. In particular, four main entity
+types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),
+and facility (FAC), are extended with 31 subtypes. To do this, we first revised
+Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's
+ACE guidelines, which yielded 5, 614 changes. Second, all mentions of GPE, LOC,
+ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE
+sub-types. We refer to this extended version of Wojood as WojoodF ine. To
+evaluate our annotations, we measured the inter-annotator agreement (IAA) using
+both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.
+To compute the baselines of WojoodF ine, we fine-tune three pre-trained Arabic
+BERT encoders in three settings: flat NER, nested NER and nested NER with
+subtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our
+corpus and models are open-source and available at
+https://sina.birzeit.edu/wojood/.
+
+
+
+
+
+
+
+ ♻ ☆ Camoscio: an Italian Instruction-tuned LLaMA
+
+
+ In recent years Large Language Models (LLMs) have increased the state of the
+art on several natural language processing tasks. However, their accessibility
+is often limited to paid API services, posing challenges for researchers in
+conducting extensive investigations. On the other hand, while some open-source
+models have been proposed by the community, they are typically English-centric
+or multilingual without a specific adaptation for the Italian language. In an
+effort to democratize the available and open resources for the Italian
+language, in this paper we introduce Camoscio: a language model specifically
+tuned to follow users' prompts in Italian. Specifically, we finetuned the
+smallest variant of LLaMA (7b) with LoRA on a corpus of instruction prompts
+translated to Italian via ChatGPT. Results indicate that the model's zero-shot
+performance on various downstream tasks in Italian competes favorably with
+existing models specifically finetuned for those tasks. All the artifacts
+(code, dataset, model) are released to the community at the following url:
+https://github.com/teelinsan/camoscio
+
+
+
+ comment: Published at CLiC-it 2023
+
+
+
+
+
+
+ ♻ ☆ On the Unexpected Abilities of Large Language Models
+
+
+ Large Language Models (LLMs) are capable of displaying a wide range of
+abilities that are not directly connected with the task for which they are
+trained: predicting the next words of human-written texts. In this article, I
+review recent research investigating the cognitive abilities developed by LLMs
+and their relation to human cognition. I discuss the nature of the indirect
+process that leads to the acquisition of these cognitive abilities, their
+relation to other indirect processes, and the implications for the acquisition
+of integrated abilities. Moreover, I propose the factors that enable the
+development of abilities that are related only very indirectly to the proximal
+objective of the training task. Finally, I discuss whether the full set of
+capabilities that LLMs could possibly develop is predictable.
+
+
+
+ comment: 13 pages
+
+
+
+
+
+
+ ♻ ☆ When Do Program-of-Thoughts Work for Reasoning? AAAI 2024
+
+
+ In the realm of embodied artificial intelligence, the reasoning capabilities
+of Large Language Models (LLMs) play a pivotal role. Although there are
+effective methods like program-of-thought prompting for LLMs which uses
+programming language to tackle complex reasoning tasks, the specific impact of
+code data on the improvement of reasoning capabilities remains under-explored.
+To address this gap, we propose complexity-impacted reasoning score (CIRS),
+which combines structural and logical attributes, to measure the correlation
+between code and reasoning abilities. Specifically, we use the abstract syntax
+tree to encode the structural information and calculate logical complexity by
+considering the difficulty and the cyclomatic complexity. Through an empirical
+analysis, we find not all code data of complexity can be learned or understood
+by LLMs. Optimal level of complexity is critical to the improvement of
+reasoning abilities by program-aided prompting. Then we design an
+auto-synthesizing and stratifying algorithm, and apply it to instruction
+generation for mathematical reasoning and code data filtering for code
+generation tasks. Extensive results demonstrates the effectiveness of our
+proposed approach. Code will be integrated into the EasyInstruct framework at
+https://github.com/zjunlp/EasyInstruct.
+
+
+ Recently decades have witnessed the empirical success of framing Knowledge
+Graph (KG) embeddings via language models. However, language model-based KG
+embeddings are usually deployed as static artifacts, making them difficult to
+modify post-deployment without re-training after deployment. To address this
+issue, we propose a new task of editing language model-based KG embeddings in
+this paper. This task is designed to facilitate rapid, data-efficient updates
+to KG embeddings without compromising the performance of other aspects. We
+build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and
+evaluate several knowledge editing baselines demonstrating the limited ability
+of previous models to handle the proposed challenging task. We further propose
+a simple yet strong baseline dubbed KGEditor, which utilizes additional
+parametric layers of the hypernetwork to edit/add facts. Our comprehensive
+experimental results reveal that KGEditor excels in updating specific facts
+without impacting the overall performance, even when faced with limited
+training resources. Code and datasets are available in
+https://github.com/zjunlp/PromptKG/tree/main/deltaKG.
+
+
+
+ comment: AAAI 2024. The project website is
+ https://zjunlp.github.io/project/KGE_Editing/
+
+
+
+
+
+
+ ♻ ☆ Chinese Spelling Correction as Rephrasing Language Model AAAI'2024
+
+
+ This paper studies Chinese Spelling Correction (CSC), which aims to detect
+and correct the potential spelling errors in a given sentence. Current
+state-of-the-art methods regard CSC as a sequence tagging task and fine-tune
+BERT-based models on sentence pairs. However, we note a critical flaw in the
+process of tagging one character to another, that the correction is excessively
+conditioned on the error. This is opposite from human mindset, where
+individuals rephrase the complete sentence based on its semantics, rather than
+solely on the error patterns memorized before. Such a counter-intuitive
+learning process results in the bottleneck of generalizability and
+transferability of machine spelling correction. To address this, we propose
+Rephrasing Language Model (ReLM), where the model is trained to rephrase the
+entire sentence by infilling additional slots, instead of
+character-to-character tagging. This novel training paradigm achieves the new
+state-of-the-art results across fine-tuned and zero-shot CSC benchmarks,
+outperforming previous counterparts by a large margin. Our method also learns
+transferable language representation when CSC is jointly trained with other
+tasks.
+
+
+
+ comment: Accepted by AAAI'2024
+
+
+
+
+
+
+ ♻ ☆ PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models ICASSP 2024
+
+
+ Prompts have significantly improved the performance of pretrained Large
+Language Models (LLMs) on various downstream tasks recently, making them
+increasingly indispensable for a diverse range of LLM application scenarios.
+However, the backdoor vulnerability, a serious security threat that can
+maliciously alter the victim model's normal predictions, has not been
+sufficiently explored for prompt-based LLMs. In this paper, we present
+POISONPROMPT, a novel backdoor attack capable of successfully compromising both
+hard and soft prompt-based LLMs. We evaluate the effectiveness, fidelity, and
+robustness of POISONPROMPT through extensive experiments on three popular
+prompt methods, using six datasets and three widely used LLMs. Our findings
+highlight the potential security threats posed by backdoor attacks on
+prompt-based LLMs and emphasize the need for further research in this area.
+
+
+
+ comment: To Appear in IEEE ICASSP 2024, code is available at:
+ https://github.com/grasses/PoisonPrompt
+
+
+
+
+
+
+ ♻ ☆ High-Fidelity Speech Synthesis with Minimal Supervision: All Using
+ Diffusion Models ICASSP 2024
+
+
+ Text-to-speech (TTS) methods have shown promising results in voice cloning,
+but they require a large number of labeled text-speech pairs.
+Minimally-supervised speech synthesis decouples TTS by combining two types of
+discrete speech representations(semantic \& acoustic) and using two
+sequence-to-sequence tasks to enable training with minimal supervision.
+However, existing methods suffer from information redundancy and dimension
+explosion in semantic representation, and high-frequency waveform distortion in
+discrete acoustic representation. Autoregressive frameworks exhibit typical
+instability and uncontrollability issues. And non-autoregressive frameworks
+suffer from prosodic averaging caused by duration prediction models. To address
+these issues, we propose a minimally-supervised high-fidelity speech synthesis
+method, where all modules are constructed based on the diffusion models. The
+non-autoregressive framework enhances controllability, and the duration
+diffusion model enables diversified prosodic expression. Contrastive
+Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic
+representation to solve the problems of information redundancy and dimension
+explosion in existing semantic coding methods. Mel-spectrogram is used as the
+acoustic representation. Both semantic and acoustic representations are
+predicted by continuous variable regression tasks to solve the problem of
+high-frequency fine-grained waveform distortion. Experimental results show that
+our proposed method outperforms the baseline method. We provide audio samples
+on our website.
+
+
+
+ comment: Accepted by ICASSP 2024. arXiv admin note: substantial text overlap
+ with arXiv:2307.15484; text overlap with arXiv:2309.00424
+
+ For fine-grained generation and recognition tasks such as
+minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
+speech recognition (ASR), the intermediate representations extracted from
+speech should serve as a "bridge" between text and acoustic information,
+containing information from both modalities. The semantic content is
+emphasized, while the paralinguistic information such as speaker identity and
+acoustic details should be de-emphasized. However, existing methods for
+extracting fine-grained intermediate representations from speech suffer from
+issues of excessive redundancy and dimension explosion. Contrastive learning is
+a good method for modeling intermediate representations from two modalities.
+However, existing contrastive learning methods in the audio field focus on
+extracting global descriptive information for downstream audio classification
+tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
+issues, we propose a method named "Contrastive Token-Acoustic Pretraining
+(CTAP)", which uses two encoders to bring phoneme and speech into a joint
+multimodal space, learning how to connect phoneme and speech at the frame
+level. The CTAP model is trained on 210k speech and phoneme pairs, achieving
+minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a
+promising solution for fine-grained generation and recognition downstream tasks
+in speech processing. We provide a website with audio samples.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ Minimally-Supervised Speech Synthesis with Conditional Diffusion Model
+ and Language Model: A Comparative Study of Semantic Coding ICASSP 2024
+
+
+ Recently, there has been a growing interest in text-to-speech (TTS) methods
+that can be trained with minimal supervision by combining two types of discrete
+speech representations and using two sequence-to-sequence tasks to decouple
+TTS. However, existing methods suffer from three problems: the high
+dimensionality and waveform distortion of discrete speech representations, the
+prosodic averaging problem caused by the duration prediction model in
+non-autoregressive frameworks, and the information redundancy and dimension
+explosion problems of existing semantic encoding methods. To address these
+problems, three progressive methods are proposed. First, we propose
+Diff-LM-Speech, an autoregressive structure consisting of a language model and
+diffusion models, which models the semantic embedding into the mel-spectrogram
+based on a diffusion model to achieve higher audio quality. We also introduce a
+prompt encoder structure based on a variational autoencoder and a prosody
+bottleneck to improve prompt representation ability. Second, we propose
+Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion
+model-based modules that design a duration diffusion model to achieve diverse
+prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive
+structure consisting of three diffusion model-based modules that verify the
+non-necessity of existing semantic encoding models and achieve the best
+results. Experimental results show that our proposed methods outperform
+baseline methods. We provide a website with audio samples.
+
+
+
+ comment: Accepted by ICASSP 2024
+
+
+
+
+
+
+ ♻ ☆ VILAS: Exploring the Effects of Vision and Language Context in Automatic
+ Speech Recognition ICASSP 2024
+
+
+ Enhancing automatic speech recognition (ASR) performance by leveraging
+additional multimodal information has shown promising results in previous
+studies. However, most of these works have primarily focused on utilizing
+visual cues derived from human lip motions. In fact, context-dependent visual
+and linguistic cues can also benefit in many scenarios. In this paper, we first
+propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel
+multimodal ASR model based on the continuous integrate-and-fire (CIF)
+mechanism, which can integrate visual and textual context simultaneously or
+separately, to facilitate speech recognition. Next, we introduce an effective
+training strategy that improves performance in modal-incomplete test scenarios.
+Then, to explore the effects of integrating vision and language, we create
+VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese
+and English versions. Finally, empirical results are reported on the public
+Flickr8K and self-constructed VSDial datasets. We explore various cross-modal
+fusion schemes, analyze fine-grained crossmodal alignment on VSDial, and
+provide insights into the effects of integrating multimodal information on
+speech recognition.
+
+
+ Prompt tuning (PT), where a small amount of trainable soft (continuous)
+prompt vectors is affixed to the input of language models (LM), has shown
+promising results across various tasks and models for parameter-efficient
+fine-tuning (PEFT). PT stands out from other PEFT approaches because it
+maintains competitive performance with fewer trainable parameters and does not
+drastically scale up its parameters as the model size expands. However, PT
+introduces additional soft prompt tokens, leading to longer input sequences,
+which significantly impacts training and inference time and memory usage due to
+the Transformer's quadratic complexity. Particularly concerning for Large
+Language Models (LLMs) that face heavy daily querying. To address this issue,
+we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt
+into a shorter soft prompt and a pair of low-rank matrices that are then
+optimised with two different learning rates. This allows DePT to achieve better
+performance while saving over 20% memory and time costs compared to vanilla PT
+and its variants, without changing trainable parameter sizes. Through extensive
+experiments on 23 natural language processing (NLP) and vision-language (VL)
+tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches,
+including the full fine-tuning baseline in some scenarios. Additionally, we
+empirically show that DEPT grows more efficient as the model size increases.
+Our further study reveals that DePT integrates seamlessly with
+parameter-efficient transfer learning in the few-shot learning setting and
+highlights its adaptability to various model architectures and sizes.
+
+
+
+ comment: Code is available at https://github.com/ZhengxiangShi/DePT
+
+
+
+
+
+
+ ♻ ☆ NExT-Chat: An LMM for Chat, Detection and Segmentation
+
+
+
+
+
+
+
+
+ Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua
+
+
+ The development of large language models (LLMs) has greatly advanced the
+field of multimodal understanding, leading to the emergence of large multimodal
+models (LMMs). In order to enhance the level of visual comprehension, recent
+studies have equipped LMMs with region-level understanding capabilities by
+representing object bounding box coordinates as a series of text sequences
+(pix2seq). In this paper, we introduce a novel paradigm for object location
+modeling called pix2emb method, where we ask the LMM to output the location
+embeddings and then decode them with different decoders. This paradigm allows
+us to use different location formats (such as bounding boxes and masks) in
+multimodal conversations. Leveraging the proposed pix2emb method, we train an
+LMM named NExT-Chat and demonstrate its capability of handling multiple tasks
+like visual grounding, region captioning, and grounded reasoning. Comprehensive
+experiments show the effectiveness of our NExT-Chat on various tasks, e.g.,
+NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (68.9) vs. LISA
+(67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs.
+Kosmos-2 (62.3) on region caption task. The code and model are released at
+https://github.com/NExT-ChatV/NExT-Chat.
+
+
+ Data contamination in language model evaluation is increasingly prevalent as
+the popularity of large language models. It allows models to "cheat" via
+memorisation instead of displaying true capabilities. Therefore, contamination
+analysis has became an crucial part of reliable model evaluation to validate
+results. However, existing contamination analysis is usually conducted
+internally by LLM developers and often lacks transparency and completeness.
+This paper present an open source data contamination reports for the Llama
+series models. We analyse six popular multi-choice QA benchmarks and quantify
+their overlapping with the training set of Llama. Various levels of
+contamination ranging from 1\% to 8.7\% are found across benchmarks. Our
+comparison also reveals that Llama models can gain over 5\% higher accuracy on
+contaminated subsets versus clean subsets. Data and code are available at:
+https://github.com/liyucheng09/Contamination_Detector.
+
+
+
+
+
+
+
+ ♻ ☆ DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models
+ for Emotion Recognition in Conversations
+
+
+ Large language models (LLMs) and their variants have shown extraordinary
+efficacy across numerous downstream natural language processing (NLP) tasks,
+which has presented a new vision for the development of NLP. Despite their
+remarkable performance in natural language generating (NLG), LLMs lack a
+distinct focus on the emotion understanding domain. As a result, using LLMs for
+emotion recognition may lead to suboptimal and inadequate precision. Another
+limitation of LLMs is that they are typical trained without leveraging
+multi-modal information. To overcome these limitations, we propose DialogueLLM,
+a context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA
+models with 13,638 multi-modal (i.e., texts and videos) emotional dialogues.
+The visual information is considered as the supplementary knowledge to
+construct high-quality instructions. We offer a comprehensive evaluation of our
+proposed model on three benchmarking emotion recognition in conversations (ERC)
+datasets and compare the results against the SOTA baselines and other SOTA
+LLMs. Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB
+A100 GPU in 5 hours, facilitating reproducibility for other researchers.
+
+
+
+
+
+
+
+ ♻ ☆ Fly-Swat or Cannon? Cost-Effective Language Model Choice via
+ Meta-Modeling
+
+
+
+
+
+
+
+
+ Marija Šakota, Maxime Peyrard, Robert West
+
+
+ Generative language models (LMs) have become omnipresent across data science.
+For a wide variety of tasks, inputs can be phrased as natural language prompts
+for an LM, from whose output the solution can then be extracted. LM performance
+has consistently been increasing with model size - but so has the monetary cost
+of querying the ever larger models. Importantly, however, not all inputs are
+equally hard: some require larger LMs for obtaining a satisfactory solution,
+whereas for others smaller LMs suffice. Based on this fact, we design a
+framework for cost-effective language model choice, called "Fly-swat or cannon"
+(FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously
+assigns each input to an LM predicted to do well on the input according to a
+so-called meta-model, aiming to achieve high overall performance at low cost.
+The cost-performance tradeoff can be flexibly tuned by the user. Options
+include, among others, maximizing total expected performance (or the number of
+processed inputs) while staying within a given cost budget, or minimizing total
+cost while processing all inputs. We evaluate FORC on 14 datasets covering five
+natural language tasks, using four candidate LMs of vastly different size and
+cost. With FORC, we match the performance of the largest available LM while
+achieving a cost reduction of 63%. Via our publicly available library,
+researchers as well as practitioners can thus save large amounts of money
+without sacrificing performance.
+
+
+
+
+
+
+
+ ♻ ☆ Self Generated Wargame AI: Double Layer Agent Task Planning Based on
+ Large Language Model
+
+
+
+
+
+
+
+
+ Y. Sun, J. Zhao, C. Yu, W. Wang, X. Zhou
+
+
+ The large language models represented by ChatGPT have a disruptive impact on
+the field of artificial intelligence. But it mainly focuses on natural language
+processing, speech recognition, machine learning and natural language
+understanding. This paper innovatively applies the large language model to the
+field of intelligent decision-making, places the large language model in the
+decision-making center, and constructs an agent architecture with the large
+language model as the core. Based on this, it further proposes a two-layer
+agent task planning, issues and executes decision commands through the
+interaction of natural language, and carries out simulation verification
+through the wargame simulation environment. Through the game confrontation
+simulation experiment, it is found that the intelligent decision-making ability
+of the large language model is significantly stronger than the commonly used
+reinforcement learning AI and rule AI, and the intelligence, understandability
+and generalization are all better. And through experiments, it was found that
+the intelligence of the large language model is closely related to prompt. This
+work also extends the large language model from previous human-computer
+interaction to the field of intelligent decision-making, which has important
+reference value and significance for the development of intelligent
+decision-making.
+
+
+
+
+
+
+
+ ♻ ☆ How to Evaluate the Generalization of Detection? A Benchmark for
+ Comprehensive Open-Vocabulary Detection AAAI 2024
+
+
+ Object detection (OD) in computer vision has made significant progress in
+recent years, transitioning from closed-set labels to open-vocabulary detection
+(OVD) based on large-scale vision-language pre-training (VLP). However, current
+evaluation methods and datasets are limited to testing generalization over
+object types and referral expressions, which do not provide a systematic,
+fine-grained, and accurate benchmark of OVD models' abilities. In this paper,
+we propose a new benchmark named OVDEval, which includes 9 sub-tasks and
+introduces evaluations on commonsense knowledge, attribute understanding,
+position understanding, object relation comprehension, and more. The dataset is
+meticulously created to provide hard negatives that challenge models' true
+understanding of visual and linguistic input. Additionally, we identify a
+problem with the popular Average Precision (AP) metric when benchmarking models
+on these fine-grained label datasets and propose a new metric called
+Non-Maximum Suppression Average Precision (NMS-AP) to address this issue.
+Extensive experimental results show that existing top OVD models all fail on
+the new tasks except for simple object types, demonstrating the value of the
+proposed dataset in pinpointing the weakness of current OVD models and guiding
+future research. Furthermore, the proposed NMS-AP metric is verified by
+experiments to provide a much more truthful evaluation of OVD models, whereas
+traditional AP metrics yield deceptive results. Data is available at
+\url{https://github.com/om-ai-lab/OVDEval}
+
+
+ In recent years, the explosion of web videos makes text-video retrieval
+increasingly essential and popular for video filtering, recommendation, and
+search. Text-video retrieval aims to rank relevant text/video higher than
+irrelevant ones. The core of this task is to precisely measure the cross-modal
+similarity between texts and videos. Recently, contrastive learning methods
+have shown promising results for text-video retrieval, most of which focus on
+the construction of positive and negative pairs to learn text and video
+representations. Nevertheless, they do not pay enough attention to hard
+negative pairs and lack the ability to model different levels of semantic
+similarity. To address these two issues, this paper improves contrastive
+learning using two novel techniques. First, to exploit hard examples for robust
+discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module
+(DMAE) to mine hard negative pairs from textual and visual clues. By further
+introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively
+identify all these hard negatives and explicitly highlight their impacts in the
+training loss. Second, our work argues that triplet samples can better model
+fine-grained semantic similarity compared to pairwise samples. We thereby
+present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to
+construct partial order triplet samples by automatically generating
+fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL
+designs an adaptive token masking strategy with cross-modal interaction to
+model subtle semantic differences. Extensive experiments demonstrate that the
+proposed approach outperforms existing methods on four widely-used text-video
+retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
+
+
+
+ comment: Accepted by ACM MM 2023
+
+
+
+
+
+
+ ♻ ☆ Forbidden Facts: An Investigation of Competing Objectives in Llama-2 NeurIPS 2023
+
+
+
+
+
+
+
+
+ Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit
+
+
+ LLMs often face competing pressures (for example helpfulness vs.
+harmlessness). To understand how models resolve such conflicts, we study
+Llama-2-chat models on the forbidden fact task. Specifically, we instruct
+Llama-2 to truthfully complete a factual recall statement while forbidding it
+from saying the correct answer. This often makes the model give incorrect
+answers. We decompose Llama-2 into 1000+ components, and rank each one with
+respect to how useful it is for forbidding the correct answer. We find that in
+aggregate, around 35 components are enough to reliably implement the full
+suppression behavior. However, these components are fairly heterogeneous and
+many operate using faulty heuristics. We discover that one of these heuristics
+can be exploited via a manually designed adversarial attack which we call The
+California Attack. Our results highlight some roadblocks standing in the way of
+being able to successfully interpret advanced ML systems. Project website
+available at https://forbiddenfacts.github.io .
+
+
+
+ comment: Accepted to the ATTRIB and SoLaR workshops at NeurIPS 2023; (v2:
+ fixed typos)
+
+
+
+
+
+
+ ♻ ☆ ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric
+ Strategy for Diverse Generative Tasks
+
+
+
+
+
+
+
+
+ Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
+
+
+ This study examines 4-bit quantization methods like GPTQ in large language
+models (LLMs), highlighting GPTQ's overfitting and limited enhancement in
+Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we
+extend task scope to more generative categories such as code generation and
+abstractive summarization, in which we found that INT4 quantization can
+significantly underperform. However, simply shifting to higher precision
+formats like FP6 has been particularly challenging, thus overlooked, due to
+poor performance caused by the lack of sophisticated integration and system
+acceleration strategies on current AI hardware. Our results show that FP6, even
+with a coarse-grain quantization scheme, performs robustly across various
+algorithms and tasks, demonstrating its superiority in accuracy and
+versatility. Notably, with the FP6 quantization, \codestar-15B model performs
+comparably to its FP16 counterpart in code generation, and for smaller models
+like the 406M it closely matches their baselines in summarization. Neither can
+be achieved by INT4. To better accommodate various AI hardware and achieve the
+best system performance, we propose a novel 4+2 design for FP6 to achieve
+similar latency to the state-of-the-art INT4 fine-grain quantization. With our
+design, FP6 can become a promising solution to the current 4-bit quantization
+methods used in LLMs.
+
+
+
+
+
+
+
+ ♻ ☆ RTQ: Rethinking Video-language Understanding Based on Image-text Model ACM MM 2023
+
+
+ Recent advancements in video-language understanding have been established on
+the foundation of image-text models, resulting in promising outcomes due to the
+shared knowledge between images and videos. However, video-language
+understanding presents unique challenges due to the inclusion of highly complex
+semantic details, which result in information redundancy, temporal dependency,
+and scene complexity. Current techniques have only partially tackled these
+issues, and our quantitative analysis indicates that some of these methods are
+complementary. In light of this, we propose a novel framework called RTQ
+(Refine, Temporal model, and Query), which addresses these challenges
+simultaneously. The approach involves refining redundant information within
+frames, modeling temporal relations among frames, and querying task-specific
+information from the videos. Remarkably, our model demonstrates outstanding
+performance even in the absence of video-language pre-training, and the results
+are comparable with or superior to those achieved by state-of-the-art
+pre-training methods. Code is available at
+https://github.com/SCZwangxiao/RTQ-MM2023.
+
+
+
+ comment: Accepted by ACM MM 2023 as Oral representation
+
+
+
+
+
+
+ ♻ ☆ From Beginner to Expert: Modeling Medical Knowledge into General LLMs
+
+
+
+
+
+
+
+
+ Qiang Li, Xiaoyan Yang, Haowen Wang, Qin Wang, Lei Liu, Junjie Wang, Yang Zhang, Mingyuan Chu, Sen Hu, Yicheng Chen, Yue Shen, Cong Fan, Wangshu Zhang, Teng Xu, Jinjie Gu, Jing Zheng, Guannan Zhang Ant Group
+
+
+ Recently, large language model (LLM) based artificial intelligence (AI)
+systems have demonstrated remarkable capabilities in natural language
+understanding and generation. However, these models face a significant
+challenge when it comes to sensitive applications, such as reasoning over
+medical knowledge and answering medical questions in a physician-like manner.
+Prior studies attempted to overcome this challenge by increasing the model size
+(>100B) to learn more general medical knowledge, while there is still room for
+improvement in LLMs with smaller-scale model sizes (<100B). In this work, we
+start from a pre-trained general LLM model (AntGLM-10B) and fine-tune it from a
+medical beginner towards a medical expert (called AntGLM-Med-10B), which
+leverages a 3-stage optimization procedure, i.e., general medical knowledge
+injection, medical domain instruction tuning, and specific medical task
+adaptation. Our contributions are threefold: (1) We specifically investigate
+how to adapt a pre-trained general LLM in medical domain, especially for a
+specific medical task. (2) We collect and construct large-scale medical
+datasets for each stage of the optimization process. These datasets encompass
+various data types and tasks, such as question-answering, medical reasoning,
+multi-choice questions, and medical conversations. (3) Specifically for
+multi-choice questions in the medical domain, we propose a novel
+Verification-of-Choice approach for prompting engineering, which significantly
+enhances the reasoning ability of LLMs. Remarkably, by combining the above
+approaches, our AntGLM-Med-10B model can outperform the most of LLMs on
+PubMedQA, including both general and medical LLMs, even when these LLMs have
+larger model size.
+
+
+
+ comment: Developed by Ant Group for PubMedQA leaderboard
+
+
+
+
+
+
+ ♻ ☆ T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large
+ Language Model Signals for Science Question Answering AAAI 2024
+
+
+
+
+
+
+
+
+ Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, Heng Tao Shen
+
+
+ Large Language Models (LLMs) have recently demonstrated exceptional
+performance in various Natural Language Processing (NLP) tasks. They have also
+shown the ability to perform chain-of-thought (CoT) reasoning to solve complex
+problems. Recent studies have explored CoT reasoning in complex multimodal
+scenarios, such as the science question answering task, by fine-tuning
+multimodal models with high-quality human-annotated CoT rationales. However,
+collecting high-quality COT rationales is usually time-consuming and costly.
+Besides, the annotated rationales are hardly accurate due to the external
+essential information missed. To address these issues, we propose a novel
+method termed T-SciQ that aims at teaching science question answering with LLM
+signals. The T-SciQ approach generates high-quality CoT rationales as teaching
+signals and is advanced to train much smaller models to perform CoT reasoning
+in complex modalities. Additionally, we introduce a novel data mixing strategy
+to produce more effective teaching data samples for simple and complex science
+question answer problems. Extensive experimental results show that our T-SciQ
+method achieves a new state-of-the-art performance on the ScienceQA benchmark,
+with an accuracy of 96.18%. Moreover, our approach outperforms the most
+powerful fine-tuned baseline by 4.5%. The code is publicly available at
+https://github.com/T-SciQ/T-SciQ.
+
+
+
+ comment: AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual
+ Questions AAAI
+
+
+
+
+
+
+
+
+ Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
+
+
+ Vision Language Models (VLMs), which extend Large Language Models (LLM) by
+incorporating visual understanding capability, have demonstrated significant
+advancements in addressing open-ended visual question-answering (VQA) tasks.
+However, these models cannot accurately interpret images infused with text, a
+common occurrence in real-world scenarios. Standard procedures for extracting
+information from images often involve learning a fixed set of query embeddings.
+These embeddings are designed to encapsulate image contexts and are later used
+as soft prompt inputs in LLMs. Yet, this process is limited to the token count,
+potentially curtailing the recognition of scenes with text-rich context. To
+improve upon them, the present study introduces BLIVA: an augmented version of
+InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings
+from InstructBLIP and also directly projects encoded patch embeddings into the
+LLM, a technique inspired by LLaVA. This approach assists the model to capture
+intricate details potentially missed during the query decoding process.
+Empirical evidence demonstrates that our model, BLIVA, significantly enhances
+performance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA
+benchmark) and in undertaking general (not particularly text-rich) VQA
+benchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved
+17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME),
+comparing to our baseline InstructBLIP. BLIVA demonstrates significant
+capability in decoding real-world images, irrespective of text presence. To
+demonstrate the broad industry applications enabled by BLIVA, we evaluate the
+model using a new dataset comprising YouTube thumbnails paired with
+question-answer sets across 11 diverse categories. Our code and models are
+freely accessible at https://github.com/mlpc-ucsd/BLIVA.
+
+
+
+ comment: Accepted at AAAI Conference on Artificial Intelligence (AAAI-24)
+
+
+
+
+
+
+ ♻ ☆ All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
+ Scientific Information Extraction
+
+
+ Extracting key information from scientific papers has the potential to help
+researchers work more efficiently and accelerate the pace of scientific
+progress. Over the last few years, research on Scientific Information
+Extraction (SciIE) witnessed the release of several new systems and benchmarks.
+However, existing paper-focused datasets mostly focus only on specific parts of
+a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
+table-only), due to complex processing and expensive annotations. Moreover,
+core information can be present in either text or tables or across both. To
+close this gap in data availability and enable cross-modality IE, while
+alleviating labeling costs, we propose a semi-supervised pipeline for
+annotating entities in text, as well as entities and relations in tables, in an
+iterative procedure. Based on this pipeline, we release novel resources for the
+scientific community, including a high-quality benchmark, a large-scale corpus,
+and a semi-supervised annotation pipeline. We further report the performance of
+state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
+Lastly, we explore the potential capability of large language models such as
+ChatGPT for the current task. Our new dataset, results, and analysis validate
+the effectiveness and efficiency of our semi-supervised pipeline, and we
+discuss its remaining limitations.
+
+
+
+ comment: Work in progress; 17 pages, 6 figures, 11 tables
+
+
+
+
+
+
+ ♻ ☆ MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning
+ Benchmark for Expert AGI
+
+
+ We introduce MMMU: a new benchmark designed to evaluate multimodal models on
+massive multi-discipline tasks demanding college-level subject knowledge and
+deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal
+questions from college exams, quizzes, and textbooks, covering six core
+disciplines: Art & Design, Business, Science, Health & Medicine, Humanities &
+Social Science, and Tech & Engineering. These questions span 30 subjects and
+183 subfields, comprising 30 highly heterogeneous image types, such as charts,
+diagrams, maps, tables, music sheets, and chemical structures. Unlike existing
+benchmarks, MMMU focuses on advanced perception and reasoning with
+domain-specific knowledge, challenging models to perform tasks akin to those
+faced by experts. The evaluation of 14 open-source LMMs as well as the
+proprietary GPT-4V(ision) and Gemini highlights the substantial challenges
+posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve
+accuracies of 56% and 59% respectively, indicating significant room for
+improvement. We believe MMMU will stimulate the community to build
+next-generation multimodal foundation models towards expert artificial general
+intelligence.
+
+
+
+
+
+
+
+
+ Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
+
+
+ The recent surge in research interest in applying large language models
+(LLMs) to decision-making tasks has flourished by leveraging the extensive
+world knowledge embedded in LLMs. While there is a growing demand to tailor
+LLMs for custom decision-making tasks, finetuning them for specific tasks is
+resource-intensive and may diminish the model's generalization capabilities.
+Moreover, state-of-the-art language models like GPT-4 and Claude are primarily
+accessible through API calls, with their parametric weights remaining
+proprietary and unavailable to the public. This scenario emphasizes the growing
+need for new methodologies that allow learning from agent experiences without
+requiring parametric updates. To address these problems, we introduce the
+Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences
+and extracts knowledge using natural language from a collection of training
+tasks. At inference, the agent recalls its extracted insights and past
+experiences to make informed decisions. Our empirical results highlight the
+robust learning efficacy of the ExpeL agent, indicating a consistent
+enhancement in its performance as it accumulates experiences. We further
+explore the emerging capabilities and transfer learning potential of the ExpeL
+agent through qualitative observations and additional experiments.
+
+
+
+ comment: Accepted by the 38th Annual AAAI Conference on Artificial
+ Intelligence (AAAI-24)
+
+
+
+
+
+
+
+ Ruian He, Shili Zhou, Yuqi Sun, Ri Cheng, Weimin Tan, Bo Yan
+
+
+ With the rise of real-time rendering and the evolution of display devices,
+there is a growing demand for post-processing methods that offer
+high-resolution content in a high frame rate. Existing techniques often suffer
+from quality and latency issues due to the disjointed treatment of frame
+supersampling and extrapolation. In this paper, we recognize the shared context
+and mechanisms between frame supersampling and extrapolation, and present a
+novel framework, Space-time Supersampling (STSS). By integrating them into a
+unified framework, STSS can improve the overall quality with lower latency. To
+implement an efficient architecture, we treat the aliasing and warping holes
+unified as reshading regions and put forth two key components to compensate the
+regions, namely Random Reshading Masking (RRM) and Efficient Reshading Module
+(ERM). Extensive experiments demonstrate that our approach achieves superior
+visual fidelity compared to state-of-the-art (SOTA) methods. Notably, the
+performance is achieved within only 4ms, saving up to 75\% of time against the
+conventional two-stage pipeline that necessitates 17ms.
+
+
+
+ comment: Accepted to AAAI 2024
+
+
+
+
+
+
+ ☆ Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial
+ Animation AAAI-24
+
+
+ Speech-driven 3D facial animation aims to synthesize vivid facial animations
+that accurately synchronize with speech and match the unique speaking style.
+However, existing works primarily focus on achieving precise lip
+synchronization while neglecting to model the subject-specific speaking style,
+often resulting in unrealistic facial animations. To the best of our knowledge,
+this work makes the first attempt to explore the coupled information between
+the speaking style and the semantic content in facial motions. Specifically, we
+introduce an innovative speaking style disentanglement method, which enables
+arbitrary-subject speaking style encoding and leads to a more realistic
+synthesis of speech-driven facial animations. Subsequently, we propose a novel
+framework called \textbf{Mimic} to learn disentangled representations of the
+speaking style and content from facial motions by building two latent spaces
+for style and content, respectively. Moreover, to facilitate disentangled
+representation learning, we introduce four well-designed constraints: an
+auxiliary style classifier, an auxiliary inverse classifier, a content
+contrastive loss, and a pair of latent cycle losses, which can effectively
+contribute to the construction of the identity-related style space and
+semantic-related content space. Extensive qualitative and quantitative
+experiments conducted on three publicly available datasets demonstrate that our
+approach outperforms state-of-the-art methods and is capable of capturing
+diverse speaking styles for speech-driven 3D facial animation. The source code
+and supplementary video are publicly available at:
+https://zeqing-wang.github.io/Mimic/
+
+
+
+ comment: 7 pages, 6 figures, accepted by AAAI-24
+
+
+
+
+
+
+ ☆ Country-Scale Cropland Mapping in Data-Scarce Settings Using Deep
+ Learning: A Case Study of Nigeria
+
+
+ Cropland maps are a core and critical component of remote-sensing-based
+agricultural monitoring, providing dense and up-to-date information about
+agricultural development. Machine learning is an effective tool for large-scale
+agricultural mapping, but relies on geo-referenced ground-truth data for model
+training and testing, which can be scarce or time-consuming to obtain. In this
+study, we explore the usefulness of combining a global cropland dataset and a
+hand-labeled dataset to train machine learning models for generating a new
+cropland map for Nigeria in 2020 at 10 m resolution. We provide the models with
+pixel-wise time series input data from remote sensing sources such as
+Sentinel-1 and 2, ERA5 climate data, and DEM data, in addition to binary labels
+indicating cropland presence. We manually labeled 1827 evenly distributed
+pixels across Nigeria, splitting them into 50\% training, 25\% validation, and
+25\% test sets used to fit the models and test our output map. We evaluate and
+compare the performance of single- and multi-headed Long Short-Term Memory
+(LSTM) neural network classifiers, a Random Forest classifier, and three
+existing 10 m resolution global land cover maps (Google's Dynamic World, ESRI's
+Land Cover, and ESA's WorldCover) on our proposed test set. Given the regional
+variations in cropland appearance, we additionally experimented with excluding
+or sub-setting the global crowd-sourced Geowiki cropland dataset, to
+empirically assess the trade-off between data quantity and data quality in
+terms of the similarity to the target data distribution of Nigeria. We find
+that the existing WorldCover map performs the best with an F1-score of 0.825
+and accuracy of 0.870 on the test set, followed by a single-headed LSTM model
+trained with our hand-labeled training samples and the Geowiki data points in
+Nigeria, with a F1-score of 0.814 and accuracy of 0.842.
+
+
+
+
+
+
+
+ ☆ The Right Losses for the Right Gains: Improving the Semantic Consistency
+ of Deep Text-to-Image Generation with Distribution-Sensitive Losses
+
+
+ One of the major challenges in training deep neural networks for
+text-to-image generation is the significant linguistic discrepancy between
+ground-truth captions of each image in most popular datasets. The large
+difference in the choice of words in such captions results in synthesizing
+images that are semantically dissimilar to each other and to their ground-truth
+counterparts. Moreover, existing models either fail to generate the
+fine-grained details of the image or require a huge number of parameters that
+renders them inefficient for text-to-image synthesis. To fill this gap in the
+literature, we propose using the contrastive learning approach with a novel
+combination of two loss functions: fake-to-fake loss to increase the semantic
+consistency between generated images of the same caption, and fake-to-real loss
+to reduce the gap between the distributions of real images and fake ones. We
+test this approach on two baseline models: SSAGAN and AttnGAN (with style
+blocks to enhance the fine-grained details of the images.) Results show that
+our approach improves the qualitative results on AttnGAN with style blocks on
+the CUB dataset. Additionally, on the challenging COCO dataset, our approach
+achieves competitive results against the state-of-the-art Lafite model,
+outperforms the FID score of SSAGAN model by 44.
+
+
+
+
+
+
+
+ ☆ Ultrasound Image Enhancement using CycleGAN and Perceptual Loss
+
+
+ Purpose: The objective of this work is to introduce an advanced framework
+designed to enhance ultrasound images, especially those captured by portable
+hand-held devices, which often produce lower quality images due to hardware
+constraints. Additionally, this framework is uniquely capable of effectively
+handling non-registered input ultrasound image pairs, addressing a common
+challenge in medical imaging. Materials and Methods: In this retrospective
+study, we utilized an enhanced generative adversarial network (CycleGAN) model
+for ultrasound image enhancement across five organ systems. Perceptual loss,
+derived from deep features of pretrained neural networks, is applied to ensure
+the human-perceptual quality of the enhanced images. These images are compared
+with paired images acquired from high resolution devices to demonstrate the
+model's ability to generate realistic high-quality images across organ systems.
+Results: Preliminary validation of the framework reveals promising performance
+metrics. The model generates images that result in a Structural Similarity
+Index (SSI) score of 0.722, Locally Normalized Cross-Correlation (LNCC) score
+of 0.902 and 28.802 for the Peak Signal-to-Noise Ratio (PSNR) metric.
+Conclusion: This work presents a significant advancement in medical imaging
+through the development of a CycleGAN model enhanced with Perceptual Loss (PL),
+effectively bridging the quality gap between ultrasound images from varied
+devices. By training on paired images, the model not only improves image
+quality but also ensures the preservation of vital anatomic structural content.
+This approach may improve equity in access to healthcare by enhancing portable
+device capabilities, although further validation and optimizations are
+necessary for broader clinical application.
+
+
+ Demand for efficient onboard object detection is increasing due to its key
+role in autonomous navigation. However, deploying object detection models such
+as YOLO on resource constrained edge devices is challenging due to the high
+computational requirements of such models. In this paper, an compressed object
+detection model named Squeezed Edge YOLO is examined. This model is compressed
+and optimized to kilobytes of parameters in order to fit onboard such edge
+devices. To evaluate Squeezed Edge YOLO, two use cases - human and shape
+detection - are used to show the model accuracy and performance. Moreover, the
+model is deployed onboard a GAP8 processor with 8 RISC-V cores and an NVIDIA
+Jetson Nano with 4GB of memory. Experimental results show Squeezed Edge YOLO
+model size is optimized by a factor of 8x which leads to 76% improvements in
+energy efficiency and 3.3x faster throughout.
+
+
+
+ comment: ML with New Compute Paradigms (MLNCP) Workshop at NeurIPS 2023
+
+
+
+
+
+
+ ☆ Unified framework for diffusion generative models in SO(3): applications
+ in computer vision and astrophysics AAAI-2024
+
+
+ Diffusion-based generative models represent the current state-of-the-art for
+image generation. However, standard diffusion models are based on Euclidean
+geometry and do not translate directly to manifold-valued data. In this work,
+we develop extensions of both score-based generative models (SGMs) and
+Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D
+rotations, SO(3). SO(3) is of particular interest in many disciplines such as
+robotics, biochemistry and astronomy/cosmology science. Contrary to more
+general Riemannian manifolds, SO(3) admits a tractable solution to heat
+diffusion, and allows us to implement efficient training of diffusion models.
+We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and
+demonstrate state-of-the-art results. Additionally, we demonstrate the
+practicality of our model on pose estimation tasks and in predicting correlated
+galaxy orientations for astrophysics/cosmology.
+
+
+
+ comment: Accepted at AAAI-2024 Main Track
+
+
+
+
+
+
+ ☆ HAAR: Text-Conditioned Generative Model of 3D Strand-based Human
+ Hairstyles
+
+
+
+
+
+
+
+
+ Vanessa Sklyarova, Egor Zakharov, Otmar Hilliges, Michael J. Black, Justus Thies
+
+
+ We present HAAR, a new strand-based generative model for 3D human hairstyles.
+Specifically, based on textual inputs, HAAR produces 3D hairstyles that could
+be used as production-level assets in modern computer graphics engines. Current
+AI-based generative models take advantage of powerful 2D priors to reconstruct
+3D content in the form of point clouds, meshes, or volumetric functions.
+However, by using the 2D priors, they are intrinsically limited to only
+recovering the visual parts. Highly occluded hair structures can not be
+reconstructed with those methods, and they only model the ''outer shell'',
+which is not ready to be used in physics-based rendering or simulation
+pipelines. In contrast, we propose a first text-guided generative method that
+uses 3D hair strands as an underlying representation. Leveraging 2D visual
+question-answering (VQA) systems, we automatically annotate synthetic hair
+models that are generated from a small set of artist-created hairstyles. This
+allows us to train a latent diffusion model that operates in a common hairstyle
+UV space. In qualitative and quantitative studies, we demonstrate the
+capabilities of the proposed model and compare it to existing hairstyle
+generation approaches.
+
+
+
+ comment: For more results please refer to the project page
+ https://haar.is.tue.mpg.de/
+
+ The goal of this paper is to discover, segment, and track independently
+moving objects in complex visual scenes. Previous approaches have explored the
+use of optical flow for motion segmentation, leading to imperfect predictions
+due to partial motion, background distraction, and object articulations and
+interactions. To address this issue, we introduce an appearance-based
+refinement method that leverages temporal consistency in video streams to
+correct inaccurate flow-based proposals. Our approach involves a simple
+selection mechanism that identifies accurate flow-predicted masks as exemplars,
+and an object-centric architecture that refines problematic masks based on
+exemplar information. The model is pre-trained on synthetic data and then
+adapted to real-world videos in a self-supervised manner, eliminating the need
+for human annotations. Its performance is evaluated on multiple video
+segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59.
+We achieve competitive performance on single-object segmentation, while
+significantly outperforming existing models on the more challenging problem of
+multi-object segmentation. Finally, we investigate the benefits of using our
+model as a prompt for a per-frame Segment Anything Model.
+
+
+
+ comment: Total 26 pages, 13 figures (including main text: 9 pages, 5 figures)
+
+
+
+
+
+
+ ☆ GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
+
+
+
+
+
+
+
+
+ Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal
+
+
+ Gaussian splatting has emerged as a powerful 3D representation that harnesses
+the advantages of both explicit (mesh) and implicit (NeRF) 3D representations.
+In this paper, we seek to leverage Gaussian splatting to generate realistic
+animatable avatars from textual descriptions, addressing the limitations (e.g.,
+flexibility and efficiency) imposed by mesh or NeRF-based representations.
+However, a naive application of Gaussian splatting cannot generate high-quality
+animatable avatars and suffers from learning instability; it also cannot
+capture fine avatar geometries and often leads to degenerate body parts. To
+tackle these problems, we first propose a primitive-based 3D Gaussian
+representation where Gaussians are defined inside pose-driven primitives to
+facilitate animation. Second, to stabilize and amortize the learning of
+millions of Gaussians, we propose to use neural implicit fields to predict the
+Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries
+and extract detailed meshes, we propose a novel SDF-based implicit mesh
+learning approach for 3D Gaussians that regularizes the underlying geometries
+and extracts highly detailed textured meshes. Our proposed method, GAvatar,
+enables the large-scale generation of diverse animatable avatars using only
+text prompts. GAvatar significantly surpasses existing methods in terms of both
+appearance and geometry quality, and achieves extremely fast rendering (100
+fps) at 1K resolution.
+
+
+
+
+
+
+
+ ☆ Hybrid Internal Model: A Simple and Efficient Learner for Agile Legged
+ Locomotion
+
+
+
+
+
+
+
+
+ Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang
+
+
+ Robust locomotion control depends on accurate state estimations. However, the
+sensors of most legged robots can only provide partial and noisy observations,
+making the estimation particularly challenging, especially for external states
+like terrain frictions and elevation maps. Inspired by the classical Internal
+Model Control principle, we consider these external states as disturbances and
+introduce Hybrid Internal Model (HIM) to estimate them according to the
+response of the robot. The response, which we refer to as the hybrid internal
+embedding, contains the robot's explicit velocity and implicit stability
+representation, corresponding to two primary goals for locomotion tasks:
+explicitly tracking velocity and implicitly maintaining stability. We use
+contrastive learning to optimize the embedding to be close to the robot's
+successor state, in which the response is naturally embedded. HIM has several
+appealing benefits: It only needs the robot's proprioceptions, i.e., those from
+joint encoders and IMU as observations. It innovatively maintains consistent
+observations between simulation reference and reality that avoids information
+loss in mimicking learning. It exploits batch-level information that is more
+robust to noises and keeps better sample efficiency. It only requires 1 hour of
+training on an RTX 4090 to enable a quadruped robot to traverse any terrain
+under any disturbances. A wealth of real-world experiments demonstrates its
+agility, even in high-difficulty tasks and cases never occurred during the
+training process, revealing remarkable open-world generalizability.
+
+
+
+ comment: Use 1 hour to train a quadruped robot capable of traversing any
+ terrain under any disturbances in the open world, Project Page:
+ https://github.com/OpenRobotLab/HIMLoco
+
+ This paper introduces a pioneering 3D volumetric encoder designed for
+text-to-3D generation. To scale up the training data for the diffusion model, a
+lightweight network is developed to efficiently acquire feature volumes from
+multi-view images. The 3D volumes are then trained on a diffusion model for
+text-to-3D generation using a 3D U-Net. This research further addresses the
+challenges of inaccurate object captions and high-dimensional feature volumes.
+The proposed model, trained on the public Objaverse dataset, demonstrates
+promising outcomes in producing diverse and recognizable samples from text
+prompts. Notably, it empowers finer control over object part characteristics
+through textual cues, fostering model creativity by seamlessly combining
+multiple concepts within a single object. This research significantly
+contributes to the progress of 3D generation by introducing an efficient,
+flexible, and scalable representation methodology. Code is available at
+https://github.com/tzco/VolumeDiffusion.
+
+
+
+
+
+
+
+
+ Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, Lei Xiao
+
+
+ We propose a method for dynamic scene reconstruction using deformable 3D
+Gaussians that is tailored for monocular video. Building upon the efficiency of
+Gaussian splatting, our approach extends the representation to accommodate
+dynamic elements via a deformable set of Gaussians residing in a canonical
+space, and a time-dependent deformation field defined by a multi-layer
+perceptron (MLP). Moreover, under the assumption that most natural scenes have
+large regions that remain static, we allow the MLP to focus its
+representational power by additionally including a static Gaussian point cloud.
+The concatenated dynamic and static point clouds form the input for the
+Gaussian Splatting rasterizer, enabling real-time rendering. The differentiable
+pipeline is optimized end-to-end with a self-supervised rendering loss. Our
+method achieves results that are comparable to state-of-the-art dynamic neural
+radiance field methods while allowing much faster optimization and rendering.
+Project website: https://lynl7130.github.io/gaufre/index.html
+
+
+
+ comment: 10 pages, 8 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ TMP: Temporal Motion Propagation for Online Video Super-Resolution
+
+
+
+
+
+
+
+
+ Zhengqiang Zhang, Ruihuang Li, Shi Guo, Yang Cao, Lei Zhang
+
+
+ Online video super-resolution (online-VSR) highly relies on an effective
+alignment module to aggregate temporal information, while the strict latency
+requirement makes accurate and efficient alignment very challenging. Though
+much progress has been achieved, most of the existing online-VSR methods
+estimate the motion fields of each frame separately to perform alignment, which
+is computationally redundant and ignores the fact that the motion fields of
+adjacent frames are correlated. In this work, we propose an efficient Temporal
+Motion Propagation (TMP) method, which leverages the continuity of motion field
+to achieve fast pixel-level alignment among consecutive frames. Specifically,
+we first propagate the offsets from previous frames to the current frame, and
+then refine them in the neighborhood, which significantly reduces the matching
+space and speeds up the offset estimation process. Furthermore, to enhance the
+robustness of alignment, we perform spatial-wise weighting on the warped
+features, where the positions with more precise offsets are assigned higher
+importance. Experiments on benchmark datasets demonstrate that the proposed TMP
+method achieves leading online-VSR accuracy as well as inference speed. The
+source code of TMP can be found at https://github.com/xtudbxk/TMP.
+
+
+ Compared to conventional semantic segmentation with pixel-level supervision,
+Weakly Supervised Semantic Segmentation (WSSS) with image-level labels poses
+the challenge that it always focuses on the most discriminative regions,
+resulting in a disparity between fully supervised conditions. A typical
+manifestation is the diminished precision on the object boundaries, leading to
+a deteriorated accuracy of WSSS. To alleviate this issue, we propose to
+adaptively partition the image content into deterministic regions (e.g.,
+confident foreground and background) and uncertain regions (e.g., object
+boundaries and misclassified categories) for separate processing. For uncertain
+cues, we employ an activation-based masking strategy and seek to recover the
+local information with self-distilled knowledge. We further assume that the
+unmasked confident regions should be robust enough to preserve the global
+semantics. Building upon this, we introduce a complementary self-enhancement
+method that constrains the semantic consistency between these confident regions
+and an augmented image with the same class labels. Extensive experiments
+conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed
+single-stage approach for WSSS not only outperforms state-of-the-art benchmarks
+remarkably but also surpasses multi-stage methodologies that trade complexity
+for accuracy. The code can be found at
+\url{https://github.com/Jessie459/feature-self-reinforcement}.
+
+
+
+ comment: Accepted by AAAI 2024
+
+
+
+
+
+
+ ♻ ☆ Towards Training-free Open-world Segmentation via Image Prompt
+ Foundation Models
+
+
+
+
+
+
+
+
+ Lv Tang, Peng-Tao Jiang, Hao-Ke Xiao, Bo Li
+
+
+ The realm of computer vision has witnessed a paradigm shift with the advent
+of foundational models, mirroring the transformative influence of large
+language models in the domain of natural language processing. This paper delves
+into the exploration of open-world segmentation, presenting a novel approach
+called Image Prompt Segmentation (IPSeg) that harnesses the power of vision
+foundational models. IPSeg lies the principle of a training-free paradigm,
+which capitalizes on image prompt techniques. Specifically, IPSeg utilizes a
+single image containing a subjective visual concept as a flexible prompt to
+query vision foundation models like DINOv2 and Stable Diffusion. Our approach
+extracts robust features for the prompt image and input image, then matches the
+input representations to the prompt representations via a novel feature
+interaction module to generate point prompts highlighting target objects in the
+input image. The generated point prompts are further utilized to guide the
+Segment Anything Model to segment the target object in the input image. The
+proposed method stands out by eliminating the need for exhaustive training
+sessions, thereby offering a more efficient and scalable solution. Experiments
+on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for
+flexible open-world segmentation using intuitive image prompts. This work
+pioneers tapping foundation models for open-world understanding through visual
+concepts conveyed in images.
+
+
+
+
+
+
+
+ ♻ ☆ From heavy rain removal to detail restoration: A faster and better
+ network
+
+
+ The profound accumulation of precipitation during intense rainfall events can
+markedly degrade the quality of images, leading to the erosion of textural
+details. Despite the improvements observed in existing learning-based methods
+specialized for heavy rain removal, it is discerned that a significant
+proportion of these methods tend to overlook the precise reconstruction of the
+intricate details. In this work, we introduce a simple dual-stage progressive
+enhancement network, denoted as DPENet, aiming to achieve effective deraining
+while preserving the structural accuracy of rain-free images. This approach
+comprises two key modules, a rain streaks removal network (R$^2$Net) focusing
+on accurate rain removal, and a details reconstruction network (DRNet) designed
+to recover the textural details of rain-free images. Firstly, we introduce a
+dilated dense residual block (DDRB) within R$^2$Net, enabling the aggregation
+of high-level and low-level features. Secondly, an enhanced residual pixel-wise
+attention block (ERPAB) is integrated into DRNet to facilitate the
+incorporation of contextual information. To further enhance the fidelity of our
+approach, we employ a comprehensive loss function that accentuates both the
+marginal and regional accuracy of rain-free images. Extensive experiments
+conducted on publicly available benchmarks demonstrates the noteworthy
+efficiency and effectiveness of our proposed DPENet. The source code and
+pre-trained models are currently available at
+\url{https://github.com/chdwyb/DPENet}.
+
+
+
+ comment: Accepted by Pattern Recognition
+
+
+
+
+
+
+ ♻ ☆ NoisyNN: Exploring the Influence of Information Entropy Change in
+ Learning Systems
+
+
+
+
+
+
+
+
+ Xiaowei Yu, Yao Xue, Lu Zhang, Li Wang, Tianming Liu, Dajiang Zhu
+
+
+ We explore the impact of entropy change in deep learning systems via noise
+injection at different levels, i.e., the latent space and input image. The
+series of models that employ our methodology are collectively known as Noisy
+Neural Networks (NoisyNN), with examples such as NoisyViT and NoisyCNN. Noise
+is conventionally viewed as a harmful perturbation in various deep learning
+architectures, such as convolutional neural networks (CNNs) and vision
+transformers (ViTs), as well as different learning tasks like image
+classification and transfer learning. However, this work shows noise can be an
+effective way to change the entropy of the learning system. We demonstrate that
+specific noise can boost the performance of various deep architectures under
+certain conditions. We theoretically prove the enhancement gained from positive
+noise by reducing the task complexity defined by information entropy and
+experimentally show the significant performance gain in large image datasets,
+such as the ImageNet. Herein, we use the information entropy to define the
+complexity of the task. We categorize the noise into two types, positive noise
+(PN) and harmful noise (HN), based on whether the noise can help reduce the
+complexity of the task. Extensive experiments of CNNs and ViTs have shown
+performance improvements by proactively injecting positive noise, where we
+achieved an unprecedented top 1 accuracy of over 95$\%$ on ImageNet. Both
+theoretical analysis and empirical evidence have confirmed that the presence of
+positive noise, can benefit the learning process, while the traditionally
+perceived harmful noise indeed impairs deep learning models. The different
+roles of noise offer new explanations for deep models on specific tasks and
+provide a new paradigm for improving model performance. Moreover, it reminds us
+that we can influence the performance of learning systems via information
+entropy change.
+
+
+
+ comment: Information Entropy, NoisyNN, ViT, CNN
+
+
+
+
+
+
+ ♻ ☆ Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt
+ for Segmenting Camouflaged Objects AAAI2024
+
+
+ Camouflaged object detection (COD) approaches heavily rely on pixel-level
+annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse
+annotations like scribbles or points to reduce annotation effort, but this can
+lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable
+segmentation ability with sparse prompts like points. However, manual prompt is
+not always feasible, as it may not be accessible in real-world application.
+Additionally, it only provides localization information instead of semantic
+one, which can intrinsically cause ambiguity in interpreting the targets. In
+this work, we aim to eliminate the need for manual prompt. The key idea is to
+employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts
+using the semantic information given by a generic text prompt. To that end, we
+introduce a test-time adaptation per-instance mechanism called Generalizable
+SAM (GenSAM) to automatically enerate and optimize visual prompts the generic
+task prompt for WSCOD. In particular, CCTP maps a single generic text prompt
+onto image-specific consensus foreground and background heatmaps using
+vision-language models, acquiring reliable visual prompts. Moreover, to
+test-time adapt the visual prompts, we further propose Progressive Mask
+Generation (PMG) to iteratively reweight the input image, guiding the model to
+focus on the targets in a coarse-to-fine manner. Crucially, all network
+parameters are fixed, avoiding the need for additional training. Experiments
+demonstrate the superiority of GenSAM. Experiments on three benchmarks
+demonstrate that GenSAM outperforms point supervision approaches and achieves
+comparable results to scribble supervision ones, solely relying on general task
+descriptions as prompts. our codes is in: https://lwpyh.github.io/GenSAM/.
+
+
+
+ comment: Accepted by AAAI2024
+
+
+
+
+
+
+
+
+
+ Information Retrieval 23
+
+
+
+
+
+ ☆ A novel diffusion recommendation algorithm based on multi-scale cnn and
+ residual lstm
+
+
+ Sequential recommendation aims to infer user preferences from historical
+interaction sequences and predict the next item that users may be interested in
+the future. The current mainstream design approach is to represent items as
+fixed vectors, capturing the underlying relationships between items and user
+preferences based on the order of interactions. However, relying on a single
+fixed-item embedding may weaken the modeling capability of the system, and the
+global dynamics and local saliency exhibited by user preferences need to be
+distinguished. To address these issues, this paper proposes a novel diffusion
+recommendation algorithm based on multi-scale cnn and residual lstm (AREAL). We
+introduce diffusion models into the recommend system, representing items as
+probability distributions instead of fixed vectors. This approach enables
+adaptive reflection of multiple aspects of the items and generates item
+distributions in a denoising manner. We use multi-scale cnn and residual lstm
+methods to extract the local and global dependency features of user history
+interactions, and use attention mechanism to distinguish weights as the guide
+features of reverse diffusion recovery. The effectiveness of the proposed
+method is validated through experiments conducted on two real-world datasets.
+Specifically, AREAL obtains improvements over the best baselines by 2.63% and
+4.25% in terms of HR@20 and 5.05% and 3.94% in terms of NDCG@20 on all
+datasets.
+
+
+
+
+
+
+
+ ☆ On-Device Recommender Systems: A Tutorial on The New-Generation
+ Recommendation Paradigm
+
+
+
+
+
+
+
+
+ Hongzhi Yin, Tong Chen, Liang Qu, Bin Cui
+
+
+ Given the sheer volume of contemporary e-commerce applications, recommender
+systems (RSs) have gained significant attention in both academia and industry.
+However, traditional cloud-based RSs face inevitable challenges, such as
+resource-intensive computation, reliance on network access, and privacy
+breaches. In response, a new paradigm called on-device recommender systems
+(ODRSs) has emerged recently in various industries like Taobao, Google, and
+Kuaishou. ODRSs unleash the computational capacity of user devices with
+lightweight recommendation models tailored for resource-constrained
+environments, enabling real-time inference with users' local data. This
+tutorial aims to systematically introduce methodologies of ODRSs, including (1)
+an overview of existing research on ODRSs; (2) a comprehensive taxonomy of
+ODRSs, where the core technical content to be covered span across three major
+ODRS research directions, including on-device deployment and inference,
+on-device training, and privacy/security of ODRSs; (3) limitations and future
+directions of ODRSs. This tutorial expects to lay the foundation and spark new
+insights for follow-up research and applications concerning this new
+recommendation paradigm.
+
+
+
+ comment: Technical tutorial; to appear at The Web Conference 2024
+
+
+
+
+
+
+ ☆ Shaping Political Discourse using multi-source News Summarization
+
+
+ Multi-document summarization is the process of automatically generating a
+concise summary of multiple documents related to the same topic. This summary
+can help users quickly understand the key information from a large collection
+of documents. Multi-document summarization systems are more complex than
+single-document summarization systems due to the need to identify and combine
+information from multiple sources. In this paper, we have developed a machine
+learning model that generates a concise summary of a topic from multiple news
+documents. The model is designed to be unbiased by sampling its input equally
+from all the different aspects of the topic, even if the majority of the news
+sources lean one way.
+
+
+
+
+
+
+
+ ☆ NoMIRACL: Knowing When You Don't Know for Robust Multilingual
+ Retrieval-Augmented Generation
+
+
+
+
+
+
+
+
+ Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin
+
+
+ Retrieval-augmented generation (RAG) grounds large language model (LLM)
+output by leveraging external knowledge sources to reduce factual
+hallucinations. However, prior works lack a comprehensive evaluation of
+different language families, making it challenging to evaluate LLM robustness
+against errors in external retrieved knowledge. To overcome this, we establish
+NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across
+18 typologically diverse languages. NoMIRACL includes both a non-relevant and a
+relevant subset. Queries in the non-relevant subset contain passages manually
+judged as non-relevant or noisy, whereas queries in the relevant subset include
+at least a single judged relevant passage. We measure LLM robustness using two
+metrics: (i) hallucination rate, measuring model tendency to hallucinate an
+answer, when the answer is not present in passages in the non-relevant subset,
+and (ii) error rate, measuring model inaccuracy to recognize relevant passages
+in the relevant subset. We build a GPT-4 baseline which achieves a 33.2%
+hallucination rate on the non-relevant and a 14.9% error rate on the relevant
+subset on average. Our evaluation reveals that GPT-4 hallucinates frequently in
+high-resource languages, such as French or English. This work highlights an
+important avenue for future research to improve LLM robustness to learn how to
+better reject non-relevant information in RAG.
+
+
+
+
+
+
+
+ ☆ The Problem of Coherence in Natural Language Explanations of
+ Recommendations ECAI 2023
+
+
+
+
+
+
+
+
+ Jakub Raczyński, Mateusz Lango, Jerzy Stefanowski
+
+
+ Providing natural language explanations for recommendations is particularly
+useful from the perspective of a non-expert user. Although several methods for
+providing such explanations have recently been proposed, we argue that an
+important aspect of explanation quality has been overlooked in their
+experimental evaluation. Specifically, the coherence between generated text and
+predicted rating, which is a necessary condition for an explanation to be
+useful, is not properly captured by currently used evaluation measures. In this
+paper, we highlight the issue of explanation and prediction coherence by 1)
+presenting results from a manual verification of explanations generated by one
+of the state-of-the-art approaches 2) proposing a method of automatic coherence
+evaluation 3) introducing a new transformer-based method that aims to produce
+more coherent explanations than the state-of-the-art approaches 4) performing
+an experimental evaluation which demonstrates that this method significantly
+improves the explanation coherence without affecting the other aspects of
+recommendation performance.
+
+
+
+ comment: ECAI 2023
+
+
+
+
+
+
+ ☆ DRDT: Dynamic Reflection with Divergent Thinking for LLM-based
+ Sequential Recommendation
+
+
+ The rise of Large Language Models (LLMs) has sparked interest in their
+application to sequential recommendation tasks as they can provide supportive
+item information. However, due to the inherent complexities of sequential
+recommendation, such as sequential patterns across datasets, noise within
+sequences, and the temporal evolution of user preferences, existing LLM
+reasoning strategies, such as in-context learning and chain-of-thought are not
+fully effective. To address these challenges, we introduce a novel reasoning
+principle: Dynamic Reflection with Divergent Thinking within a
+retriever-reranker framework. Our approach starts with a collaborative
+in-context demonstration retriever, which collects sequences exhibiting
+collaborative behaviors as in-context examples. Following this, we abstract
+high-level user preferences across multiple aspects, providing a more nuanced
+understanding of user interests and circumventing the noise within the raw
+sequences. The cornerstone of our methodology is dynamic reflection, a process
+that emulates human learning through probing, critiquing, and reflecting, using
+user feedback to tailor the analysis more effectively to the target user in a
+temporal manner. We evaluate our approach on three datasets using six
+pre-trained LLMs. The superior performance observed across these models
+demonstrates the efficacy of our reasoning strategy, notably achieved without
+the need to fine-tune the LLMs. With our principle, we managed to outperform
+GPT-Turbo-3.5 on three datasets using 7b models e.g., Vicuna-7b and Openchat-7b
+on NDCG@10. This research not only highlights the potential of LLMs in
+enhancing sequential recommendation systems but also underscores the importance
+of developing tailored reasoning strategies to fully harness their
+capabilities.
+
+
+
+
+
+
+
+ ☆ UniGen: A Unified Generative Framework for Retrieval and Question
+ Answering with Large Language Models
+
+
+ Generative information retrieval, encompassing two major tasks of Generative
+Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained
+significant attention in the area of information retrieval and natural language
+processing. Existing methods for GDR and GAR rely on separate retrieval and
+reader modules, which hinder simultaneous optimization. To overcome this, we
+present \textbf{UniGen}, a \textbf{Uni}fied \textbf{Gen}erative framework for
+retrieval and question answering that integrates both tasks into a single
+generative model leveraging the capabilities of large language models. UniGen
+employs a shared encoder and two distinct decoders for generative retrieval and
+question answering. To facilitate the learning of both tasks, we introduce
+connectors, generated by large language models, to bridge the gaps between
+query inputs and generation targets, as well as between document identifiers
+and answers. Furthermore, we propose an iterative enhancement strategy that
+leverages generated answers and retrieved documents to iteratively improve both
+tasks. Through extensive experiments on the MS MARCO and NQ datasets, we
+demonstrate the effectiveness of UniGen, showcasing its superior performance in
+both the retrieval and the question answering tasks.
+
+
+
+
+
+
+
+
+ Kangbo Liu, Yang Li, Yaoxin Wu, Zhaoxuan Wang, Xiaoxu Wang
+
+
+ Bundle recommendations strive to offer users a set of items as a package
+named bundle, enhancing convenience and contributing to the seller's revenue.
+While previous approaches have demonstrated notable performance, we argue that
+they may compromise the ternary relationship among users, items, and bundles.
+This compromise can result in information loss, ultimately impacting the
+overall model performance. To address this gap, we develop a unified model for
+bundle recommendation, termed hypergraph-enhanced dual convolutional neural
+network (HED). Our approach is characterized by two key aspects. Firstly, we
+construct a complete hypergraph to capture interaction dynamics among users,
+items, and bundles. Secondly, we incorporate U-B interaction information to
+enhance the information representation derived from users and bundle embedding
+vectors. Extensive experimental results on the Youshu and Netease datasets have
+demonstrated that HED surpasses state-of-the-art baselines, proving its
+effectiveness. In addition, various ablation studies and sensitivity analyses
+revealed the working mechanism and proved our effectiveness. Codes and datasets
+are available at https://github.com/AAI-Lab/HED
+
+
+
+
+
+
+
+ ☆ PARs: Predicate-based Association Rules for Efficient and Accurate
+ Model-Agnostic Anomaly Explanation
+
+
+ While new and effective methods for anomaly detection are frequently
+introduced, many studies prioritize the detection task without considering the
+need for explainability. Yet, in real-world applications, anomaly explanation,
+which aims to provide explanation of why specific data instances are identified
+as anomalies, is an equally important task. In this work, we present a novel
+approach for efficient and accurate model-agnostic anomaly explanation for
+tabular data using Predicate-based Association Rules (PARs). PARs can provide
+intuitive explanations not only about which features of the anomaly instance
+are abnormal, but also the reasons behind their abnormality. Our user study
+indicates that the anomaly explanation form of PARs is better comprehended and
+preferred by regular users of anomaly detection systems as compared to existing
+model-agnostic explanation options. Furthermore, we conduct extensive
+experiments on various benchmark datasets, demonstrating that PARs compare
+favorably to state-of-the-art model-agnostic methods in terms of computing
+efficiency and explanation accuracy on anomaly explanation tasks. The code for
+PARs tool is available at https://github.com/NSIBF/PARs-EXAD.
+
+
+
+
+
+
+
+ ☆ Knowledge Graphs and Pre-trained Language Models enhanced Representation
+ Learning for Conversational Recommender Systems
+
+
+
+
+
+
+
+
+ Zhangchi Qiu, Ye Tao, Shirui Pan, Alan Wee-Chung Liew
+
+
+ Conversational recommender systems (CRS) utilize natural language
+interactions and dialogue history to infer user preferences and provide
+accurate recommendations. Due to the limited conversation context and
+background knowledge, existing CRSs rely on external sources such as knowledge
+graphs to enrich the context and model entities based on their inter-relations.
+However, these methods ignore the rich intrinsic information within entities.
+To address this, we introduce the Knowledge-Enhanced Entity Representation
+Learning (KERL) framework, which leverages both the knowledge graph and a
+pre-trained language model to improve the semantic understanding of entities
+for CRS. In our KERL framework, entity textual descriptions are encoded via a
+pre-trained language model, while a knowledge graph helps reinforce the
+representation of these entities. We also employ positional encoding to
+effectively capture the temporal information of entities in a conversation. The
+enhanced entity representation is then used to develop a recommender component
+that fuses both entity and contextual representations for more informed
+recommendations, as well as a dialogue component that generates informative
+entity-related information in the response text. A high-quality knowledge graph
+with aligned entity descriptions is constructed to facilitate our study, namely
+the Wiki Movie Knowledge Graph (WikiMKG). The experimental results show that
+KERL achieves state-of-the-art results in both recommendation and response
+generation tasks.
+
+
+
+
+
+
+
+ ☆ LabelCraft: Empowering Short Video Recommendations with Automated Label
+ Crafting WSDM'24
+
+
+ Short video recommendations often face limitations due to the quality of user
+feedback, which may not accurately depict user interests. To tackle this
+challenge, a new task has emerged: generating more dependable labels from
+original feedback. Existing label generation methods rely on manual rules,
+demanding substantial human effort and potentially misaligning with the desired
+objectives of the platform. To transcend these constraints, we introduce
+LabelCraft, a novel automated label generation method explicitly optimizing
+pivotal operational metrics for platform success. By formulating label
+generation as a higher-level optimization problem above recommender model
+optimization, LabelCraft introduces a trainable labeling model for automatic
+label mechanism modeling. Through meta-learning techniques, LabelCraft
+effectively addresses the bi-level optimization hurdle posed by the recommender
+and labeling models, enabling the automatic acquisition of intricate label
+generation mechanisms.Extensive experiments on real-world datasets corroborate
+LabelCraft's excellence across varied operational metrics, encompassing usage
+time, user engagement, and retention. Codes are available at
+https://github.com/baiyimeng/LabelCraft.
+
+
+ In healthcare, artificial intelligence (AI) has been changing the way doctors
+and health experts take care of people. This paper will cover how AI is making
+major changes in the health care system, especially with nutrition. Various
+machine learning and deep learning algorithms have been developed to extract
+valuable information from healthcare data which help doctors, nutritionists,
+and health experts to make better decisions and make our lifestyle healthy.
+This paper provides an overview of the current state of AI applications in
+healthcare with a focus on the utilization of AI-driven recommender systems in
+nutrition. It will discuss the positive outcomes and challenges that arise when
+AI is used in this field. This paper addresses the challenges to develop AI
+recommender systems in healthcare, providing a well-rounded perspective on the
+complexities. Real-world examples and research findings are presented to
+underscore the tangible and significant impact AI recommender systems have in
+the field of healthcare, particularly in nutrition. The ongoing efforts of
+applying AI in nutrition lay the groundwork for a future where personalized
+recommendations play a pivotal role in guiding individuals toward healthier
+lifestyles.
+
+
+ Recent studies on pre-trained vision/language models have demonstrated the
+practical benefit of a new, promising solution-building paradigm in AI where
+models can be pre-trained on broad data describing a generic task space and
+then adapted successfully to solve a wide range of downstream tasks, even when
+training data is severely limited (e.g., in zero- or few-shot learning
+scenarios). Inspired by such progress, we investigate in this paper the
+possibilities and challenges of adapting such a paradigm to the context of
+recommender systems, which is less investigated from the perspective of
+pre-trained model. In particular, we propose to develop a generic recommender
+that captures universal interaction patterns by training on generic user-item
+interaction data extracted from different domains, which can then be fast
+adapted to improve few-shot learning performance in unseen new domains (with
+limited data).
+ However, unlike vision/language data which share strong conformity in the
+semantic space, universal patterns underlying recommendation data collected
+across different domains (e.g., different countries or different E-commerce
+platforms) are often occluded by both in-domain and cross-domain biases
+implicitly imposed by the cultural differences in their user and item bases, as
+well as their uses of different e-commerce platforms. As shown in our
+experiments, such heterogeneous biases in the data tend to hinder the
+effectiveness of the pre-trained model. To address this challenge, we further
+introduce and formalize a causal debiasing perspective, which is substantiated
+via a hierarchical Bayesian deep learning model, named PreRec. Our empirical
+studies on real-world data show that the proposed model could significantly
+improve the recommendation performance in zero- and few-shot learning settings
+under both cross-market and cross-platform scenarios.
+
+
+
+ comment: 8 pages, WSDM 24
+
+
+
+
+
+
+ ♻ ☆ Multi-Modality is All You Need for Transferable Recommender Systems ICDE'24
+
+
+ ID-based Recommender Systems (RecSys), where each item is assigned a unique
+identifier and subsequently converted into an embedding vector, have dominated
+the designing of RecSys. Though prevalent, such ID-based paradigm is not
+suitable for developing transferable RecSys and is also susceptible to the
+cold-start issue. In this paper, we unleash the boundaries of the ID-based
+paradigm and propose a Pure Multi-Modality based Recommender system (PMMRec),
+which relies solely on the multi-modal contents of the items (e.g., texts and
+images) and learns transition patterns general enough to transfer across
+domains and platforms. Specifically, we design a plug-and-play framework
+architecture consisting of multi-modal item encoders, a fusion module, and a
+user encoder. To align the cross-modal item representations, we propose a novel
+next-item enhanced cross-modal contrastive learning objective, which is
+equipped with both inter- and intra-modality negative samples and explicitly
+incorporates the transition patterns of user behaviors into the item encoders.
+To ensure the robustness of user representations, we propose a novel noised
+item detection objective and a robustness-aware contrastive learning objective,
+which work together to denoise user sequences in a self-supervised manner.
+PMMRec is designed to be loosely coupled, so after being pre-trained on the
+source data, each component can be transferred alone, or in conjunction with
+other components, allowing PMMRec to achieve versatility under both
+multi-modality and single-modality transfer learning settings. Extensive
+experiments on 4 sources and 10 target datasets demonstrate that PMMRec
+surpasses the state-of-the-art recommenders in both recommendation performance
+and transferability. Our code and dataset is available at:
+https://github.com/ICDE24/PMMRec.
+
+
+
+
+
+
+
+
+ Aleksandr V. Petrov, Craig Macdonald
+
+
+ Sequential Recommendation is a popular recommendation task that uses the
+order of user-item interaction to model evolving users' interests and
+sequential patterns in their behaviour. Current state-of-the-art
+Transformer-based models for sequential recommendation, such as BERT4Rec and
+SASRec, generate sequence embeddings and compute scores for catalogue items,
+but the increasing catalogue size makes training these models costly. The Joint
+Product Quantisation (JPQ) method, originally proposed for passage retrieval,
+markedly reduces the size of the retrieval index with minimal effect on model
+effectiveness, by replacing passage embeddings with a limited number of shared
+sub-embeddings. This paper introduces RecJPQ, a novel adaptation of JPQ for
+sequential recommendations, which takes the place of item embeddings tensor and
+replaces item embeddings with a concatenation of a limited number of shared
+sub-embeddings and, therefore, limits the number of learnable model parameters.
+The main idea of RecJPQ is to split items into sub-item entities before
+training the main recommendation model, which is inspired by splitting words
+into tokens and training tokenisers in language models. We apply RecJPQ to
+SASRec, BERT4Rec, and GRU4rec models on three large-scale sequential datasets.
+Our results showed that RecJPQ could notably reduce the model size (e.g., 48%
+reduction for the Gowalla dataset with no effectiveness degradation). RecJPQ
+can also improve model performance through a regularisation effect (e.g. +0.96%
+NDCG@10 improvement on the Booking.com dataset). Overall, RecJPQ allows the
+training of state-of-the-art transformer recommenders in industrial
+applications, where datasets with millions of items are common.
+
+
+
+ comment: Accepted by ACM WSDM 2024
+
+
+
+
+
+
+ ♻ ☆ AT4CTR: Auxiliary Match Tasks for Enhancing Click-Through Rate
+ Prediction
+
+
+ Click-through rate (CTR) prediction is a vital task in industrial
+recommendation systems. Most existing methods focus on the network architecture
+design of the CTR model for better accuracy and suffer from the data sparsity
+problem. Especially in industrial recommendation systems, the widely applied
+negative sample down-sampling technique due to resource limitation worsens the
+problem, resulting in a decline in performance. In this paper, we propose
+\textbf{A}uxiliary Match \textbf{T}asks for enhancing
+\textbf{C}lick-\textbf{T}hrough \textbf{R}ate prediction accuracy (AT4CTR) by
+alleviating the data sparsity problem. Specifically, we design two match tasks
+inspired by collaborative filtering to enhance the relevance modeling between
+user and item. As the "click" action is a strong signal which indicates the
+user's preference towards the item directly, we make the first match task aim
+at pulling closer the representation between the user and the item regarding
+the positive samples. Since the user's past click behaviors can also be treated
+as the user him/herself, we apply the next item prediction as the second match
+task. For both the match tasks, we choose the InfoNCE as their loss function.
+The two match tasks can provide meaningful training signals to speed up the
+model's convergence and alleviate the data sparsity. We conduct extensive
+experiments on one public dataset and one large-scale industrial recommendation
+dataset. The result demonstrates the effectiveness of the proposed auxiliary
+match tasks. AT4CTR has been deployed in the real industrial advertising system
+and has gained remarkable revenue.
+
+
+ To help merchants/customers to provide/access a variety of services through
+miniapps, online service platforms have occupied a critical position in the
+effective content delivery, in which how to recommend items in the new domain
+launched by the service provider for customers has become more urgent. However,
+the non-negligible gap between the source and diversified target domains poses
+a considerable challenge to cross-domain recommendation systems, which often
+leads to performance bottlenecks in industrial settings. While entity graphs
+have the potential to serve as a bridge between domains, rudimentary
+utilization still fail to distill useful knowledge and even induce the negative
+transfer issue. To this end, we propose PEACE, a Prototype lEarning Augmented
+transferable framework for Cross-domain rEcommendation. For domain gap
+bridging, PEACE is built upon a multi-interest and entity-oriented pre-training
+architecture which could not only benefit the learning of generalized knowledge
+in a multi-granularity manner, but also help leverage more structural
+information in the entity graph. Then, we bring the prototype learning into the
+pre-training over source domains, so that representations of users and items
+are greatly improved by the contrastive prototype learning module and the
+prototype enhanced attention mechanism for adaptive knowledge utilization. To
+ease the pressure of online serving, PEACE is carefully deployed in a
+lightweight manner, and significant performance improvements are observed in
+both online and offline environments.
+
+
+ Recently decades have witnessed the empirical success of framing Knowledge
+Graph (KG) embeddings via language models. However, language model-based KG
+embeddings are usually deployed as static artifacts, making them difficult to
+modify post-deployment without re-training after deployment. To address this
+issue, we propose a new task of editing language model-based KG embeddings in
+this paper. This task is designed to facilitate rapid, data-efficient updates
+to KG embeddings without compromising the performance of other aspects. We
+build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and
+evaluate several knowledge editing baselines demonstrating the limited ability
+of previous models to handle the proposed challenging task. We further propose
+a simple yet strong baseline dubbed KGEditor, which utilizes additional
+parametric layers of the hypernetwork to edit/add facts. Our comprehensive
+experimental results reveal that KGEditor excels in updating specific facts
+without impacting the overall performance, even when faced with limited
+training resources. Code and datasets are available in
+https://github.com/zjunlp/PromptKG/tree/main/deltaKG.
+
+
+
+ comment: AAAI 2024. The project website is
+ https://zjunlp.github.io/project/KGE_Editing/
+
+
+
+
+
+
+ ♻ ☆ Understanding or Manipulation: Rethinking Online Performance Gains of
+ Modern Recommender Systems
+
+
+ Recommender systems are expected to be assistants that help human users find
+relevant information automatically without explicit queries. As recommender
+systems evolve, increasingly sophisticated learning techniques are applied and
+have achieved better performance in terms of user engagement metrics such as
+clicks and browsing time. The increase in the measured performance, however,
+can have two possible attributions: a better understanding of user preferences,
+and a more proactive ability to utilize human bounded rationality to seduce
+user over-consumption. A natural following question is whether current
+recommendation algorithms are manipulating user preferences. If so, can we
+measure the manipulation level? In this paper, we present a general framework
+for benchmarking the degree of manipulations of recommendation algorithms, in
+both slate recommendation and sequential recommendation scenarios. The
+framework consists of four stages, initial preference calculation, training
+data collection, algorithm training and interaction, and metrics calculation
+that involves two proposed metrics. We benchmark some representative
+recommendation algorithms in both synthetic and real-world datasets under the
+proposed framework. We have observed that a high online click-through rate does
+not necessarily mean a better understanding of user initial preference, but
+ends in prompting users to choose more documents they initially did not favor.
+Moreover, we find that the training data have notable impacts on the
+manipulation degrees, and algorithms with more powerful modeling abilities are
+more sensitive to such impacts. The experiments also verified the usefulness of
+the proposed metrics for measuring the degree of manipulations. We advocate
+that future recommendation algorithm studies should be treated as an
+optimization problem with constrained user preference manipulations.
+
+
+
+ comment: 33 pages, 11 figures, 4 tables, ACM Transactions on Information
+ Systems
+
+
+
+
+
+
+ ♻ ☆ CTRL: Connect Collaborative and Language Model for CTR Prediction
+
+
+
+
+
+
+
+
+ Xiangyang Li, Bo Chen, Lu Hou, Ruiming Tang
+
+
+ Traditional click-through rate (CTR) prediction models convert the tabular
+data into one-hot vectors and leverage the collaborative relations among
+features for inferring the user's preference over items. This modeling paradigm
+discards essential semantic information. Though some works like P5 and CTR-BERT
+have explored the potential of using Pre-trained Language Models (PLMs) to
+extract semantic signals for CTR prediction, they are computationally expensive
+and suffer from low efficiency. Besides, the beneficial collaborative relations
+are not considered, hindering the recommendation performance. To solve these
+problems, in this paper, we propose a novel framework \textbf{CTRL}, which is
+industrial-friendly and model-agnostic with superior inference efficiency.
+Specifically, the original tabular data is first converted into textual data.
+Both tabular data and converted textual data are regarded as two different
+modalities and are separately fed into the collaborative CTR model and
+pre-trained language model. A cross-modal knowledge alignment procedure is
+performed to fine-grained align and integrate the collaborative and semantic
+signals, and the lightweight collaborative model can be deployed online for
+efficient serving after fine-tuned with supervised signals. Experimental
+results on three public datasets show that CTRL outperforms the
+state-of-the-art (SOTA) CTR models significantly. Moreover, we further verify
+its effectiveness on a large-scale industrial recommender system.
+
+
+ The feedback that users provide through their choices (e.g., clicks,
+purchases) is one of the most common types of data readily available for
+training search and recommendation algorithms. However, myopically training
+systems based on choice data may only improve short-term engagement, but not
+the long-term sustainability of the platform and the long-term benefits to its
+users, content providers, and other stakeholders. In this paper, we thus
+develop a new framework in which decision makers (e.g., platform operators,
+regulators, users) can express long-term goals for the behavior of the platform
+(e.g., fairness, revenue distribution, legal requirements). These goals take
+the form of exposure or impact targets that go well beyond individual sessions,
+and we provide new control-based algorithms to achieve these goals. In
+particular, the controllers are designed to achieve the stated long-term goals
+with minimum impact on short-term engagement. Beyond the principled theoretical
+derivation of the controllers, we evaluate the algorithms on both synthetic and
+real-world data. While all controllers perform well, we find that they provide
+interesting trade-offs in efficiency, robustness, and the ability to plan
+ahead.
+
+
+
+
+
+
+
+ ♻ ☆ NIR-Prompt: A Multi-task Generalized Neural Information Retrieval
+ Training Framework
+
+
+ Information retrieval aims to find information that meets users' needs from
+the corpus. Different needs correspond to different IR tasks such as document
+retrieval, open-domain question answering, retrieval-based dialogue, etc.,
+while they share the same schema to estimate the relationship between texts. It
+indicates that a good IR model can generalize to different tasks and domains.
+However, previous studies indicate that state-of-the-art neural information
+retrieval (NIR) models, e.g, pre-trained language models (PLMs) are hard to
+generalize. Mainly because the end-to-end fine-tuning paradigm makes the model
+overemphasize task-specific signals and domain biases but loses the ability to
+capture generalized essential signals. To address this problem, we propose a
+novel NIR training framework named NIR-Prompt for retrieval and reranking
+stages based on the idea of decoupling signal capturing and combination.
+NIR-Prompt exploits Essential Matching Module (EMM) to capture the essential
+matching signals and gets the description of tasks by Matching Description
+Module (MDM). The description is used as task-adaptation information to combine
+the essential matching signals to adapt to different tasks. Experiments under
+in-domain multi-task, out-of-domain multi-task, and new task adaptation
+settings show that NIR-Prompt can improve the generalization of PLMs in NIR for
+both retrieval and reranking stages compared with baselines.
+
+
+
+ comment: This article is the extension of arXiv:2204.02725 and accepted by
+ TOIS
+
+
+
+
+
+
+ ♻ ☆ Pitfalls in Link Prediction with Graph Neural Networks: Understanding
+ the Impact of Target-link Inclusion & Better Practices WSDM'24
+
+
+ While Graph Neural Networks (GNNs) are remarkably successful in a variety of
+high-impact applications, we demonstrate that, in link prediction, the common
+practices of including the edges being predicted in the graph at training
+and/or test have outsized impact on the performance of low-degree nodes. We
+theoretically and empirically investigate how these practices impact node-level
+performance across different degrees. Specifically, we explore three issues
+that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test
+leakage. The former two issues lead to poor generalizability to the test data,
+while the latter leads to overestimation of the model's performance and
+directly impacts the deployment of GNNs. To address these issues in a
+systematic way, we introduce an effective and efficient GNN training framework,
+SpotTarget, which leverages our insight on low-degree nodes: (1) at training
+time, it excludes a (training) edge to be predicted if it is incident to at
+least one low-degree node; and (2) at test time, it excludes all test edges to
+be predicted (thus, mimicking real scenarios of using GNNs, where the test data
+is not included in the graph). SpotTarget helps researchers and practitioners
+adhere to best practices for learning from graph data, which are frequently
+overlooked even by the most widely-used frameworks. Our experiments on various
+real-world datasets show that SpotTarget makes GNNs up to 15x more accurate in
+sparse graphs, and significantly improves their performance for low-degree
+nodes in dense graphs.
+
+
+
+ comment: Extended Version of our WSDM'24 paper. 8 pages, 2 page appendix
+
+
+
+
+
+
+
+ David Cole, Himanshu Sharma, Wei Wang
+
+
+ We propose a framework for applying reinforcement learning to contextual
+two-stage stochastic optimization and apply this framework to the problem of
+energy market bidding of an off-shore wind farm. Reinforcement learning could
+potentially be used to learn close to optimal solutions for first stage
+variables of a two-stage stochastic program under different contexts. Under the
+proposed framework, these solutions would be learned without having to solve
+the full two-stage stochastic program. We present initial results of training
+using the DDPG algorithm and present intended future steps to improve
+performance.
+
+
+
+
+
+
+
+ ☆ Development and Evaluation of Ensemble Learning-based Environmental
+ Methane Detection and Intensity Prediction Models
+
+
+
+
+
+
+
+
+ Reek Majumder, Jacquan Pollard, M Sabbir Salek, David Werth, Gurcan Comert, Adrian Gale, Sakib Mahmud Khan, Samuel Darko, Mashrur Chowdhury
+
+
+ The environmental impacts of global warming driven by methane (CH4) emissions
+have catalyzed significant research initiatives in developing novel
+technologies that enable proactive and rapid detection of CH4. Several
+data-driven machine learning (ML) models were tested to determine how well they
+identified fugitive CH4 and its related intensity in the affected areas.
+Various meteorological characteristics, including wind speed, temperature,
+pressure, relative humidity, water vapor, and heat flux, were included in the
+simulation. We used the ensemble learning method to determine the
+best-performing weighted ensemble ML models built upon several weaker
+lower-layer ML models to (i) detect the presence of CH4 as a classification
+problem and (ii) predict the intensity of CH4 as a regression problem.
+
+
+
+
+
+
+
+
+ Ahmad Chamma, Bertrand Thirion, Denis A. Engemann
+
+
+ Explaining the decision process of machine learning algorithms is nowadays
+crucial for both model's performance enhancement and human comprehension. This
+can be achieved by assessing the variable importance of single variables, even
+for high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While
+only removal-based approaches, such as Permutation Importance (PI), can bring
+statistical validity, they return misleading results when variables are
+correlated. Conditional Permutation Importance (CPI) bypasses PI's limitations
+in such cases. However, in high-dimensional settings, where high correlations
+between the variables cancel their conditional importance, the use of CPI as
+well as other methods leads to unreliable results, besides prohibitive
+computation costs. Grouping variables statistically via clustering or some
+prior knowledge gains some power back and leads to better interpretations. In
+this work, we introduce BCPI (Block-Based Conditional Permutation Importance),
+a new generic framework for variable importance computation with statistical
+guarantees handling both single and group cases. Furthermore, as handling
+groups with high cardinality (such as a set of observations of a given
+modality) are both time-consuming and resource-intensive, we also introduce a
+new stacking approach extending the DNN architecture with sub-linear layers
+adapted to the group structure. We show that the ensuing approach extended with
+stacking controls the type-I error even with highly-correlated groups and shows
+top accuracy across benchmarks. Furthermore, we perform a real-world data
+analysis in a large-scale medical dataset where we aim to show the consistency
+between our results and the literature for a biomarker prediction.
+
+
+
+
+
+
+
+ ☆ The Right Losses for the Right Gains: Improving the Semantic Consistency
+ of Deep Text-to-Image Generation with Distribution-Sensitive Losses
+
+
+ One of the major challenges in training deep neural networks for
+text-to-image generation is the significant linguistic discrepancy between
+ground-truth captions of each image in most popular datasets. The large
+difference in the choice of words in such captions results in synthesizing
+images that are semantically dissimilar to each other and to their ground-truth
+counterparts. Moreover, existing models either fail to generate the
+fine-grained details of the image or require a huge number of parameters that
+renders them inefficient for text-to-image synthesis. To fill this gap in the
+literature, we propose using the contrastive learning approach with a novel
+combination of two loss functions: fake-to-fake loss to increase the semantic
+consistency between generated images of the same caption, and fake-to-real loss
+to reduce the gap between the distributions of real images and fake ones. We
+test this approach on two baseline models: SSAGAN and AttnGAN (with style
+blocks to enhance the fine-grained details of the images.) Results show that
+our approach improves the qualitative results on AttnGAN with style blocks on
+the CUB dataset. Additionally, on the challenging COCO dataset, our approach
+achieves competitive results against the state-of-the-art Lafite model,
+outperforms the FID score of SSAGAN model by 44.
+
+
+
+
+
+
+
+ ♻ ☆ Perspectives on the State and Future of Deep Learning -- 2023
+
+
+
+
+
+
+
+
+ Micah Goldblum, Anima Anandkumar, Richard Baraniuk, Tom Goldstein, Kyunghyun Cho, Zachary C Lipton, Melanie Mitchell, Preetum Nakkiran, Max Welling, Andrew Gordon Wilson
+
+
+ The goal of this series is to chronicle opinions and issues in the field of
+machine learning as they stand today and as they change over time. The plan is
+to host this survey periodically until the AI singularity
+paperclip-frenzy-driven doomsday, keeping an updated list of topical questions
+and interviewing new community members for each edition. In this issue, we
+probed people's opinions on interpretable AI, the value of benchmarking in
+modern NLP, the state of progress towards understanding deep learning, and the
+future of academia.
+
+
+
+
+
+
+
+ ♻ ☆ Teaching Specific Scientific Knowledge into Large Language Models
+ through Additional Training
+
+
+
+
+
+
+
+
+ Kan Hatakeyama-Sato, Yasuhiko Igarashi, Shun Katakami, Yuta Nabae, Teruaki Hayakawa
+
+
+ Through additional training, we explore embedding specialized scientific
+knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that
+effective knowledge integration requires reading texts from multiple
+perspectives, especially in instructional formats. We utilize text augmentation
+to tackle the scarcity of specialized texts, including style conversions and
+translations. Hyperparameter optimization proves crucial, with different size
+models (7b, 13b, and 70b) reasonably undergoing additional training. Validating
+our methods, we construct a dataset of 65,000 scientific papers. Although we
+have succeeded in partially embedding knowledge, the study highlights the
+complexities and limitations of incorporating specialized information into
+LLMs, suggesting areas for further improvement.
+
+
+
+ comment: added token information for some texts, and fixed typo
+
+
+
+
+
+
+ ♻ ☆ Is Channel Independent strategy optimal for Time Series Forecasting?
+
+
+ There has been an emergence of various models for long-term time series
+forecasting. Recent studies have demonstrated that a single linear layer, using
+Channel Dependent (CD) or Channel Independent (CI) modeling, can even
+outperform a large number of sophisticated models. However, current research
+primarily considers CD and CI as two complementary yet mutually exclusive
+approaches, unable to harness these two extremes simultaneously. And it is also
+a challenging issue that both CD and CI are static strategies that cannot be
+determined to be optimal for a specific dataset without extensive experiments.
+In this paper, we reconsider whether the current CI strategy is the best
+solution for time series forecasting. First, we propose a simple yet effective
+strategy called CSC, which stands for $\mathbf{C}$hannel
+$\mathbf{S}$elf-$\mathbf{C}$lustering strategy, for linear models. Our Channel
+Self-Clustering (CSC) enhances CI strategy's performance improvements while
+reducing parameter size, for exmpale by over 10 times on electricity dataset,
+and significantly cutting training time. Second, we further propose Channel
+Rearrangement (CR), a method for deep models inspired by the self-clustering.
+CR attains competitive performance against baselines. Finally, we also discuss
+whether it is best to forecast the future values using the historical values of
+the same channel as inputs. We hope our findings and methods could inspire new
+solutions beyond CD/CI.
+
+
+ This paper proposes the use of causal modeling to detect and mitigate
+algorithmic bias that is nonlinear in the protected attribute. We provide a
+general overview of our approach. We use the German Credit data set, which is
+available for download from the UC Irvine Machine Learning Repository, to
+develop (1) a prediction model, which is treated as a black box, and (2) a
+causal model for bias mitigation. In this paper, we focus on age bias and the
+problem of binary classification. We show that the probability of getting
+correctly classified as "low risk" is lowest among young people. The
+probability increases with age nonlinearly. To incorporate the nonlinearity
+into the causal model, we introduce a higher order polynomial term. Based on
+the fitted causal model, the de-biased probability estimates are computed,
+showing improved fairness with little impact on overall classification
+accuracy. Causal modeling is intuitive and, hence, its use can enhance
+explicability and promotes trust among different stakeholders of AI.
+
+
+
+ comment: 5 pages, 3 figures, 12 tables. arXiv admin note: text overlap with
+ arXiv:2310.12421
+
+
+
+
+
+
+
+
+
+ Multimedia 8
+
+
+
+
+
+ ☆ Emotion Based Prediction in the Context of Optimized Trajectory Planning
+ for Immersive Learning
+
+
+ In the virtual elements of immersive learning, the use of Google Expedition
+and touch-screen-based emotion are examined. The objective is to investigate
+possible ways to combine these technologies to enhance virtual learning
+environments and learners emotional engagement. Pedagogical application,
+affordances, and cognitive load are the corresponding measures that are
+involved. Students will gain insight into the reason behind their significantly
+higher post-assessment Prediction Systems scores compared to preassessment
+scores through this work that leverages technology. This suggests that it is
+effective to include emotional elements in immersive learning scenarios. The
+results of this study may help develop new strategies by leveraging the
+features of immersive learning technology in educational technologies to
+improve virtual reality and augmented reality experiences. Furthermore, the
+effectiveness of immersive learning environments can be raised by utilizing
+magnetic, optical, or hybrid trackers that considerably improve object
+tracking.
+
+
+
+ comment: 5 pages, 5 figures
+
+
+
+
+
+
+ ☆ Frequency Spectrum is More Effective for Multimodal Representation and
+ Fusion: A Multimodal Spectrum Rumor Detector AAAI-2024
+
+
+ Multimodal content, such as mixing text with images, presents significant
+challenges to rumor detection in social media. Existing multimodal rumor
+detection has focused on mixing tokens among spatial and sequential locations
+for unimodal representation or fusing clues of rumor veracity across
+modalities. However, they suffer from less discriminative unimodal
+representation and are vulnerable to intricate location dependencies in the
+time-consuming fusion of spatial and sequential tokens. This work makes the
+first attempt at multimodal rumor detection in the frequency domain, which
+efficiently transforms spatial features into the frequency spectrum and obtains
+highly discriminative spectrum features for multimodal representation and
+fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU)
+with dual contrastive learning reveals the frequency spectrum is more effective
+for multimodal representation and fusion, extracting the informative components
+for rumor detection. FSRU involves three novel mechanisms: utilizing the
+Fourier transform to convert features in the spatial domain to the frequency
+domain, the unimodal spectrum compression, and the cross-modal spectrum
+co-selection module in the frequency domain. Substantial experiments show that
+FSRU achieves satisfactory multimodal rumor detection performance.
+
+
+ This paper presents a comprehensive solution to address the critical
+challenge of liquid leaks in the oil and gas industry, leveraging advanced
+computer vision and deep learning methodologies. Employing You Only Look Once
+(YOLO) and Real-Time Detection Transformer (RT DETR) models, our project
+focuses on enhancing early identification of liquid leaks in key infrastructure
+components such as pipelines, pumps, and tanks. Through the integration of
+surveillance thermal cameras and sensors, the combined YOLO and RT DETR models
+demonstrate remarkable efficacy in the continuous monitoring and analysis of
+visual data within oil and gas facilities. YOLO's real-time object detection
+capabilities swiftly recognize leaks and their patterns, while RT DETR excels
+in discerning specific leak-related features, particularly in thermal images.
+This approach significantly improves the accuracy and speed of leak detection,
+ultimately mitigating environmental and financial risks associated with liquid
+leaks.
+
+
+
+ comment: 13 pages, 9 figures
+
+
+
+
+
+
+ ☆ Leveraged Mel spectrograms using Harmonic and Percussive Components in
+ Speech Emotion Recognition
+
+
+ Speech Emotion Recognition (SER) affective technology enables the intelligent
+embedded devices to interact with sensitivity. Similarly, call centre employees
+recognise customers' emotions from their pitch, energy, and tone of voice so as
+to modify their speech for a high-quality interaction with customers. This work
+explores, for the first time, the effects of the harmonic and percussive
+components of Mel spectrograms in SER. We attempt to leverage the Mel
+spectrogram by decomposing distinguishable acoustic features for exploitation
+in our proposed architecture, which includes a novel feature map generator
+algorithm, a CNN-based network feature extractor and a multi-layer perceptron
+(MLP) classifier. This study specifically focuses on effective data
+augmentation techniques for building an enriched hybrid-based feature map. This
+process results in a function that outputs a 2D image so that it can be used as
+input data for a pre-trained CNN-VGG16 feature extractor. Furthermore, we also
+investigate other acoustic features such as MFCCs, chromagram, spectral
+contrast, and the tonnetz to assess our proposed framework. A test accuracy of
+92.79% on the Berlin EMO-DB database is achieved. Our result is higher than
+previous works using CNN-VGG16.
+
+
+ Emotion recognition (ER) from speech signals is a robust approach since it
+cannot be imitated like facial expression or text based sentiment analysis.
+Valuable information underlying the emotions are significant for human-computer
+interactions enabling intelligent machines to interact with sensitivity in the
+real world. Previous ER studies through speech signal processing have focused
+exclusively on associations between different signal mode decomposition methods
+and hidden informative features. However, improper decomposition parameter
+selections lead to informative signal component losses due to mode duplicating
+and mixing. In contrast, the current study proposes VGG-optiVMD, an empowered
+variational mode decomposition algorithm, to distinguish meaningful speech
+features and automatically select the number of decomposed modes and optimum
+balancing parameter for the data fidelity constraint by assessing their effects
+on the VGG16 flattening output layer. Various feature vectors were employed to
+train the VGG16 network on different databases and assess VGG-optiVMD
+reproducibility and reliability. One, two, and three-dimensional feature
+vectors were constructed by concatenating Mel-frequency cepstral coefficients,
+Chromagram, Mel spectrograms, Tonnetz diagrams, and spectral centroids. Results
+confirmed a synergistic relationship between the fine-tuning of the signal
+sample rate and decomposition parameters with classification accuracy,
+achieving state-of-the-art 96.09% accuracy in predicting seven emotions on the
+Berlin EMO-DB database.
+
+
+
+ comment: 12 pages
+
+
+
+
+
+
+ ♻ ☆ RTQ: Rethinking Video-language Understanding Based on Image-text Model ACM MM 2023
+
+
+ Recent advancements in video-language understanding have been established on
+the foundation of image-text models, resulting in promising outcomes due to the
+shared knowledge between images and videos. However, video-language
+understanding presents unique challenges due to the inclusion of highly complex
+semantic details, which result in information redundancy, temporal dependency,
+and scene complexity. Current techniques have only partially tackled these
+issues, and our quantitative analysis indicates that some of these methods are
+complementary. In light of this, we propose a novel framework called RTQ
+(Refine, Temporal model, and Query), which addresses these challenges
+simultaneously. The approach involves refining redundant information within
+frames, modeling temporal relations among frames, and querying task-specific
+information from the videos. Remarkably, our model demonstrates outstanding
+performance even in the absence of video-language pre-training, and the results
+are comparable with or superior to those achieved by state-of-the-art
+pre-training methods. Code is available at
+https://github.com/SCZwangxiao/RTQ-MM2023.
+
+
+
+ comment: Accepted by ACM MM 2023 as Oral representation
+
+
+
+
+
+
+ ♻ ☆ ControlLLM: Augment Language Models with Tools by Searching on Graphs
+
+
+ We present ControlLLM, a novel framework that enables large language models
+(LLMs) to utilize multi-modal tools for solving complex real-world tasks.
+Despite the remarkable performance of LLMs, they still struggle with tool
+invocation due to ambiguous user prompts, inaccurate tool selection and
+parameterization, and inefficient tool scheduling. To overcome these
+challenges, our framework comprises three key components: (1) a \textit{task
+decomposer} that breaks down a complex task into clear subtasks with
+well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG)
+paradigm} that searches the optimal solution path on a pre-built tool graph,
+which specifies the parameter and dependency relations among different tools;
+and (3) an \textit{execution engine with a rich toolbox} that interprets the
+solution path and runs the tools efficiently on different computational
+devices. We evaluate our framework on diverse tasks involving image, audio, and
+video processing, demonstrating its superior accuracy, efficiency, and
+versatility compared to existing methods. The code is at
+https://github.com/OpenGVLab/ControlLLM.
+
+
+ In recent years, the explosion of web videos makes text-video retrieval
+increasingly essential and popular for video filtering, recommendation, and
+search. Text-video retrieval aims to rank relevant text/video higher than
+irrelevant ones. The core of this task is to precisely measure the cross-modal
+similarity between texts and videos. Recently, contrastive learning methods
+have shown promising results for text-video retrieval, most of which focus on
+the construction of positive and negative pairs to learn text and video
+representations. Nevertheless, they do not pay enough attention to hard
+negative pairs and lack the ability to model different levels of semantic
+similarity. To address these two issues, this paper improves contrastive
+learning using two novel techniques. First, to exploit hard examples for robust
+discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module
+(DMAE) to mine hard negative pairs from textual and visual clues. By further
+introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively
+identify all these hard negatives and explicitly highlight their impacts in the
+training loss. Second, our work argues that triplet samples can better model
+fine-grained semantic similarity compared to pairwise samples. We thereby
+present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to
+construct partial order triplet samples by automatically generating
+fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL
+designs an adaptive token masking strategy with cross-modal interaction to
+model subtle semantic differences. Extensive experiments demonstrate that the
+proposed approach outperforms existing methods on four widely-used text-video
+retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
+
+